Calculating Mean Feature Conservation Score in BED Files

- 3 mins

This code will calculate the mean conservation score on the features in a BED file from the UCSC Genome Browser.

It requires Python and Pandas (pip install pandas).

This also requires that your feature file be a BED-like format - chromosome, start, end, and name positions must follow the BED specification, but additional columns can be appended and preserved. Data from the UCSC Conservation Track (downloaded as single base conservation) should also be converted to one BED-like file per chromosome, which can be done trivially by using the cat -n command in Terminal, adding the chromosome name (in the same style as the feature file) to each line with awk, and ensuring that all whitespace characters are tabs (/t).

A sample featureFile and conservationFile might look something like this:

Now, the code:

  from __future__ import division
  import pandas as pd

  FEATUREFILE = 'filename'
  CONSERVATIONFILEDIR = './conservation/'

  peakDF = pd.read_csv(str(FEATUREFILE), sep = '\t', header=None, names=['chrom','start','end','name','enrichmentVal'])
  #Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
  peakDF.drop(peakDF[peakDF.start <= 0].index, inplace=True)
  peakDF.drop('index', axis=1, inplace=True)
  peakDF['conservation'] = 1.0

  chromNames = peakDF.chrom.unique()

  for chromosome in chromNames:
    chromSubset = peakDF[peakDF.chrom == str(chromosome)]
    chromDF = pd.read_csv(str(CONSERVATIONFILEDIR) + str(chromosome)+'.bed', sep='\t', header=None, names=['chrom','start','end','conserveScore'])

    for i in xrange(0,len(chromSubset.index)):
	    x = chromDF[chromDF.start >= chromSubset['start'][chromSubset.index[i]]]
	    featureSubset = x[x.start < chromSubset['end'][chromSubset.index[i]]]
	    featureConservation = float(sum(featureSubset.conserveScore)/(chromSubset['end'][chromSubset.index[i]]-chromSubset['start'][chromSubset.index[i]]))
  peakDF.to_csv("", sep = '\t')

This will write a tab-delimited, BED-like file of calculated mean conservation values on each feature in featureFile. It should look something like this:



computational biologist

comments powered by Disqus
rss facebook twitter github youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora