The purpose of this assignment is to explore the 46-vertebrate multiple sequence alignment available on the UCSC Human Genome Browser.
One of the fascinating and mysterious discoveries of recent years is what are called "ultraconserved elements", described in a paper by Bejerano et al. These are long DNA sequences that are surprisingly well conserved, some of them all the way from humans to fishes, suggesting that they serve some biological function. Many of them are quite distant from the nearest genes, many of their functions remain a mystery, and no one really understands why these regions are so extremely conserved.
These ultraconserved elements were originally discovered by studying the vertebrate alignments on the UCSC genome browser. At the time of their discovery in 2004, only 6 or so vertebrate genomes were sequenced and available on that browser. What you will do is find similarly extremely conserved elements in the current version of the UCSC genome browser, looking for ones that are well conserved across most of the 46 vertebrate genomes now available. To measure conservation, you will use the phyloP vertebrate basewise conservation track. At a minimum, each of your reported extremely conserved elements should have length at least 100 (that is, consist of at least 100 consecutive nucleotide positions of the human genome) and average phyloP score at least 4.8 (when averaged over all the human positions of your conserved element).
The phyloP scores for the entire human genome are available from the instructional lab linux machines in the directory /projects/instr/11wi/cse427/phylop. In that directory you will find 24 files chr1.phyloP46way.wigFix, chr2.phyloP46way.wigFix, ..., chr22.phyloP46way.wigFix, chrX.phyloP46way.wigFix, and chrY.phyloP46way.wigFix, where the first part of each filename specifies the human chromosome that is aligned.
Each file consists of lines that look like this:
fixedStep chrom=chrY start=13491 step=1 0.985 0.985 -0.257 0.985Those numbers are the phyloP scores for the coordinates 13491, 13492, 13493, ... on human chromosome Y, one coordinate score per line. Treat any negative phyloP score as though it were 0.
You will find similar lines that start with the word "fixedStep" interspersed throughout each file: whenever you find such a line, it means that phyloP has skipped some human coordinates and is restarting at some later coordinate given by the "start=" portion of this line. (I believe that the skipped coordinates always correspond to human regions that are unaligned to any other animal in the MULTIZ alignment, but all you need to know is to be prepared to parse such "fixedStep" lines in the middle of files.) You do not have to worry about the possibility that one of your extremely conserved elements crosses such a region skipped by phyloP.
Your program will read through all 24 files, searching for extremely
conserved elements that meet the criteria given above: length at least
100 bp and average phyloP score at least 4.8. You will find an
efficient algorithm for this problem in Theorem 2 of
Algorithms for
Locating Extremely Conserved Elements in Multiple Sequence
Alignments. That paper came about as a result of previous
versions of this homework project. For our purposes, the proof of
Theorem 2 supports the following stronger statement (and I wish we had
worded Theorem 2 in this more general way):
Given q ∈
ℜn and c ∈ ℜ, there is an O(n) time
algorithm that maximizes j - i subject to the condition that the
average value of qi+1, qi+2, ..., qj
is at least c.
See an example of Theorem 2.
You will have to decide what to do if two extremely conserved elements overlap. It would be reasonable to report only the longer of them. But in any case, if one element is contained within another, or they overlap in all but a few columns, report only the longer of them.
For each extremely conserved element that you report, you will also report the 46-vertebrate MULTIZ alignment of that element, as a visual confirmation of its extreme conservation. The entire 46-vertebrate MULTIZ multiple sequence alignment is available from the instructional lab linux machines in the directory /projects/instr/11wi/cse427/multizprocessed . In that directory you will find 24 files chr1.maf, chr2.maf, ..., chr22.maf, chrX.maf, and chrY.maf, where the filename corresponds to the human chromosome that is aligned.
If you search for the first instance of "a score=" in one of these files, this is the first "alignment block". MULTIZ breaks up each human chromosome's alignment into many such alignment blocks. Each alignment block consists of one line that begins "s [species name]" for each sequence that is aligned; these lines form the multiple alignment of this block. Ignore lines beginning with "q" or "i" or "e" (or anything other than "a" or "s"). For example, here is a small block:
a score=1718.000000 s hg19.chr3 146452387 39 + 199501827 aGCACAGAAAGAAAGCCCTAAACTTAGGAGGTGAGCAGC-- s ponAbe2.chr3 147941226 39 + 202140232 agcacagaaagaaagccctaaacttaggaggagagcagc-- q ponAbe2.chr3 999999999999999999999999999999999999999-- i ponAbe2.chr3 C 0 I 4 s calJac1.Contig3328 163897 23 + 264433 ------------AAGATTT----CTAGGAGGAGAGCAGT-- q calJac1.Contig3328 ------------9999999----9999999999999999-- i calJac1.Contig3328 C 0 I 4 s equCab2.chr16 8422227 34 - 87365405 AGGGCTGCT---GAGGTGC----CTAGGAGTAGGGCATGAC q equCab2.chr16 999999999---9999999----999999999999999999 i equCab2.chr16 C 0 C 0 s canFam2.chr23 11580701 30 - 55389570 AGCACAGCC---AAGCTCC----CTAGG----GGGCACCAC q canFam2.chr23 999999999---9999999----99999----999999999 i canFam2.chr23 C 0 C 0The first number on the "s hg19" line (146452387 in the example above) is the human start coordinate for this alignment block. Beware: these coordinates are 1 less than the corresponding coordinates in the phyloP files. It is possible that some of your extremely conserved elements will straddle more than one MULTIZ alignment block.
Choosing some reasonable length interval such as 20 bp, create a histogram with a bar for each length interval (starting at 100 bp), with the height of that bar showing the number of extremely conserved elements found that have length in that interval. Create a second histogram with a bar for each human chromosome, with the height of that bar showing the number of extremely conserved elements found on that chromosome's alignment. Both of these charts can be created easily with a tool such as Excel, and will provide a good visual summary of your results. (For samples of such histograms, see the results from a previous year. Your histograms will differ, because those results were based on quite different criteria for extremely conserved elements.)
Your turn-in will consist of the following files, named as shown.
Please format each reported extremely conserved element in this file as follows:
Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/13675.