Homework #3: Extreme Interspecies Conservation

CSE 427: Computational Biology
February 1, 2011
due: Sunday, February 20, 10:00 p.m.
Last update: 2/17

The purpose of this assignment is to explore the 46-vertebrate multiple sequence alignment available on the UCSC Human Genome Browser.

One of the fascinating and mysterious discoveries of recent years is what are called "ultraconserved elements", described in a paper by Bejerano et al. These are long DNA sequences that are surprisingly well conserved, some of them all the way from humans to fishes, suggesting that they serve some biological function. Many of them are quite distant from the nearest genes, many of their functions remain a mystery, and no one really understands why these regions are so extremely conserved.

These ultraconserved elements were originally discovered by studying the vertebrate alignments on the UCSC genome browser. At the time of their discovery in 2004, only 6 or so vertebrate genomes were sequenced and available on that browser. What you will do is find similarly extremely conserved elements in the current version of the UCSC genome browser, looking for ones that are well conserved across most of the 46 vertebrate genomes now available. To measure conservation, you will use the phyloP vertebrate basewise conservation track. At a minimum, each of your reported extremely conserved elements should have length at least 100 (that is, consist of at least 100 consecutive nucleotide positions of the human genome) and average phyloP score at least 4.8 (when averaged over all the human positions of your conserved element).

The phyloP scores for the entire human genome are available from the instructional lab linux machines in the directory /projects/instr/11wi/cse427/phylop. In that directory you will find 24 files chr1.phyloP46way.wigFix, chr2.phyloP46way.wigFix, ..., chr22.phyloP46way.wigFix, chrX.phyloP46way.wigFix, and chrY.phyloP46way.wigFix, where the first part of each filename specifies the human chromosome that is aligned.

Each file consists of lines that look like this:

fixedStep chrom=chrY start=13491 step=1
0.985
0.985
-0.257
0.985

Those numbers are the phyloP scores for the coordinates 13491, 13492, 13493, ... on human chromosome Y, one coordinate score per line. Treat any negative phyloP score as though it were 0.

You will find similar lines that start with the word "fixedStep" interspersed throughout each file: whenever you find such a line, it means that phyloP has skipped some human coordinates and is restarting at some later coordinate given by the "start=" portion of this line. (I believe that the skipped coordinates always correspond to human regions that are unaligned to any other animal in the MULTIZ alignment, but all you need to know is to be prepared to parse such "fixedStep" lines in the middle of files.) You do not have to worry about the possibility that one of your extremely conserved elements crosses such a region skipped by phyloP.

Your program will read through all 24 files, searching for extremely conserved elements that meet the criteria given above: length at least 100 bp and average phyloP score at least 4.8. You will find an efficient algorithm for this problem in Theorem 2 of Algorithms for Locating Extremely Conserved Elements in Multiple Sequence Alignments. That paper came about as a result of previous versions of this homework project. For our purposes, the proof of Theorem 2 supports the following stronger statement (and I wish we had worded Theorem 2 in this more general way):
Given q ∈ ℜⁿ and c ∈ ℜ, there is an O(n) time algorithm that maximizes j - i subject to the condition that the average value of q_i+1, q_i+2, ..., q_j is at least c.
See an example of Theorem 2.

You will have to decide what to do if two extremely conserved elements overlap. It would be reasonable to report only the longer of them. But in any case, if one element is contained within another, or they overlap in all but a few columns, report only the longer of them.

For each extremely conserved element that you report, you will also report the 46-vertebrate MULTIZ alignment of that element, as a visual confirmation of its extreme conservation. The entire 46-vertebrate MULTIZ multiple sequence alignment is available from the instructional lab linux machines in the directory /projects/instr/11wi/cse427/multizprocessed . In that directory you will find 24 files chr1.maf, chr2.maf, ..., chr22.maf, chrX.maf, and chrY.maf, where the filename corresponds to the human chromosome that is aligned.

If you search for the first instance of "a score=" in one of these files, this is the first "alignment block". MULTIZ breaks up each human chromosome's alignment into many such alignment blocks. Each alignment block consists of one line that begins "s [species name]" for each sequence that is aligned; these lines form the multiple alignment of this block. Ignore lines beginning with "q" or "i" or "e" (or anything other than "a" or "s"). For example, here is a small block:

a score=1718.000000
s hg19.chr3                      146452387 39 + 199501827 aGCACAGAAAGAAAGCCCTAAACTTAGGAGGTGAGCAGC--
s ponAbe2.chr3                   147941226 39 + 202140232 agcacagaaagaaagccctaaacttaggaggagagcagc--
q ponAbe2.chr3                                            999999999999999999999999999999999999999--
i ponAbe2.chr3                   C 0 I 4
s calJac1.Contig3328                163897 23 +    264433 ------------AAGATTT----CTAGGAGGAGAGCAGT--
q calJac1.Contig3328                                      ------------9999999----9999999999999999--
i calJac1.Contig3328             C 0 I 4
s equCab2.chr16                    8422227 34 -  87365405 AGGGCTGCT---GAGGTGC----CTAGGAGTAGGGCATGAC
q equCab2.chr16                                           999999999---9999999----999999999999999999
i equCab2.chr16                  C 0 C 0
s canFam2.chr23                   11580701 30 -  55389570 AGCACAGCC---AAGCTCC----CTAGG----GGGCACCAC
q canFam2.chr23                                           999999999---9999999----99999----999999999
i canFam2.chr23                  C 0 C 0

The first number on the "s hg19" line (146452387 in the example above) is the human start coordinate for this alignment block. Beware: these coordinates are 1 less than the corresponding coordinates in the phyloP files. It is possible that some of your extremely conserved elements will straddle more than one MULTIZ alignment block.

Choosing some reasonable length interval such as 20 bp, create a histogram with a bar for each length interval (starting at 100 bp), with the height of that bar showing the number of extremely conserved elements found that have length in that interval. Create a second histogram with a bar for each human chromosome, with the height of that bar showing the number of extremely conserved elements found on that chromosome's alignment. Both of these charts can be created easily with a tool such as Excel, and will provide a good visual summary of your results. (For samples of such histograms, see the results from a previous year. Your histograms will differ, because those results were based on quite different criteria for extremely conserved elements.)

Extra credit

For each extremely conserved element reported, also report whether it overlaps an annotated human exon, is contained within an intron, or is intergenic (i.e., between two genes). In the latter two cases, report the distance to the nearest exon. Provide a summary table showing the number of elements that are exonic, intronic, and intergenic.
Investigate how the number and distribution of extremely conserved elements vary as you vary the minimum length (100) and minimum average phyloP score (4.8).
Investigate regions that are extremely conserved across many species except human and whose aligned human sequence is very different from the other species. Studying these may give some hints about human-specific evolution.
Your ideas here.

Turn-in Instructions

Your turn-in will consist of the following files, named as shown.

alignment.txt: a listing of all your reported extremely conserved elements. The first two lines of this file contain your name and the total number of extremely conserved elements reported. For each extremely conserved element, report (a) the human coordinates (for example, chrX:24826217-24826562), using the phyloP file coordinates rather than the MULTIZ file coordinates (which are 1 off), (b) the length of this element in human bp, (c) the average phyloP score and (d) the multiple alignment itself. (Note that these four items should be for just the extremely conserved element, not for the entire MULTIZ alignment block or the entire phyloP scoring block in which you found it.) Label each row of the alignment with the species name. Annotate the alignment by marking each column that has phyloP score greater than 4.8 with a + at the bottom of the column, so that anyone can see at a glance the columns that are particularly well conserved. Results from a previous year provide an example of alignments, although again the instructions that year were quite different.
Please format each reported extremely conserved element in this file as follows:
- a blank line separating this element from the previous one (1 line)
- human coordinates (1 line)
- length (1 line)
- average phyloP score (1 line)
- the alignment (S+1 lines if the MULTIZ alignment contains S species, 1 line for the + annotation)
charts.xls (or other appropriate extension): the two histograms described above.
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/13675.