Homework #3: Extreme Interspecies Conservation

CSE 427: Computational Biology
February 2, 2010
due: Friday, February 19
Last update: 2/14

The purpose of this assignment is to explore the 46-vertebrate multiple sequence alignment available on the UCSC Human Genome Browser.

One of the fascinating and mysterious discoveries of recent years is what are called "ultraconserved elements", described in a paper by Bejerano et al. These are long DNA sequences that are surprisingly well conserved all the way from humans to fishes, suggesting that they serve some function. Many of them are quite distant from the nearest genes, many of their functions remain a mystery, and no one really understands why these regions are so extremely conserved.

These ultraconserved elements were originally discovered by studying the vertebrate alignments on the UCSC genome browser. At the time of their discovery in 2004, only 6 or so vertebrate genomes were sequenced and available on that browser. What you will do is find similarly extremely conserved elements in the current version of the UCSC genome browser, looking for ones that are well conserved across (nearly) all 33 placental mammal genomes now available. (You can see which are the placental mammals by looking at the UCSC phylogeny. Be sure to include the primates.)

At a minimum, each of your reported extremely conserved elements should have length at least 300 alignment columns, with at least 90% of the alignment columns identical across at least 30 (of the 33) placental mammal genomes, but the greater these three numbers, the better. An element that is 500 columns long with 95% of the alignment columns identical across all 33 placental mammals would be great. (If, restricting your attention to just 30 of the placental mammals, some column contains the gap character "-" in all 30 species, ignore that column as though it was not there. Thus you need at least 300 columns, at least 90% of which are identical across S ≥ 30 placental mammals, after discarding any column in which those S species all contain the gap character.)

The entire 46-vertebrate MULTIZ multiple sequence alignment is available from the instructional lab linux machines in the directory /projects/instr/10wi/cse427/nobackup. In that directory you will find 24 files chr1.maf, chr2.maf, ..., chr22.maf, chrX.maf, and chrY.maf, where the filename corresponds to the human chromosome that is aligned.

If you search for the first instance of "a score=", this is the first "alignment block". The human chromosome alignment is broken up into many such alignment blocks. You will search each block separately, looking for extremely conserved elements; you do not have to worry about the possibility that an extremely conserved element crosses the boundary between alignment blocks.

Each alignment block consists of one line that begins "s [species name]" for each sequence that is aligned; these lines form the multiple alignment of this block. Ignore lines beginning with "q" or "i" or "e" (or anything other than "a" or "s"). For example, here is a small block:

a score=1718.000000
s hg18.chr3                      146452387 39 + 199501827 aGCACAGAAAGAAAGCCCTAAACTTAGGAGGTGAGCAGC--
s ponAbe2.chr3                   147941226 39 + 202140232 agcacagaaagaaagccctaaacttaggaggagagcagc--
q ponAbe2.chr3                                            999999999999999999999999999999999999999--
i ponAbe2.chr3                   C 0 I 4
s calJac1.Contig3328                163897 23 +    264433 ------------AAGATTT----CTAGGAGGAGAGCAGT--
q calJac1.Contig3328                                      ------------9999999----9999999999999999--
i calJac1.Contig3328             C 0 I 4
s equCab2.chr16                    8422227 34 -  87365405 AGGGCTGCT---GAGGTGC----CTAGGAGTAGGGCATGAC
q equCab2.chr16                                           999999999---9999999----999999999999999999
i equCab2.chr16                  C 0 C 0
s canFam2.chr23                   11580701 30 -  55389570 AGCACAGCC---AAGCTCC----CTAGG----GGGCACCAC
q canFam2.chr23                                           999999999---9999999----99999----999999999
i canFam2.chr23                  C 0 C 0

Your program will read through all the blocks of all 24 files, searching for extremely conserved elements. As a warning, there is 264 GB worth of alignment files you will be processing. But nearly all blocks can be discarded quickly: if a block does not contain at least 30 placental mammals, such as the example given above, or does not have length at least 300 alignment columns, then your program can immediately move on to the next block.

You have some freedom in deciding the algorithm you will use to find extremely conserved elements. In fact, the definition given above (length at least 300 columns, with at least 90% of the alignment columns identical across at least 30 placental mammals) is somewhat ill-defined. For instance, you will have to decide what you want to do if increasing the length from 300 columns to 400 columns decreases the identical columns from 95% to 90% or decreases the number of aligned species with 90% identity from 32 to 31. You will have to decide how to recognize the case that more than 30 placental mammals are aligned but only some subset of 30 of them have the desired percent of identical columns. You will also have to decide what to do if two extremely conserved elements overlap. (It would be reasonable to report only one of them.) But in any case, if one element is contained within another, or they differ from each other in just a few columns, do not report both of them.

You will find an efficient algorithm for this problem in Theorems 1-2 of Algorithms for Locating Extremely Conserved Elements in Multiple Sequence Alignments. The Methods section of that paper presents some ideas for how to handle the possible subsets of the 33 placental mammals. Keep in mind, though, that the filters in that section also handle the case in which a conserved element crosses MULTIZ block boundaries, which you need not handle. That paper came about as a result of previous versions of this homework project.

Note: you no longer have to create the histograms described in this paragraph. Choosing some reasonable length interval such as 20 columns, create a histogram with a bar for each length interval (starting at 300 columns), with the height of that bar showing the number of extremely conserved elements found that have length in that interval. Create a second histogram with a bar for each human chromosome, with the height of that bar showing the number of extremely conserved elements found on that chromosome's alignment. Both of these charts can be created easily with a tool such as Excel, and will provide a good visual summary of your results. (For samples of such histograms, see the results from a previous year. Your histograms will differ, because those results were based on 44 vertebrates rather than 33 placental mammals.)

Finally, you will also report the longest region (even if under 300 columns) that has at least 90% of the alignment columns identical across all 33 placental mammals. This is a simpler problem, and I recommend starting with this. It is a direct application of the algorithm in Theorems 1-2 of Algorithms for Locating Extremely Conserved Elements in Multiple Sequence Alignments.

Extra credit

For each extremely conserved element reported, also report whether it is contained in an annotated human exon, in an intron, or is intergenic (i.e., between two genes). In the latter two cases, report the distance to the nearest exon. Provide a summary table showing the number of elements that are exonic, intronic, and intergenic.
Investigate how the number and distribution of extremely conserved elements varies as you vary the minimum length (300), minimum percent identity (90%), and minimum number of aligned placental mammals (30).
Handle the case that an extremely conserved element crosses the boundary between adjacent MULTIZ alignment blocks.
Investigate regions that are extremely conserved across many species except human and whose aligned human sequence is very different from the other species. Studying these may give some hints about human-specific evolution.
Your ideas here.

Turn-in Instructions

Your turn-in will consist of the following files, named as shown.

description.txt: This short file consists of your name, the total number of extremely conserved elements reported, and a brief paragraph describing the algorithm you used to search for extremely conserved elements. (Explain how you handled the unresolved issues discussed above of increased length vs. decreased percent identity and decreased number of aligned species, more than 30 placental mammals aligned but only a subset of 30 of them with the required identical columns, overlapping extremely conserved elements, etc.)
alignment.txt: a listing of all your reported extremely conserved elements. For each extremely conserved element, report (a) the human coordinates (for example, chrX:24826217-24826562), (b) the number of alignment columns, (c) the number of placental mammals aligned, (d) the percent of columns identical across the aligned placental mammals, and (e) the multiple alignment itself. (Note that these five items should be for just the extremely conserved element, not for the entire MULTIZ alignment block in which you found it. That is, your reported alignment may have fewer columns and fewer species than the MULTIZ alignment block that contains it.) Annotate the alignment by marking each column that is identical across all aligned placental mammals with a * at the bottom of the column, so that anyone can see at a glance which columns are perfectly conserved. Results from a previous year provide an example of alignments with the * annotations.
Please format each reported extremely conserved element in this file as follows:
- human coordinates (1 line)
- number of alignment columns (1 line)
- number S of placental mammals aligned (1 line)
- percent of columns identical across the S aligned species (1 line)
- the alignment (S+1 lines, 1 for the * annotation)
- a blank line separating this element from the next (1 line)
Note: The file charts.xls is no longer required. charts.xls (or other appropriate extension): the two histograms described above.
all33.txt: the single longest alignment that has at least 90% of the alignment columns identical across all 33 placental mammals, formatted as follows:
- human coordinates (1 line)
- number of alignment columns (1 line)
- percent of columns identical across the 33 aligned species (1 line)
- the alignment (34 lines, 1 for the * annotation)
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box.