Homework #3: Strong Interspecies Conservation

CSE 427: Computational Biology
February 3, 2009
due: Thursday, February 19
Last update: 2/17

The purpose of this assignment is to explore the 44-vertebrate multiple sequence alignments available on the UCSC Human Genome Browser.

One of the fascinating and mysterious discoveries of recent years is what are called "ultraconserved elements", described in a paper by Bejerano et al. These are long DNA sequences that are surprisingly well conserved all the way from humans to fishes, suggesting that they serve some function. Many of them are quite distant from the nearest genes, and most of their functions remain a mystery.

These ultraconserved elements were originally discovered by studying the vertebrate alignments on the UCSC genome browser. At the time of their discovery in 2004, only 6 or so vertebrate genomes were sequenced and available on that browser. What you will do is find such ultraconserved elements in the current version of the UCSC genome browser, looking for ones that are well conserved across (nearly) all 44 vertebrate genomes available there.

At a minimum, each of your reported ultraconserved elements should have length at least 100 bp, with at least 80% of the alignment columns identical across at least 40 (of the 44) genomes, but the greater these three numbers, the better. An element that is 500 bp long with 90% of the alignment columns identical across all 44 genomes would be great. (If, restricting your attention to just 40 of the species, some column contains the gap character "-" in all 40 species, ignore that column as though it was not there. Thus you need at least 100 columns, at least 80% of which are identical across S ≥ 40 species, after discarding any column in which those S species all contain the gap character.)

The entire 44-vertebrate MULTIZ multiple sequence alignment is available from the instructional lab linux machines in the directory /projects/instr/09wi/cse427/multiz44way. In that directory you will find files chr*.maf, where the filename corresponds to the human chromosome that is aligned. You will use only the 24 files chr1.maf, chr2.maf, ..., chr22.maf, chrX.maf, and chrY.maf.

If you search for the first instance of "a score=", this is the first "alignment block". The human chromosome alignment is broken up into many such alignment blocks. You will search each block separately, looking for ultraconserved elements; you do not have to worry about the possibility that an ultraconserved element crosses the boundary between alignment blocks.

Each alignment block consists of one line that begins "s [species name]" for each sequence that is aligned; these lines form the multiple alignment of this block. Ignore lines beginning with "q" or "i" or "e" (or anything other than "a" or "s"). For example, here is a small block taken from chr3.maf:

a score=1718.000000
s hg18.chr3                      146452387 39 + 199501827 aGCACAGAAAGAAAGCCCTAAACTTAGGAGGTGAGCAGC--
s ponAbe2.chr3                   147941226 39 + 202140232 agcacagaaagaaagccctaaacttaggaggagagcagc--
q ponAbe2.chr3                                            999999999999999999999999999999999999999--
i ponAbe2.chr3                   C 0 I 4
s calJac1.Contig3328                163897 23 +    264433 ------------AAGATTT----CTAGGAGGAGAGCAGT--
q calJac1.Contig3328                                      ------------9999999----9999999999999999--
i calJac1.Contig3328             C 0 I 4
s equCab2.chr16                    8422227 34 -  87365405 AGGGCTGCT---GAGGTGC----CTAGGAGTAGGGCATGAC
q equCab2.chr16                                           999999999---9999999----999999999999999999
i equCab2.chr16                  C 0 C 0
s canFam2.chr23                   11580701 30 -  55389570 AGCACAGCC---AAGCTCC----CTAGG----GGGCACCAC
q canFam2.chr23                                           999999999---9999999----99999----999999999
i canFam2.chr23                  C 0 C 0

Your program will read through all the blocks of all 24 files, searching for ultraconserved elements. As a warning, there is 250 GB worth of alignment files you will be processing. But nearly all blocks can be discarded quickly: if a block does not contain at least 40 species, such as the example given above, or does not have length at least 100 alignment columns, then your program can immediately move on to the next block.

You have some freedom in deciding the algorithm you will use to find ultraconserved elements. In fact, the definition given above (length at least 100 bp, with at least 80% of the alignment columns identical across at least 40 species) is somewhat ill-defined. For instance, you will have to decide what you want to do if increasing the length from 100 bp to 200 bp decreases the identical columns from 90% to 85% or decreases the number of aligned species with 80% identity from 43 to 41. You will have to decide how to recognize the case that more than 40 species are aligned but only some subset of 40 of them have the desired percent of identical columns. You will also have to decide what to do if two ultraconserved elements overlap. (It would be reasonable to report only one of them.) But in any case, if one element is contained within another, or they differ from each other in just a few columns, do not report both of them.

Choosing some reasonable length interval such as 20 bp, create a histogram with a bar for each length interval (starting at 100 bp), with the height of that bar showing the number of ultraconserved elements found that have length in that interval. Create a second histogram with a bar for each human chromosome, with the height of that bar showing the number of ultraconserved elements found on that chromosome's alignment. Both of these charts can be created easily with a tool such as Excel, and will provide a good visual summary of your results. (For samples of such histograms, see the results from a previous year. Your histograms will differ, because those results were based on only 17 aligned species.)

Finally, you will also report the longest region (even if under 100 bp) that has at least 80% of the alignment columns identical across all 44 genomes. This is a simpler problem, and I recommend starting with this.

Extra credit

For each ultraconserved element reported, also report whether it is contained in an annotated human exon, in an intron, or is intergenic (i.e., between two genes). In the latter two cases, report the distance to the nearest exon. Provide a summary table showing the number of elements that are exonic, intronic, and intergenic.
Investigate how the number and distribution of ultraconserved elements varies as you vary the minimum length (100), minimum percent identity (80%), and minimum number of aligned species (40).
Handle the case that an ultraconserved element crosses the boundary between adjacent MULTIZ alignment blocks.
Investigate regions that are ultraconserved across many species except human and whose aligned human sequence is very different from the other species. Studying these may give some hints about human-specific evolution.
Your ideas here.

Turn-in Instructions

Your turn-in will consist of the following files, named as shown.

description.txt: This short file consists of your name, the total number of ultraconserved elements reported, and a brief paragraph describing the algorithm you used to search for ultraconserved elements. (Explain how you handled the unresolved issues discussed above of increased length vs. decreased percent identity and decreased number of aligned species, more than 40 species aligned but only a subset of 40 of them with the required identical columns, overlapping ultraconserved elements, etc.)
alignment.txt: a listing of all your reported ultraconserved elements. For each ultraconserved element, report (a) the human coordinates (for example, chrX:24826217-24826562), (b) the number of alignment columns, (c) the number of species aligned, (d) the percent of columns identical across the aligned species, and (e) the multiple alignment itself. (Note that these five items should be for just the ultraconserved element, not for the entire MULTIZ alignment block in which you found it. That is, your reported alignment may have fewer columns and fewer species than the MULTIZ alignment block that contains it.) Annotate the alignment by marking each column that is identical across all aligned species with a * at the bottom of the column, so that anyone can see at a glance which columns are perfectly conserved. Results from a previous year provide some examples of alignments with the * annotations. (Unlike the examples shown, be sure to label each line of your alignment with the species name, either at the beginning or end of that line. You may use the species abbreviation as given on the "s" line of the .maf file if you like.) Please format each reported ultraconserved element in this file as follows:
- human coordinates (1 line)
- number of alignment columns (1 line)
- number S of species aligned (1 line)
- percent of columns identical across the S aligned species (1 line)
- the alignment (a multiple of S+1 lines, 1 for the * annotation)
- a blank line separating this element from the next (1 line)
charts.xls (or other appropriate extension): the two histograms described above.
all44.txt: the single longest alignment that has at least 80% of the alignment columns identical across all 44 genomes, formatted as follows:
- human coordinates (1 line)
- number of alignment columns (1 line)
- percent of columns identical across the 44 aligned species (1 line)
- the alignment (a multiple of 45 lines, 1 for the * annotation)
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalysttools.washington.edu/collectit/dropbox/lachesis/4587.