The purpose of this assignment is to explore the 44-vertebrate multiple sequence alignments available on the UCSC Human Genome Browser.
One of the fascinating and mysterious discoveries of recent years is what are called "ultraconserved elements", described in a paper by Bejerano et al. These are long DNA sequences that are surprisingly well conserved all the way from humans to fishes, suggesting that they serve some function. Many of them are quite distant from the nearest genes, and most of their functions remain a mystery.
These ultraconserved elements were originally discovered by studying the vertebrate alignments on the UCSC genome browser. At the time of their discovery in 2004, only 6 or so vertebrate genomes were sequenced and available on that browser. What you will do is find such ultraconserved elements in the current version of the UCSC genome browser, looking for ones that are well conserved across (nearly) all 44 vertebrate genomes available there.
At a minimum, each of your reported ultraconserved elements should have length at least 100 bp, with at least 80% of the alignment columns identical across at least 40 (of the 44) genomes, but the greater these three numbers, the better. An element that is 500 bp long with 90% of the alignment columns identical across all 44 genomes would be great. (If, restricting your attention to just 40 of the species, some column contains the gap character "-" in all 40 species, ignore that column as though it was not there. Thus you need at least 100 columns, at least 80% of which are identical across S ≥ 40 species, after discarding any column in which those S species all contain the gap character.)
The entire 44-vertebrate MULTIZ multiple sequence alignment is available from the instructional lab linux machines in the directory /projects/instr/09wi/cse427/multiz44way. In that directory you will find files chr*.maf, where the filename corresponds to the human chromosome that is aligned. You will use only the 24 files chr1.maf, chr2.maf, ..., chr22.maf, chrX.maf, and chrY.maf.
If you search for the first instance of "a score=", this is the first "alignment block". The human chromosome alignment is broken up into many such alignment blocks. You will search each block separately, looking for ultraconserved elements; you do not have to worry about the possibility that an ultraconserved element crosses the boundary between alignment blocks.
Each alignment block consists of one line that begins "s [species name]" for each sequence that is aligned; these lines form the multiple alignment of this block. Ignore lines beginning with "q" or "i" or "e" (or anything other than "a" or "s"). For example, here is a small block taken from chr3.maf:
a score=1718.000000 s hg18.chr3 146452387 39 + 199501827 aGCACAGAAAGAAAGCCCTAAACTTAGGAGGTGAGCAGC-- s ponAbe2.chr3 147941226 39 + 202140232 agcacagaaagaaagccctaaacttaggaggagagcagc-- q ponAbe2.chr3 999999999999999999999999999999999999999-- i ponAbe2.chr3 C 0 I 4 s calJac1.Contig3328 163897 23 + 264433 ------------AAGATTT----CTAGGAGGAGAGCAGT-- q calJac1.Contig3328 ------------9999999----9999999999999999-- i calJac1.Contig3328 C 0 I 4 s equCab2.chr16 8422227 34 - 87365405 AGGGCTGCT---GAGGTGC----CTAGGAGTAGGGCATGAC q equCab2.chr16 999999999---9999999----999999999999999999 i equCab2.chr16 C 0 C 0 s canFam2.chr23 11580701 30 - 55389570 AGCACAGCC---AAGCTCC----CTAGG----GGGCACCAC q canFam2.chr23 999999999---9999999----99999----999999999 i canFam2.chr23 C 0 C 0
Your program will read through all the blocks of all 24 files, searching for ultraconserved elements. As a warning, there is 250 GB worth of alignment files you will be processing. But nearly all blocks can be discarded quickly: if a block does not contain at least 40 species, such as the example given above, or does not have length at least 100 alignment columns, then your program can immediately move on to the next block.
You have some freedom in deciding the algorithm you will use to find ultraconserved elements. In fact, the definition given above (length at least 100 bp, with at least 80% of the alignment columns identical across at least 40 species) is somewhat ill-defined. For instance, you will have to decide what you want to do if increasing the length from 100 bp to 200 bp decreases the identical columns from 90% to 85% or decreases the number of aligned species with 80% identity from 43 to 41. You will have to decide how to recognize the case that more than 40 species are aligned but only some subset of 40 of them have the desired percent of identical columns. You will also have to decide what to do if two ultraconserved elements overlap. (It would be reasonable to report only one of them.) But in any case, if one element is contained within another, or they differ from each other in just a few columns, do not report both of them.
Choosing some reasonable length interval such as 20 bp, create a histogram with a bar for each length interval (starting at 100 bp), with the height of that bar showing the number of ultraconserved elements found that have length in that interval. Create a second histogram with a bar for each human chromosome, with the height of that bar showing the number of ultraconserved elements found on that chromosome's alignment. Both of these charts can be created easily with a tool such as Excel, and will provide a good visual summary of your results. (For samples of such histograms, see the results from a previous year. Your histograms will differ, because those results were based on only 17 aligned species.)
Finally, you will also report the longest region (even if under 100 bp) that has at least 80% of the alignment columns identical across all 44 genomes. This is a simpler problem, and I recommend starting with this.
Your turn-in will consist of the following files, named as shown.
Submit these files to the homework drop box at https://catalysttools.washington.edu/collectit/dropbox/lachesis/4587.