CSE 427, Wi '08: Assignment #1, Due: 2/5/08

University of Washington Computer Science & Engineering


Adopt-a-Prokaryote: You each will need your very own prokaryote. Visit the NCBI Microbial Genome page and pick one that looks interesting, or has a cute face, subject to the restrictions below. Carefully note its full name (some species have a number of different strains in these tables, such as Bacillus cereus ATCC 10987, Bacillus cereus ATCC 14579, Bacillus cereus E33L) and RefSeq ID.
Find your prokaryote on the NCBI Microbial FTP page and click on its link. Note that the names used here are sometimes slightly different from those used on the NCBI Microbial Genome page, but the RefSeq IDs should match, e.g. NC_002505 for Vibrio cholerae. That directory should have at least one .fna file of size at least 800 KB as listed in the directory, plus at least one .rnt file and at least one .ptt file. Download these three file types to your computer.
Email Ruzzo the name and RefSeq ID of your new pet. I will post everyone's choices HERE. Check this list before you start, so you don't pick someone else's prokaryote. To reduce the chance of collisions, I'd suggest you pick one as follows: click a random column header in the NCBI list to re-sort the table. Then scroll approximately k-tenths of the way down the table, where k is the low order digit of your student ID. Then pick the first one that satisfies the criteria above.
Also: download the corresponding files for Mycobacterium leprae (NC_002677).
Explore: For your very own personal prokaryote, collect the answers to the following questions. (You can easily find the answers to all but the last 2 of these questions by clicking on the name of your prokaryote on the NCBI Microbial Genome page. If any of the answers happen to be missing for your prokaryote, you may just omit them, unless you feel like trying to search for the answer using Google or something similar. While on your prokaryote's NCBI page, scroll to the bottom and read the paragraphs that describe your prokaryote: this will give you a sense of intimacy and pride in your new adoption.)
- Is it an archaeon (A) or a bacterium (B) ?
- Is it pathogenic? If so, what disease does it cause and in what host?
- What is its habitat?
- What is the optimal temperature of its environment, in degrees Celsius?
- What is the size of its genome, measured in Mbp? (If there is more than one chromosome or there are plasmids, add their sizes together.)
- What is its genome's GC content as a percent? (This is the percent of its genome that is either G or C. If there is more than one chromosome or there are plasmids, weight each one's GC content by its size to find the overall GC content. Round to the nearest percent.)
- How many proteins does it have?
- What is the average length of its proteins, measured in number of amino acids? Round to the nearest integer.
- What percent of its genome is protein-coding? Round to the nearest percent.
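The size-weighted GC calculation mentioned in the list above can be sketched as follows. This is only an illustration: the function name `overall_gc_percent` and the two-replicon numbers are made up, not taken from any real organism.

```python
def overall_gc_percent(replicons):
    """Combine per-replicon GC content into one genome-wide percentage.

    `replicons` is a list of (size_in_bp, gc_percent) pairs, one per
    chromosome or plasmid. Each replicon is weighted by its size.
    """
    total_bp = sum(size for size, _ in replicons)
    gc_bp = sum(size * gc / 100.0 for size, gc in replicons)
    return 100.0 * gc_bp / total_bp


# Hypothetical genome: a 3 Mbp chromosome at 40% GC and a 1 Mbp plasmid at 60% GC.
example = [(3_000_000, 40.0), (1_000_000, 60.0)]
print(round(overall_gc_percent(example)))  # 45 (the unweighted mean would be 50)
```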
For the average length of its proteins, look at the .ptt files you downloaded earlier. These are the protein tables for your prokaryote. (If there is more than one, that means your prokaryote's genome consists of more than just a single chromosome; use all these files.) Each line gives important information about a different protein. The third column is the length of that protein, measured in number of amino acids. Once you have the average of all these lengths, it is a simple calculation to answer the last question: add 1 to the average (for the stop codon), multiply this by 3 to get the average number of nucleotides occurring in codons, multiply by the number of proteins, and take the ratio of this result to the total genome size.
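A minimal sketch of the calculation just described, in Python. It assumes the .ptt columns are tab-separated and that data lines have a location such as `190..255` in the first column, so any header lines are skipped; the function names are my own.

```python
import re

# A data line starts with a location such as "190..255".
LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")


def protein_lengths(ptt_lines):
    """Return the protein lengths (third column) from the lines of a .ptt file.

    Assumes tab-separated columns; lines whose first column is not a
    location (for example header lines) are skipped.
    """
    lengths = []
    for line in ptt_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and LOCATION.match(fields[0]):
            lengths.append(int(fields[2]))
    return lengths


def percent_coding(avg_length, num_proteins, genome_bp):
    """Percent of the genome that is protein-coding, per the recipe above."""
    coding_bp = (avg_length + 1) * 3 * num_proteins  # +1 for the stop codon
    return 100.0 * coding_bp / genome_bp
```

As a sanity check, `percent_coding(336, 1605, 3_268_200)` uses the community prokaryote's example answers (336 amino acids, 1605 proteins, 3.2682 Mbp) and gives about 49.65, which rounds to the 50 shown in the example.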
Long ORFs: Write a program to find all the long open reading frames in your prokaryote's genome. An open reading frame is a contiguous sequence of nucleotide triplets that (a) starts with either of the "start codons" ATG or GTG (a less common, but not rare, start codon in prokaryotes), (b) ends with a stop codon, (c) does not contain any other stop codon that "respects the triplet boundaries", and (d) is as long as possible subject to the conditions (a)-(c). For example, ATGCTAACCTAA qualifies as an open reading frame: its codons are ATG, CTA, ACC, TAA; the TAA starting at position 5 is not in the correct "reading frame" to be a stop codon, that is, it does not respect the codon boundaries. An open reading frame is called "long" if it contains at least 125 codons, that is, its length is at least 375 bp. These long open reading frames are simple, but not too inaccurate, predictions of the organism's protein-coding genes.
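One way the definition above might be turned into code for a single strand is sketched below. Two assumptions are mine rather than the assignment's: the stop codons are taken to be the standard TAA, TAG, and TGA, and the codon count includes both the start and the stop codon. The name `find_orfs` is hypothetical.

```python
STARTS = {"ATG", "GTG"}
STOPS = {"TAA", "TAG", "TGA"}  # assumption: the standard stop codons


def find_orfs(seq, min_codons=125):
    """Return (start, end) pairs, 1-based and inclusive, for ORFs on one strand.

    Each ORF runs from the earliest start codon after the previous in-frame
    stop through the next in-frame stop codon, so it is as long as possible.
    The codon count includes the start and stop codons (an assumption).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the earliest start codon seen so far
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None  # a stop codon closes any open ORF in this frame
            elif start is None and codon in STARTS:
                start = i
    return sorted(orfs)


# The example from the text, with the threshold lowered so it qualifies.
print(find_orfs("ATGCTAACCTAA", min_codons=4))  # [(1, 12)]
```

Keeping only the earliest start codon per stop is what implements condition (d); a later in-frame ATG or GTG before the same stop is ignored.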
Notice that there are 6 possible "reading frames": the start codon can start at a position that is either 0, 1, or 2 modulo 3, and it can be on either of the two DNA strands. (You will be given the DNA sequence of only one of the strands, called the + strand. You have to infer the DNA sequence of the other strand, called the - strand: it is simply the reverse complement of the + strand.) It is even possible that two long open reading frames overlap each other, either on the same strand or on opposite strands.
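Inferring the - strand is a one-liner in most languages. A small Python sketch (the names `COMPLEMENT` and `reverse_complement` are my own; characters other than A, C, G, T are passed through unchanged):

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the - strand, read 5' to 3', for a + strand sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCTAACCTAA"))  # TTAGGTTAGCAT
```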
Given a DNA sequence, your program should output the start and end index in that DNA sequence of each long open reading frame. The first character of the input DNA sequence has index 1, not index 0. Since every open reading frame has length a multiple of 3, the difference between the end and start indices will always be 2 modulo 3. (I'm trying to save you from "off by 1" errors.) For open reading frames on the - strand, give the start and end indices on the complementary + strand. That is, the start index will actually be the index of the nucleotide that is base-paired to the last nucleotide of the stop codon, and the end index is the index of the nucleotide that is base-paired to the first nucleotide of the start codon. (It's confusing, but it will get your output in the right format for the comparison to the .ptt files, described below.)
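If you find - strand ORFs by scanning the reverse complement, the index conversion described above might look like the sketch below. The function name and the example numbers are hypothetical.

```python
def minus_to_plus(start, end, genome_length):
    """Map a 1-based inclusive interval on the reverse complement to + strand indices.

    Position p on the reverse complement pairs with position
    genome_length - p + 1 on the + strand, so the interval flips around:
    the returned start pairs with the last base of the stop codon, and the
    returned end pairs with the first base of the start codon.
    """
    return genome_length - end + 1, genome_length - start + 1


# Hypothetical 20 bp sequence with an ORF at positions 3..14 of its reverse complement.
plus_start, plus_end = minus_to_plus(3, 14, 20)
print(plus_start, plus_end)            # 7 18
print((plus_end - plus_start) % 3)     # 2, as the text says it must be
```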
The DNA sequences for your prokaryote (corresponding to the protein tables) are the .fna files that you downloaded earlier. Together, these .fna files represent the whole genome of your prokaryote. Run your program on all these .fna files to produce a list of all the long open reading frames. These are your gene predictions.
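Reading an .fna file takes only a few lines. This sketch assumes the usual FASTA layout (one header line beginning with `>` followed by sequence lines) and one record per file; `read_fasta` is a name I made up.

```python
def read_fasta(path):
    """Read a single-record FASTA file and return (header, sequence).

    Assumes one '>' header line followed by sequence lines; raises
    ValueError if a second header is found.
    """
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    raise ValueError("expected exactly one record per file")
                header = line[1:]
            else:
                chunks.append(line.upper())
    return header, "".join(chunks)
```

Joining the chunks once at the end avoids repeated string concatenation, which matters for sequences that are millions of characters long.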
You are now going to check your program's gene predictions against the expert annotations. In the .ptt files you downloaded earlier, the first column shows the start and end positions of each gene and the second column shows the strand. Tabulate the following numbers:
- TP = the number of genes for which your program predicted both the start and end positions correctly (True Positives).
- sTP = the number of genes for which your program predicted the stop codon correctly but got the start codon wrong (semi-True Positives). It turns out to be harder to get the start codon right than the stop codon. When calculating sTP, remember that for genes on the - strand this means your program got the start index right and the end index wrong.
- FN = the number of genes in the .ptt files for which your program predicted neither the start nor the end position (False Negatives). Most of these will probably be shorter than 125 amino acids.
- FP = the number of long open reading frames your program found that do not correspond to anything in the .ptt files (False Positives). Most of these will probably be close to 125 amino acids long.
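The four counts above could be tabulated along the following lines. This is one possible reading, not the official one: a prediction is treated as "corresponding" to a gene when the two share a strand and a stop codon position, so an FP is a prediction whose stop matches no annotated gene. The function name `tabulate` and the example coordinates are hypothetical.

```python
def tabulate(genes, predictions):
    """Count (TP, sTP, FN, FP) for one replicon.

    Both arguments are iterables of (start, end, strand) tuples using
    1-based + strand indices, with strand "+" or "-".
    """
    def stop_key(start, end, strand):
        # The stop codon sits at the end index on +, at the start index on -.
        return (strand, end if strand == "+" else start)

    genes = set(genes)
    predicted = set(predictions)
    predicted_stops = {stop_key(*p) for p in predicted}
    annotated_stops = {stop_key(*g) for g in genes}

    tp = stp = fn = 0
    for gene in genes:
        if gene in predicted:
            tp += 1    # start and end both match
        elif stop_key(*gene) in predicted_stops:
            stp += 1   # stop codon matches, start codon does not
        else:
            fn += 1    # no prediction shares this gene's stop codon
    fp = sum(1 for p in predicted if stop_key(*p) not in annotated_stops)
    return tp, stp, fn, fp


# Made-up coordinates, for illustration only.
genes = [(10, 99, "+"), (200, 500, "-"), (600, 800, "+")]
preds = [(10, 99, "+"), (200, 470, "-"), (900, 1300, "+")]
print(tabulate(genes, preds))  # (1, 1, 1, 1)
```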
To normalize these numbers so that they are comparable across different genomes, let A be the total number of proteins listed in all your .ptt files, and let B be the number of long open reading frames your program predicted. Calculate the following normalized statistics, rounded to 4 decimal places:
- Sn = TP / A (Sensitivity)
- sSn = (TP + sTP) / A (semi-Sensitivity)
- FOR = FN / A (False Omission Rate)
- PPV = TP / B (Positive Predictive Value)
- sPPV = (TP + sTP) / B (semi-Positive Predictive Value)
- FDR = FP / B (False Discovery Rate)
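The six ratios above are straightforward to compute once the counts are in hand. A short sketch, with a hypothetical function name and made-up counts:

```python
def normalized_stats(tp, stp, fn, fp, num_genes, num_predictions):
    """Return [Sn, sSn, FOR, PPV, sPPV, FDR], each rounded to 4 decimal places.

    num_genes is A (proteins listed in the .ptt files) and
    num_predictions is B (long ORFs predicted).
    """
    a, b = num_genes, num_predictions
    values = [tp / a, (tp + stp) / a, fn / a, tp / b, (tp + stp) / b, fp / b]
    return [round(v, 4) for v in values]


# Made-up counts: A = 10 annotated genes, B = 10 predicted ORFs.
print(normalized_stats(6, 2, 2, 2, 10, 10))  # [0.6, 0.8, 0.2, 0.6, 0.8, 0.2]
```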
I'm not particular about programming language; C, C++, Java, Perl, Python, Ruby are all fine. Talk to me if you want to use something else. (For example, I worry that things like VBA or Matlab might be too slow.)
Extra credit: Investigate what went wrong in your predictions and how you could improve the accuracy of your program.
- For the semi-true positives, how could you refine the definition of open reading frame in order to predict the start codon more accurately? For instance, in E. coli approximately 88% of the start codons are ATG, 10% GTG, and 2% TTG. Can you use these biases to improve the start codon predictions?
- Are there any false negatives whose lengths are at least 125 amino acids? If so, why are they not open reading frames?
- Do some of the false positives and false negatives overlap, so that some algorithmic change would have turned a false positive into a true positive?
- Do many of your false positives have a significant overlap with a true positive, so that you could possibly eliminate them by disallowing long overlaps? But you need some way of deciding which is the true and which the false positive.
- Do many of your semi-true positives have a significant overlap with a true positive or another semi-true positive, so that you could possibly correct the start codon prediction by disallowing long overlaps? But you need some way of deciding which are the correct start codons.
- If you decrease the length threshold below 125, you can probably decrease FN, but investigate the corresponding increase in FP.
- Your ideas here.

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in separate files, as explained below.

You will actually run your program as described in part (2) above (Long ORFs) on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mycobacterium leprae. For each of these individually, you will produce the list of long open reading frames, and the 10 statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR.

Your turn-in will consist of the following files, named as shown. The first line of each of the first 3 files should contain your name and the name of the prokaryote.

community.txt: the results of running on the community prokaryote. After the name of the prokaryote, the next 10 lines should be the statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR in that order, one number per line with no labels. The remaining lines provide the list of long open reading frames, one per line, sorted in increasing order of replicon ID, and within one replicon ID sorted in increasing order of start index. (The replicon ID is the portion of the .ptt file name that precedes the .ptt extension. For instance, for the community prokaryote, the file NC_002677.ptt corresponds to replicon ID NC_002677.) Each of these lines consists of 4 items separated by tabs: replicon ID, start index, end index, and strand (denoted by the single character + or -). For example, a line in the community prokaryote's output file might look like this:
```
NC_002677    13076    13744    +
```
personalORF.txt: the results of running on your personal prokaryote. This should just be the list of long open reading frames, in the same format described above.
personalStats.txt: This file will have 20 lines for your personal prokaryote. After the name of the prokaryote, the next 10 lines will be the statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR in that order, one number per line with no labels. The remaining 9 lines will be the answers to the questions in part (1) (Explore) for your personal prokaryote, in the form illustrated in the example below. If you could not find the answer to a particular question, write "N/A" on that line.
Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
Any files describing extra credit work, whose filenames should begin with the prefix "extra".
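The ORF lines for community.txt and personalORF.txt (tab-separated, sorted by replicon ID and then by start index) might be produced as in this sketch. The function name is hypothetical, and the second ORF in the example is made up; only the first comes from the sample line above.

```python
def format_orf_lines(orfs):
    """Return tab-separated output lines for (replicon_id, start, end, strand)
    tuples, sorted by replicon ID and then by start index."""
    ordered = sorted(orfs, key=lambda orf: (orf[0], orf[1]))
    return ["\t".join(str(item) for item in orf) for orf in ordered]


orfs = [
    ("NC_002677", 20000, 20500, "-"),  # made-up ORF, for illustration only
    ("NC_002677", 13076, 13744, "+"),  # the sample line from the text
]
for line in format_orf_lines(orfs):
    print(line)
```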

Zip these files together and upload them via the UW Catalyst system, here.

Example of format for part (1) answers. (These happen to be the answers for the community prokaryote.)

B
Leprosy in humans
Host-associated
37
3.2682
57.8
1605
336
50

