To find the average length of your prokaryote's proteins, locate the prokaryote on the NCBI Microbial FTP page and click on its link. Download all the .ptt files you find there. These are the protein tables for your prokaryote. (If there is more than one, your prokaryote's genome consists of more than a single chromosome; use all of these files.) Each line gives information about a different protein. The third column is the length of that protein, measured in amino acids. Once you have the average of all these lengths, it is a simple calculation to answer the last question: add 1 to the average (for the stop codon), multiply by 3 to get the average number of nucleotides per gene that occur in codons, multiply by the number of proteins, and take the ratio of this result to the total genome size.
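The calculation described above can be sketched in a few lines of Python. This is only an illustration, not a required implementation: the function names are made up, and it assumes the .ptt file is tab-separated, with data rows whose first field looks like "13076..13744" and whose third field is the protein length.

```python
import re

# Assumption: data rows start with a location field such as "13076..13744";
# header lines do not, so they are skipped.
LOCATION = re.compile(r"^\d+\.\.\d+$")

def protein_lengths(ptt_text):
    """Return the third-column lengths (in amino acids) of all protein rows."""
    lengths = []
    for line in ptt_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and LOCATION.match(fields[0]):
            lengths.append(int(fields[2]))
    return lengths

def coding_fraction(lengths, genome_size):
    """(average length + 1 stop codon) * 3 nucleotides * number of proteins,
    divided by the total genome size in nucleotides."""
    average = sum(lengths) / len(lengths)
    return (average + 1) * 3 * len(lengths) / genome_size
```

If your prokaryote has several .ptt files, concatenate the length lists from all of them before calling coding_fraction, and pass the combined size of all the corresponding sequences as genome_size.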
Notice that there are 6 possible "reading frames": the start codon can start at a position that is either 0, 1, or 2 modulo 3, and it can be on either of the two DNA strands. (You will be given the DNA sequence of only one of the strands, called the + strand. You have to infer the DNA sequence of the other strand, called the - strand: it is simply the reverse complement of the + strand.) It is even possible that two long open reading frames overlap each other, either on the same strand or on opposite strands.
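Inferring the - strand is a short operation. The sketch below is one possible way to do it in Python; the function name is illustrative. Characters other than A, C, G, and T (for example N) are left unchanged, which is an assumption about how you want to treat them.

```python
# Translation table mapping each nucleotide to its base-pairing partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the - strand, read 5' to 3', for a + strand sequence."""
    return seq.translate(COMPLEMENT)[::-1]
```

Scanning the + strand and its reverse complement, each starting at offsets 0, 1, and 2, covers all 6 reading frames.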
Given a DNA sequence, your program should output the start and end index in that DNA sequence of each long open reading frame. The first character of the input DNA sequence has index 1, not index 0. Since the length of every open reading frame is a multiple of 3, the difference between the end and start indices will always be 2 modulo 3. (I'm trying to save you from "off by 1" errors.) For open reading frames on the - strand, give the start and end indices on the complementary + strand. That is, the start index will actually be the index of the nucleotide that is base-paired to the last nucleotide of the stop codon, and the end index will be the index of the nucleotide that is base-paired to the first nucleotide of the start codon. (It's confusing, but it will get your output in the right format for the comparison to the .ptt files, described below.)
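The index conversion for the - strand can be written down directly. The sketch below assumes you found the open reading frame by scanning the reverse complement, and that start and end are 1-based positions in that reverse-complemented sequence; the function name is illustrative. Position p on the reverse complement is base-paired to position n + 1 - p on the + strand, where n is the sequence length.

```python
def minus_to_plus(start, end, genome_length):
    """Convert 1-based ORF indices on the reverse complement (start codon at
    `start`, stop codon ending at `end`) to + strand indices."""
    plus_start = genome_length + 1 - end    # pairs with last base of the stop codon
    plus_end = genome_length + 1 - start    # pairs with first base of the start codon
    return plus_start, plus_end
```

The converted pair still satisfies the check above: the difference between end and start is unchanged, so it is still 2 modulo 3.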
Go back to the NCBI Microbial FTP page and download all the .fna files for your prokaryote. These are the DNA sequences corresponding to the protein tables that you downloaded earlier. Together, these .fna files represent the whole genome of your prokaryote. Run your program on all these .fna files to produce a list of all the long open reading frames. These are your gene predictions.
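Reading a .fna file takes only a few lines. The sketch below assumes the usual FASTA layout: a header line beginning with ">" followed by one or more lines of sequence. The function name and the choice to key each record by the first word of its header are illustrative.

```python
def read_fasta(text):
    """Return a dict mapping each record name to its full uppercase sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}
```

Joining the sequence lines before searching matters, because an open reading frame will usually span many lines of the file.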
You are now going to check your program's gene predictions against the expert annotations. In the .ptt files you downloaded earlier, the first column shows the start and end positions of each gene and the second column shows the strand. Tabulate the following numbers:
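To compare against the annotations you first need them in a convenient form. The sketch below is one possible approach, assuming the .ptt file is tab-separated with a first field like "13076..13744" and a second field of "+" or "-". The function name is illustrative, and the set-of-tuples representation is just one choice.

```python
def parse_annotations(ptt_text):
    """Return a set of (start, end, strand) tuples, one per annotated gene."""
    genes = set()
    for line in ptt_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # header lines without tab-separated columns
        start, sep, end = fields[0].partition("..")
        if sep and start.isdigit() and end.isdigit() and fields[1] in ("+", "-"):
            genes.add((int(start), int(end), fields[1]))
    return genes
```

With the annotations in a set, checking whether a predicted open reading frame appears among them is a single membership test, for example (13076, 13744, "+") in genes.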
To normalize these numbers so that they are comparable across different genomes, let A be the total number of proteins listed in all your .ptt files, and let B be the number of long open reading frames your program predicted. Calculate the following normalized statistics, rounded to 4 decimal places:
You will actually run your program as described in part (2) above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Bartonella quintana. For each of these individually, you will produce the list of long open reading frames, and the 10 statistics TP, sTP, FN, FP, Sn, sSn, FOR, PPV, sPPV, and FDR.
Your turn-in will consist of the following files, named as shown. The first line of each of the first 3 files should contain the name of the prokaryote.
NC_005955 13076 13744 +
Zip these files together and send them to the TA as an attachment. Warning: if you send a .zip attachment from your @u account, it will be removed by the UW mailer, so send it from your @cs account.
B
Trench fever in humans
Host-associated
37
1.58138
38
1142
333
72