Prediction of Protein Coding Genes

In this assignment we will look at the basics of gene finding. This write-up is lengthy, but the basic ideas are simple. For a specific prokaryotic genome, you are going to identify all plus-strand open reading frames (ORFs), score each according to a Markov model trained on a subset of the genome, attempt to classify each plus-strand ORF as a "real" protein-coding gene or not, based on ORF length, MM score, or a combination of both, and then evaluate the success of the classification against a trusted annotation of the genome obtained from GenBank. To avoid some tedious bookkeeping, we will make several simplifying assumptions. Details follow.
In lecture, I described 2 basic ideas that are exploited in nearly all gene finding programs.
Idea #1: look for long open reading frames (ORFs).
Definition: An ORF is a (maximal length) sequence of consecutive nucleotide triplets, none of which is a stop codon (TAA, TAG, or TGA). "Maximal length" implies that the triplets immediately preceding and immediately following an ORF are stop codons (or that there is no adjacent triplet, as will happen at the beginning and end of the genome sequence). In prokaryotes (bacteria and archaea), essentially every protein-coding gene falls at the end of an ORF. (As promised, we are simplifying, principally in that we aren't going to bother to look for the start codon.) ORFs are easy to find. Starting at the leftmost triplet of the genome, scan successive, adjacent, non-overlapping triplets until you find a stop; then continue until you find another stop, etc. Those are all the ORFs whose start positions have indices congruent to 1 mod 3. Then repeat the whole process from the left end, but omitting the first one or two nucleotides, to account for the other two possible reading frames (congruent to 2 mod 3 and 0 mod 3, respectively). [One should also repeat this from the right end on the Watson-Crick complement of the sequence; again in the interest of simplicity, we're going to completely ignore that.]
The first part of this problem is to write a program to do the above process. We will use the genome sequence of Methanocaldococcus jannaschii, the same one you downloaded for HW 3. Again, ignore the two shorter plasmid sequences at the end of the .fna file, and treat nucleotide symbols other than A, C, G as T. Record the indices of the first and last nucleotide in each ORF you find. (Recall, by convention, genome indices are 1-based, not 0-based.)
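For concreteness, here is one minimal Python sketch of this three-frame scan. It assumes the chromosome is already loaded as an uppercase A/C/G/T string; the function and variable names are mine, not part of the assignment.

```python
def find_orfs(genome):
    """Return (start, end) 1-based inclusive nucleotide coordinates of
    every plus-strand ORF, scanning all three reading frames."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):          # frame offsets 0, 1, 2
        start = frame               # 0-based index of current ORF's first codon
        for i in range(frame, len(genome) - 2, 3):
            if genome[i:i + 3] in stops:
                if i > start:       # non-empty ORF before this stop
                    orfs.append((start + 1, i))   # convert to 1-based
                start = i + 3
        # an ORF may run off the end of the sequence with no closing stop
        if start < len(genome) - 2:
            end = start + 3 * ((len(genome) - start) // 3)
            orfs.append((start + 1, end))
    return orfs
```

Note that the trailing case handles the "no adjacent triplet" clause of the definition: an ORF at the right end of the sequence is still maximal even though no stop follows it.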
Idea #2: Look for ORFs having sequence statistics that are more typical of known genes than of noncoding DNA.
Here's one simple approach, based on fixed order Markov models. A k-th order Markov model predicts successive letters in a sequence based on the previous k letters. Specifically, for k=3, say, the probability of sequence x = x1 x2 ... xn is
P(x) = P(x1 x2 x3) * P(x4 | x1 x2 x3) * P(x5 | x2 x3 x4) * ... * P(xn | xn-3 xn-2 xn-1)

(This is a "plain" Markov model, not a hidden Markov model.) Note that all but the first term in the product are conditional on just the previous k=3 letters. If you know the various probabilities, it is simple to evaluate the probability that a given sequence was generated by this model: take the first k-mer x1 x2 x3 of the sequence, and use it to look up P(x1 x2 x3); for each subsequent letter xi starting with i=4, look up its conditional probability P(xi | xi-3 xi-2 xi-1) based on the k letters preceding it in the sequence. P(x) is the product of these probabilities. (Or, more conveniently, log P(x) is the sum of their logs.)
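Given probability tables, the evaluation step might be sketched as follows. The table layout here (a dict from the initial k-mer to its probability, and a dict from (context, next letter) pairs to conditional probabilities) is an assumption of this sketch, not something the assignment dictates.

```python
import math

def log_prob(x, p_init, p_cond, k):
    """log P(x) under a k-th order Markov model: the initial k-mer
    probability plus one conditional term per subsequent letter."""
    lp = math.log(p_init[x[:k]])
    for i in range(k, len(x)):
        lp += math.log(p_cond[(x[i - k:i], x[i])])
    return lp
```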
Getting back to the gene prediction problem, suppose I have a "foreground" set of probabilities like this for real genes, and a similar set of probabilities, say Q( . | . ), for "background" sequences that are not genes. Then I could predict whether a given ORF sequence x was a real gene by comparing P(x) to Q(x). E.g., using a likelihood ratio test, log(P(x)/Q(x)) > 0 implies that the sequence is more likely to have been generated by the gene model than the background model.
Where do the probabilities P come from? Simply take a "trusted" set of example genes and count the frequencies of all (k+1)-letter (contiguous) subsequences. E.g., if k=2 and your training set is just the one sequence AAAAGCTA, then AAA occurs twice, and AAG, AGC, GCT, and CTA each occur once, so P(A | AA) = 2/3, P(G | AA) = 1/3, P(A | CT) = 1/1, etc. Estimate unconditional probabilities for k-letter sub-sequences in the same way. E.g., 3/7 of the adjacent letter pairs in the training set are AA so P(AA) = 3/7, P(AG) = 1/7, etc. Take as your trusted set all long ORFs (defined, arbitrarily, as those longer than 1400 nucleotides); most or all of them are real genes. As your background frequencies, apply the same process to the reverse complement of all long ORFs. (For simplicity, don't split the rev comp into ORFs or worry about reading frame here; just take the full rev-complement. Ignoring these issues will include some stop codons in the background training data, but I don't think that will significantly affect scoring. Use of the rev-comp as the background is justified by the thought that the majority of the genome is coding on one strand or the other, and the long minus-strand ORF of a minus-strand gene will tend to be broken into many short spurious ORFs on the plus strand, in all three reading frames, and will tend to match the sequence statistics seen in reverse complements of protein coding sequences. You might also recognize that a plus-strand gene viewed in either of the two incorrect plus-strand reading frames will also tend to be fragmented into short plus-strand ORFs; you may ignore this issue for the basic assignment, but see the extra credit section for a conceptually simple solution.)
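The counting scheme just described can be sketched like this (no pseudocounts yet; see the notes below). It reproduces the k=2 example above: P(AA) = 3/7 and P(A | AA) = 2/3. The table layout is my own choice.

```python
from collections import Counter

def train(seqs, k):
    """Estimate P(k-mer) and P(next letter | k-mer context) from counts
    of all contiguous k-mers / (k+1)-mers in the training sequences."""
    kmer, kp1 = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer[s[i:i + k]] += 1
        for i in range(len(s) - k):
            kp1[s[i:i + k + 1]] += 1
    total = sum(kmer.values())
    p_init = {w: c / total for w, c in kmer.items()}
    # normalize each (k+1)-mer count by its context's total follower count
    ctx = Counter()
    for w, c in kp1.items():
        ctx[w[:k]] += c
    p_cond = {(w[:k], w[k]): c / ctx[w[:k]] for w, c in kp1.items()}
    return p_init, p_cond
```

Note the conditional denominators come from the (k+1)-mer counts, so a k-mer at the very end of a sequence (with no following letter) does not distort them.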
Two notes. First, the "natural" thing to do to estimate P(x1 x2 ... xk) is to base it on only the first k letters of the training sequences. However (continuing the k=2 example), with only that one training sequence you would end up with P(AA)=1 and the probability of every other letter pair being zero. This is appropriate if you are confident that all instances start with AA, but not if the available set of training samples is not comprehensive enough to be fully representative. The suggestion above ("count k-mers throughout the training sequence(s), not just at their starts") is basically equivalent to assuming that the starts of sequences are statistically indistinguishable from their interiors. (If we were including ATG start codons in our gene models, then starts would be different from interiors (P(AT)=1, e.g.), but since I decided to ignore starts, and there will be relatively few training sequences longer than 1400, training P(x1 x2 ... xk) on full sequences rather than just their first k letters seems plausible.)

Second, there are other strategies for dealing with sparse training data. "Pseudocounts" (discussed in lecture) are a prominent example; while there is a lot of training data in the M. jannaschii example, when k=5 there are a few k-mers with zero counts, so you should include a pseudocount of 1 everywhere. Extra credit option: explore alternative choices here.
Calculate the P and Q tables as above, including pseudocounts, for k=5, then compute log(P(x)/Q(x)) for each ORF of length k or greater (the ORF's Markov model score; for comparability of results, use natural logs this time).
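One way to fold in the pseudocount of 1 is to initialize every possible (k+1)-mer count to 1 before counting the training data, so that no conditional probability is ever zero. A sketch (the function name and table layout are mine):

```python
import itertools
import math

def pseudo_log_cond(seqs, k):
    """Conditional log-probabilities log P(next | k-mer context), with a
    pseudocount of 1 added to every (k+1)-mer over the ACGT alphabet."""
    alpha = "ACGT"
    counts = {"".join(w): 1 for w in itertools.product(alpha, repeat=k + 1)}
    for s in seqs:
        for i in range(len(s) - k):
            counts[s[i:i + k + 1]] += 1
    ctx = {}
    for w, c in counts.items():
        ctx[w[:k]] = ctx.get(w[:k], 0) + c
    return {w: math.log(c / ctx[w[:k]]) for w, c in counts.items()}
```

With no training data at all, every conditional probability collapses to 1/4, which is the intended behavior of a uniform pseudocount.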
Evaluation:
The Genbank annotation file
GCF_000091665.1_ASM9166v1_genomic.gff
that you downloaded for HW 3 will also be used here as the "gold
standard" against which to evaluate your gene-finder. The format is described
here,
but I think most of what you need to know will be evident from the following example.
Specifically, it contains (among other things) the coordinates of all the presumed
protein-coding genes, each identified by lines like
NC_000909.1	Protein Homology	CDS	4252	4566	.	+	0	ID=cds-MJ_RS00015;Parent=gene-MJ_RS00015;Dbxref=GeneID:27929939;Note=frameshifted%3B incomplete%3B partial in the middle of a contig%3B missing N-terminus;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012979808.1;locus_tag=MJ_RS00015;partial=true;product=transporter;pseudo=true;start_range=.,4252;transl_table=11

NC_000909.1	Protein Homology	CDS	1664008	1664862	.	-	0	ID=cds-WP_010871206.1;Parent=gene-MJ_RS08935;Dbxref=Genbank:WP_010871206.1,GeneID:1452591;Name=WP_010871206.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010871206.1;locus_tag=MJ_RS08935;product=zinc metalloprotease HtpX;protein_id=WP_010871206.1;transl_table=11

##sequence-region NC_001732.1 1 58407

Fields are tab-separated. "CDS" in field 3 is shorthand for "coding sequence", and the numbers in fields 4 & 5 are its start and end positions, up to and including the last nucleotide of the stop codon following the ORF. The numbering scheme counts the first position of the genome as 1, not zero. The +/- in field 7 indicates strand; you may ignore minus-strand genes, and all the other goo in the file. [For precision below, let me call these non-ignored entries "simple plus strand CDSs."] Again, process only the part of the file corresponding to the long chromosome, which ends at approximately line 3700; for reference, the last two lines in the example above are the end of the long chromosome and the start of the first plasmid. (If it helps in your file parsing, there are about a dozen "comment" lines starting with "#" at the front of the file, but I think there are none between there and the "##" line shown above. Entries appear to be sorted by start coordinate. Scattershot testing suggests that this unix pipe:
head -n 3700 GCF_000091665.1_ASM9166v1_genomic.gff | grep 'Protein Homology.CDS.[0-9]*.[0-9]*.[.].[+]' > plusgenes-subset.gff

will extract the desired plus-strand genes, although it does discard some context relevant to a few oddball cases. FYI, the 5 unbracketed dots in the grep pattern each match a single tab character; it would be better to replace each by the two-character escape sequence "backslash-t", but it is tricky to put that into this web page in a way that I can be sure will be correctly rendered in your browser...)
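If you would rather parse the file directly than use the shell pipe, a rough Python equivalent might look like this. The filtering criteria mirror the grep pattern (field 2 "Protein Homology", field 3 "CDS", strand "+"); treat it as a sketch, not a validated GFF parser.

```python
def plus_strand_cds(gff_path, max_line=3700):
    """(start, end) coordinate pairs for the 'simple plus strand CDSs'
    among the first max_line lines of the GFF file (the long chromosome)."""
    genes = []
    with open(gff_path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > max_line:
                break
            if line.startswith("#"):       # comment / ## directive lines
                continue
            fields = line.rstrip("\n").split("\t")
            if (len(fields) >= 9 and fields[1] == "Protein Homology"
                    and fields[2] == "CDS" and fields[6] == "+"):
                genes.append((int(fields[3]), int(fields[4])))
    return genes
```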
You will use the .gff file to evaluate the quality of your predictions, but keep in mind that the real goal is to envisage procedures that would help annotate a new genome, one for which no .gff file exists. I.e., can we come up with a procedure that would more or less automatically produce reasonably accurate identification of the protein-coding genes in a newly sequenced organism?
Procedurally, I suggest you build a table containing, for each ORF (there are around 10^5 of them), a summary consisting of its starting/ending coordinates, length, and Markov model score, together with its status (is/is not among the "simple plus strand CDSs" in the .gff file).
Perhaps the simplest plausible gene prediction approach is to guess that every ORF longer than some fixed threshold is a gene. How good is this? What threshold? As detailed below, you will use your ORF finder to examine these questions, by comparing ORFs to the GenBank annotations. Correctly predicting the exact starting positions of genes is difficult, but matching the other end is easier - it will be a stop codon. Again to (over-) simplify the problem, we will declare an ORF to be a successful gene prediction if its stop codon exactly matches the position of a stop codon in one of the "simple plus strand CDSs" in the .gff file. Except in rare cases, which we will again ignore, each such ORF will be at least as long as the real gene.
I think you will see that this simple method is quite accurate for long genes (given my simplified definitions, it cannot fail to find every gene, and above some length threshold it rarely makes false predictions). For shorter genes, however, its rate of false predictions rises. Can we do better using the Markov model? You may be disappointed to find that Markov model scores alone are not better, and the rest of the assignment explores comparing and combining the two.
Outputs: So much data, so little time. What to do with it, how to summarize it?
Generate a single plot showing ROC curves with respect to (a) length threshold, say in red, and (b) Markov model score, say in green, using the full 0-1 range for both axes. Additionally, (c) generate such a plot "zoomed-in" to the upper-left corner to show the crossover between the two curves. For maximum clarity, you probably do not want the two axes to be equally scaled in the "zoomed" view. (Step (c) is important since there are many, many more short ORFs (mostly non-genes) than long ones (mostly real genes), which drives the ROC curves into the upper-left corner.) Also calculate and show Area Under the Curve (AUC) for each curve.
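If you are not using a stats package, the ROC points and trapezoidal AUC can be computed by hand along these lines. This sketch assumes distinct scores; tied scores should really be stepped through as a single group, and the labels are the is/is-not-a-CDS status from your table.

```python
def roc_points(scores, labels):
    """ROC curve for a numeric score (higher = more gene-like) against
    boolean labels (True = annotated gene). Returns (fpr, tpr) lists."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:                 # sweep the threshold downward
        if labels[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))
```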
If your only option was to predict based on an ORF length threshold, what is the maximum threshold that would achieve a true positive rate of at least 80%, how many true positives and how many false positives would you see using that threshold? Optionally, plot this point on the ROC curve above.
If your only option was to predict based on a log Markov model score threshold, what is the maximum threshold that would achieve a true positive rate of at least 80%, how many true positives and how many false positives would you see using that threshold? Optionally, plot this point on the ROC curve above.
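Both of these threshold questions can be answered with the same small routine, sketched here for a generic score (ORF length in one case, log MM score in the other); the function name is mine.

```python
import math

def max_threshold_for_tpr(scores, labels, target=0.8):
    """Largest threshold t such that predicting 'gene' whenever
    score >= t captures at least `target` of the true genes.
    Returns (t, true positives, false positives) at that threshold."""
    pos = sorted((s for s, l in zip(scores, labels) if l), reverse=True)
    need = math.ceil(target * len(pos))   # positives that must be captured
    t = pos[need - 1]                     # loosest threshold admitting them
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= t)
    fp = sum(1 for s, l in zip(scores, labels) if (not l) and s >= t)
    return t, tp, fp
```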
Can we do better by combining length and MM score? Generate a scatter-plot of Markov model score (Y-axis) vs ORF length (X-axis) for each long and each short ORF. Color points according to their status wrt "simple plus strand CDSs" from GenBank (true protein ORFs: orange; non-proteins: blue). (The coloring is not data that would be available for a novel organism, but looking at it should give you a better idea of what's happening. Also, again note that I do not care that the plots be "elegantly" generated; use a plot package or dump to a text file that Excel can import; whatever.)
Here's a pretty-much seat-of-the-pants idea that Prof. Flashbulb cooked up: Summarize the short ORFs by the single point that falls at the median x, median y of the ORFs of length < 50. Call this point A. Likewise, summarize the long ORFs by the single point that falls at the median x, median y of the ORFs of length > 1400; call this B. Overlay your plot with some visually distinct symbol at A and B, and connect them by a straight line segment. Also draw a straight line perpendicular to this line segment and crossing it at x = Ax + 0.20 * (Bx - Ax), i.e., 20% of the way from A to B. Calculate the equation of this line.
As a reminder, the equation for a line through two points (x1,y1), (x2,y2) is y=mx+b where m=(y1-y2)/(x1-x2) and b=y1-mx1. A line perpendicular to that will have slope -1/m.
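As a worked example of that geometry (my own helper; A and B are (x, y) pairs, and the returned slope/intercept describe the perpendicular decision line):

```python
def decision_line(A, B, frac=0.2):
    """Slope and intercept (m', b') of the line perpendicular to segment
    A-B, crossing it `frac` of the way from A to B."""
    (ax, ay), (bx, by) = A, B
    m = (ay - by) / (ax - bx)       # slope of the A-B segment
    px = ax + frac * (bx - ax)      # crossing point, 20% of the way
    py = ay + frac * (by - ay)      # from A to B along the segment
    m_perp = -1.0 / m               # perpendicular slope
    b_perp = py - m_perp * px
    return m_perp, b_perp
```

For instance, with A = (0, 0) and B = (10, 10), the segment has slope 1, the crossing point is (2, 2), and the perpendicular line is y = -x + 4.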
Any such line in the plane can be used to define a classifier by declaring that every ORF whose coordinates (length, MM score) places it above/to the right of the line is "a predicted gene", whereas all falling below/to the left of the line are "predicted non-genes". (Flashbulb's contorted logic as to why the particular line defined above is a good classifier is that "the perpendicular bisector of the original line would have been a good decision boundary if the two training sets had had equal variance, etc., but since the negative set had much smaller variance, and since my birthday was last week on the 20th, let's place the decision boundary 20% of the way from A to B.")
How well does this work? Find its associated True Positive and False Positive counts and rates (on the set of all ORFs, not just the short/long training set).
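Counting true and false positives against the line might be sketched as follows, assuming "above/to the right of the line" reduces to MM score greater than m*length + b (true for the negatively sloped perpendicular here; points are (length, MM score) pairs).

```python
def confusion_counts(points, labels, m, b):
    """TP, FP, FN, TN for the line classifier y = m*x + b: predict
    'gene' when an ORF's (length, MM score) point lies above the line."""
    tp = fp = fn = tn = 0
    for (x, y), is_gene in zip(points, labels):
        predicted = y > m * x + b
        if predicted and is_gene:
            tp += 1
        elif predicted:
            fp += 1
        elif is_gene:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```

The TP rate is then tp / (tp + fn) and the FP rate is fp / (fp + tn).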
Maybe Flashbulb's idea can be improved. In particular, that "20%" threshold was rather ad hoc, to say the least. Varying that "20%" threshold from minus infinity to plus infinity, i.e., sliding a line parallel to the original 20% line across the plane, will give different tradeoffs between false positives and false negatives. Add the corresponding ROC curve to the graph from step 2 (use a different color), calculate its AUC, and for an 80% true positive rate, calculate the number of false positives (as in steps 3-4) (and optionally plot this point). Again, this is based on all ORFs, not just training ORFs; the values calculated above define one point on this curve.
To (hopefully) clarify what is going on, make another scatter plot, like the one requested at the start of this step, including the A-B line segment and perpendicular line at 20% (as previously calculated, i.e., just based on the training set), but this time plot points for all ORFs, not just the training ORFs. Add thin vertical lines at x=50 and x=1400, just to clarify what is "real" but unknown, vs what was "trustworthy" based on the training setup. In your overall writeup, discuss what you learned from this and/or what might be done to improve the predictions in a scenario like this.
(There are many ways to improve this; see e.g. EC for some suggestions. You may have better ideas; you may explore them as Extra Credit. Be sure to clearly identify them in your writeup.)
Extra Credit: There are many directions you could pursue if you want some extra credit. In no particular order:
Turn in: Turn in your (commented) source code, output logs of any data requested above and a brief write-up (.pdf, .doc, .rtf or plain text) describing what you did/what you observed. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup.
Bundle everything into a single .zip or .tgz file and UPLOAD IT TO CANVAS. Make sure your name, email, and student number are included in the writeup and in comments in your code, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial. Please do not include easily rebuilt object or binary files, core dumps, etc.; they are bulky...
Please do NOT turn in the large genome files you've downloaded (.fna[.gz], .gff[.gz], etc.).
Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350. (206) 543-1695 voice, (206) 543-2969 FAX