University of Washington Computer Science & Engineering
  CSE 427Wi '21:  Assignment #4, Due: Monday, 3/8/21

Reading:

See Schedule page.

Homework:

    Prediction of Protein Coding Genes: In this assignment we will look at the basics of gene finding. This write-up is lengthy, but the basic ideas are simple. For a specific prokaryotic genome, you are going to identify all plus-strand open reading frames (ORFs), score each according to a Markov model trained on a subset of the genome, and attempt to classify all plus-strand ORFs as "real" protein-coding genes, or not, based on ORF length, MM score, or a combination of both, and then evaluate the success of the classification based on a trusted annotation of the genome obtained from Genbank. To avoid some tedious bookkeeping, we will make several simplifying assumptions. Details follow.

    In lecture, I described 2 basic ideas that are exploited in nearly all gene finding programs.

    Idea #1: look for long open reading frames (ORFs).

    Definition: An ORF is a (maximal length) sequence of consecutive nucleotide triplets, none of which is a stop codon (TAA, TAG, or TGA).
    "Maximal length" implies that the triplets immediately preceding and immediately following an ORF are stop codons (or that there is no adjacent triplet, as will happen at the beginning and end of the genome sequence). In prokaryotes (bacteria and archea), essentially every protein-coding gene falls at the end of an ORF. (As promised, we are simplifying, principly in that we aren't going to bother to look for the start codon.) ORFs are easy to find. Starting at the leftmost triplet of the genome, scan successive, adjacent, non-overlapping triplets until you find a stop; then continue until you find another stop, etc. Those are all the ORFs whose start positions have indices congruent to 1 mod 3. Then repeat the whole process from the left end again, but omitting the very first, or first two, nucleotides (to account for the other two possible reading frames, congruent to 2 mod 3 and 0 mod 3, resp.). [One should also repeat this from the right end on the Watson-Crick complement of the sequence; again in the interest of simplicity, we're going to completely ignore that.]

    The first part of this problem is to write a program to do the above process. We will use the genome sequence of Methanocaldococcus jannaschii, the same one you downloaded for HW 3. Again, ignore the two shorter plasmid sequences at the end of the .fna file, and treat nucleotide symbols other than A, C, G as T. Record the indices of the first and last nucleotide in each ORF you find. (Recall, by convention, genome indices are 1-based, not 0-based.)
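
    To make the bookkeeping concrete, here is a minimal sketch of the scan just described. It assumes the chromosome has already been read from the .fna file into an uppercase string (the name "genome", and the choice to skip zero-length ORFs between adjacent stop codons, are mine, not part of the assignment):

        # Sketch only: plus-strand ORF finding in all three reading frames.
        # Assumes `genome` is the main chromosome as an uppercase string in
        # which every symbol other than A, C, or G has been replaced by T.

        STOP_CODONS = {"TAA", "TAG", "TGA"}

        def find_plus_strand_orfs(genome):
            """Return (start, end) pairs, 1-based and inclusive, where start..end
            spans a maximal run of consecutive non-stop codons in one frame."""
            orfs = []
            n = len(genome)
            for frame in range(3):                    # reading frames offset by 0, 1, 2
                orf_start = frame                     # 0-based start of the current ORF
                i = frame
                while i + 3 <= n:
                    if genome[i:i + 3] in STOP_CODONS:
                        if i > orf_start:             # skip zero-length ORFs (a judgment call)
                            orfs.append((orf_start + 1, i))   # convert to 1-based coordinates
                        orf_start = i + 3
                    i += 3
                if i > orf_start:                     # an ORF that runs off the end of the sequence
                    orfs.append((orf_start + 1, i))
            return orfs

    With this convention an ORF's end coordinate is the last nucleotide before its stop codon, so the stop codon occupies positions end+1 through end+3; whatever convention you choose, keep track of it, since it matters when comparing against the GenBank coordinates later.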

    Idea #2: Look for ORFs having sequence statistics that are more typical of known genes than of noncoding DNA.

    Here's one simple approach, based on fixed order Markov models. A k-th order Markov model predicts successive letters in a sequence based on the previous k letters. Specifically, for k=3, say, the probability of sequence x = x1 x2 ... xn is

      P(x) = P(x1 x2 x3) * P(x4 | x1 x2 x3) * P(x5 | x2 x3 x4) * ... * P(xn | xn-3 xn-2 xn-1)
    
    (This is a "plain" Markov model, not a hidden Markov model.) Note that all but the first term in the product are conditional on just the previous k=3 letters. If you know the various probabilities, it is simple to evaluate the probability that a given sequence was generated by this model: take the first k-mer x1 x2 x3 of the sequence, and use it to look up P(x1 x2 x3); for each subsequent letter xi starting with i=4, look up its conditional probability P(xi | xi-3 xi-2 xi-1) based on the k letters preceding it in the sequence. P(x) is the product of these probabilities. (Or, more conveniently, log P(x) is the sum of their logs.)
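
    As a concrete (not required) illustration of this evaluation step, here is how it might look in code. The table names "log_init" (log probabilities of k-mers) and "log_cond" (log conditional probabilities of (k+1)-mers) are my own; one way to build them is sketched below, after the training data is described.

        # Sketch: log-probability of a sequence x under a k-th order Markov model.
        # log_init[w]     = log P(w)      for every k-mer w
        # log_cond[w + c] = log P(c | w)  for every (k+1)-mer: k-mer w followed by letter c
        # Sequences shorter than k would need separate handling (not shown).

        def markov_log_prob(x, log_init, log_cond, k):
            logp = log_init[x[:k]]                  # the unconditional first-k-mer term
            for i in range(k, len(x)):
                logp += log_cond[x[i - k:i + 1]]    # P(x_i | previous k letters)
            return logp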

    Getting back to the gene prediction problem, suppose I have a "foreground" set of probabilities like this for real genes, and a similar set of probabilities, say Q( . | . ), for "background" sequences that are not genes. Then I could predict whether a given ORF sequence x was a real gene by comparing P(x) to Q(x). E.g., using a likelihood ratio test, log(P(x)/Q(x)) > 0 implies that the sequence is more likely to have been generated by the gene model than the background model.

    Where do the probabilities P come from? Simply take a "trusted" set of example genes and count the frequencies of all (k+1)-letter (contiguous) subsequences. E.g., if k=2 and your training set is just the one sequence AAAAGCTA, then AAA occurs twice, and AAG, AGC, GCT, and CTA each occur once, so P(A | AA) = 2/3, P(G | AA) = 1/3, P(A | CT) = 1/1, etc. Estimate unconditional probabilities for k-letter sub-sequences in the same way. E.g., 3/7 of the adjacent letter pairs in the training set are AA so P(AA) = 3/7, P(AG) = 1/7, etc. Take as your trusted set all long ORFs (defined, arbitrarily, as those longer than 1400 nucleotides); most or all of them are real genes. As your background frequencies, apply the same process to the reverse complement of all long ORFs. (For simplicity, don't split the rev comp into ORFs or worry about reading frame here; just take the full rev-complement. Ignoring these issues will include some stop codons in the background training data, but I don't think that will significantly affect scoring. Use of the rev-comp as the background is justified by the thought that the majority of the genome is coding on one strand or the other, and the long minus-strand ORF of a minus-strand gene will tend to be broken into many short spurious ORFs on the plus strand, in all three reading frames, and will tend to match the sequence statistics seen in reverse complements of protein coding sequences. You might also recognize that a plus-strand gene viewed in either of the two incorrect plus-strand reading frames will also tend to be fragmented into short plus-strand ORFs; you may ignore this issue for the basic assignment, but see the extra credit section for a conceptually simple solution.)

    Two notes: First, the "natural" thing to do to estimate P(x1x2...xk) is to base it on only the first k letters of the training sequences. However, (continuing the k=2 example) with only that one training sequence, you would end up with P(AA)=1, and the probability of every other letter pair being zero. This is appropriate if you are confident that all instances start with AA, but not if the available set of training samples is not comprehensive enough to be fully representative. The suggestion above ("count k-mers throughout the training sequence(s), not just at their starts"), is basically equivalent to assuming that the starts of sequences are statistically indistinguishable from their interiors. (If we were including AUG start codons in our gene models, then starts would be different from interiors (P(AU)=1, e.g.), but since I decided to ignore starts, and there will be relatively few training sequences longer than 1400, training P(x1 x2...xk) based on full sequences rather than just their first k letters seems plausible.) My second note is that there are other strategies for dealing with sparse training data. "Pseudocounts" (discussed in lecture) are a prominent example; while there's a lot of training data in the M. jannaschii example, when k=5 there are a few k-mers with zero counts, so you should include a pseudocount of 1 everywhere. Extra credit option: explore alternative choices here.

    Calculate the P and Q tables as above, including pseudocounts, for k=5, then compute log(P(x)/Q(x)) for each ORF of length k or greater (the ORF's Markov model score; for comparability of results, use natural logs this time).
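
    Here is one way the training step might look, as a sketch under my own conventions (a pseudocount of 1 on every possible word, and conditional probabilities normalized over the four possible next letters); the helper names are hypothetical, and "markov_log_prob" is the scoring sketch given earlier.

        import math
        from itertools import product

        ALPHABET = "ACGT"
        COMPLEMENT = str.maketrans("ACGT", "TGCA")

        def reverse_complement(s):
            # assumes s contains only A, C, G, T (per the preprocessing above)
            return s.translate(COMPLEMENT)[::-1]

        def train_markov(train_seqs, k, pseudocount=1):
            """Estimate (log_init, log_cond) from k-mer and (k+1)-mer counts taken
            over all positions of the training sequences, with a pseudocount added
            to every possible word so nothing ends up with probability zero."""
            kmer_counts = {"".join(w): pseudocount for w in product(ALPHABET, repeat=k)}
            kp1_counts = {"".join(w): pseudocount for w in product(ALPHABET, repeat=k + 1)}
            for s in train_seqs:
                for i in range(len(s) - k + 1):
                    kmer_counts[s[i:i + k]] += 1
                for i in range(len(s) - k):
                    kp1_counts[s[i:i + k + 1]] += 1
            total = sum(kmer_counts.values())
            log_init = {w: math.log(c / total) for w, c in kmer_counts.items()}
            log_cond = {}
            for w, c in kp1_counts.items():
                denom = sum(kp1_counts[w[:k] + a] for a in ALPHABET)
                log_cond[w] = math.log(c / denom)
            return log_init, log_cond

        # Usage sketch (hypothetical variable names): foreground P from the long-ORF
        # sequences, background Q from their reverse complements; an ORF's score is
        # the natural-log likelihood ratio.
        # P_init, P_cond = train_markov(long_orf_seqs, k=5)
        # Q_init, Q_cond = train_markov([reverse_complement(s) for s in long_orf_seqs], k=5)
        # score = (markov_log_prob(orf_seq, P_init, P_cond, 5)
        #          - markov_log_prob(orf_seq, Q_init, Q_cond, 5))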

    Evaluation: The GenBank annotation file GCF_000091665.1_ASM9166v1_genomic.gff that you downloaded for HW 3 will also be used here as the "gold standard" against which to evaluate your gene-finder. Its format (GFF3) is documented online, but I think most of what you need to know will be evident from the following example. Specifically, it contains (among other things) the coordinates of all the presumed protein-coding genes, each identified by lines like

        NC_000909.1	Protein Homology	CDS	4252	4566	.	+	0	ID=cds-MJ_RS00015;Parent=gene-MJ_RS00015;Dbxref=GeneID:27929939;Note=frameshifted%3B incomplete%3B partial in the middle of a contig%3B missing N-terminus;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012979808.1;locus_tag=MJ_RS00015;partial=true;product=transporter;pseudo=true;start_range=.,4252;transl_table=11
    
        NC_000909.1	Protein Homology	CDS	1664008	1664862	.	-	0	ID=cds-WP_010871206.1;Parent=gene-MJ_RS08935;Dbxref=Genbank:WP_010871206.1,GeneID:1452591;Name=WP_010871206.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010871206.1;locus_tag=MJ_RS08935;product=zinc metalloprotease HtpX;protein_id=WP_010871206.1;transl_table=11
    
        ##sequence-region NC_001732.1 1 58407
    
    Fields are tab-separated. "CDS" in field 3 is shorthand for "coding sequence", and the numbers in fields 4 & 5 are its start and end positions, up to and including the last nucleotide of the stop codon following the ORF. The numbering scheme counts the first position of the genome as 1, not zero. The +/- in field 7 indicates strand; you may ignore minus-strand genes, and all the other goo in the file. [For precision below, let me call these non-ignored entries "simple plus strand CDSs."] Again, process only the part of the file corresponding to the long chromosome, which ends at approximately line 3700; for reference, the last two lines in the example above are the end of the long chromosome and the start of the first plasmid. (If it helps in your file parsing, there are about a dozen "comment" lines starting with "#" at the front of the file, but I think there are none between there and the "##" line shown above. Entries appear to be sorted by start coordinate. Scattershot testing suggests that this unix pipe:
        head -n 3700 GCF_000091665.1_ASM9166v1_genomic.gff | grep 'Protein Homology.CDS.[0-9]*.[0-9]*.[.].[+]' > plusgenes-subset.gff
    
    will extract the desired plus-strand genes, although it does discard some context relevant to a few oddball cases. FYI, the 5 unbracketed dots in the grep pattern each match a single tab character; it would be better to replace each by the two character escape sequence "backslash-t", but it is tricky to put that into this web page in a way that I can be sure will be correctly rendered in your browser...)
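
    If you would rather not depend on the grep pipe (or want to cross-check it), a direct parse of the .gff file is straightforward. The sketch below keys on the chromosome's sequence ID instead of "head -n 3700" and follows the field layout shown in the examples above; requiring "Protein Homology" in field 2, mirroring the grep pattern, is my own choice and worth double-checking against the oddball cases just mentioned.

        def simple_plus_cds_ends(gff_path, chrom="NC_000909.1"):
            """Collect the end coordinates (field 5) of plus-strand CDS entries
            on the main chromosome, per the tab-separated fields shown above."""
            ends = set()
            with open(gff_path) as gff:
                for line in gff:
                    if line.startswith("#"):
                        continue                      # header / pragma lines
                    fields = line.rstrip("\n").split("\t")
                    if len(fields) < 8:
                        continue
                    seqid, source, ftype, start, end, score, strand = fields[:7]
                    if (seqid == chrom and source == "Protein Homology"
                            and ftype == "CDS" and strand == "+"):
                        ends.add(int(end))
            return ends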

    You will use the .gff file to evaluate the quality of your predictions, but keep in mind that the real goal is to envisage procedures that would help annotate a new genome, one for which no .gff file exists. I.e., can we come up with a procedure that would more or less automatically produce reasonably accurate identification of the protein-coding genes in a newly sequenced organism?

    Procedurally, I suggest you build a table containing, for each ORF (there are around 10^5 of them), a summary consisting of its starting/ending coordinates, length and Markov model score, together with its status (is/is not among the "simple plus strand CDSs" in the .gff file).

    Perhaps the simplest plausible gene prediction approach is to guess that every ORF longer than some fixed threshold is a gene. How good is this? What threshold? As detailed below, you will use your ORF finder to examine these questions, by comparing ORFs to the GenBank annotations. Correctly predicting the exact starting positions of genes is difficult, but matching the other end is easier - it will be a stop codon. Again to (over-) simplify the problem, we will declare an ORF to be a successful gene prediction if its stop codon exactly matches the position of a stop codon in one of the "simple plus strand CDSs" in the .gff file. Except in rare cases, which we will again ignore, each such ORF will be at least as long as the real gene.
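
    Under these definitions the test reduces to one integer comparison per ORF. A minimal sketch, assuming the ORF-coordinate convention used earlier (the ORF end excludes the stop codon, so the stop ends at end + 3); if your ORF coordinates already include the stop codon, drop the offset:

        def is_correct_prediction(orf_end, cds_ends):
            """True if the stop codon terminating this ORF ends exactly where
            some simple plus-strand CDS ends; orf_end excludes the stop codon,
            so the stop occupies positions orf_end+1 .. orf_end+3."""
            return (orf_end + 3) in cds_ends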

    I think you will see that this simple method is quite accurate for long genes (given my simplified definitions, it cannot fail to find every gene, and above some length threshold it rarely makes false predictions). For shorter genes, however, its rate of false predictions rises. Can we do better using the Markov model? You may be disappointed to find that Markov model scores alone are not better, and the rest of the assignment explores comparing and combining the two.

    Outputs: So much data, so little time. What to do with it, how to summarize it?

    1. This sort of analysis has an unusually large number of opportunities for "off-by-one" and other subtle errors. To help with debugging and grading, calculate, print, and turn in the following simple diagnostic parameters. You may post (a)-(f) on the course discussion board so that you can help each other get past these potential nuisance bugs. For the following, I (arbitrarily) define "short ORFs" to be those of length less than 50, and "long ORFs" to be those of length greater than 1400. Print:
      a. for each of the three reading frames, the number of ORFs you find, and the summary of the first and last of each,
      b. the total number of short ORFs (length less than 50),
      c. the total number of long ORFs (length greater than 1400),
      d. the total number of "simple plus strand CDSs" you found in GenBank,
      e. a sample of the counts you accumulate that factor into calculating the conditional probabilities; specifically (and very arbitrarily) print P(T | AAGxy), Q(T | AAGxy) for the 16 possible combinations of x,y in A,C,G,T. (To be clear, AAG are the leftmost three of the five nucleotides preceding the putative "T".) For consistency, please print this as a 4 x 4 table with rows for x = A, C, G, T in that order, and columns for y in the same order, and
      f. your summary data for the first 5 short ORFs and the first 5 long ORFs. "First 5" means "the 5 with the smallest start coordinate." Again, your "summary" of each ORF should include its start coordinate, end coordinate, length, log-base-e Markov model score and a flag indicating whether this ORF was/was not among the "simple plus strand CDSs" in GenBank.
    2. Generate a single plot showing ROC curves with respect to (a) length threshold, say in red, and (b) Markov model score, say in green, using the full 0-1 range for both axes. Additionally, (c) generate such a plot "zoomed-in" to the upper-left corner to show the crossover between the two curves. For maximum clarity, you probably do not want the two axes to be equally scaled in the "zoomed" view. (Step (c) is important since there are many, many more short ORFs (mostly non-genes) than long ones (mostly real genes), which drives the ROC curves into the upper-left corner.) Also calculate and show Area Under the Curve (AUC) for each curve.
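
      If your plotting tool does not compute ROC curves for you, they (and the AUC) can be computed directly from (score, label) pairs. The sketch below sweeps the threshold from high to low and, as a simplification, steps one ORF at a time; with many tied scores (e.g., lengths) you may prefer to advance all ties together, which changes the curve only slightly.

        def roc_curve(scores, labels):
            """scores: ORF lengths or Markov model scores; labels: list of booleans,
            True if the ORF matches a simple plus-strand CDS. Returns (fpr, tpr)
            lists traced out as the score threshold decreases."""
            ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
            n_pos = sum(labels)
            n_neg = len(labels) - n_pos
            tp = fp = 0
            fpr, tpr = [0.0], [0.0]
            for _, is_gene in ranked:
                if is_gene:
                    tp += 1
                else:
                    fp += 1
                fpr.append(fp / n_neg)
                tpr.append(tp / n_pos)
            return fpr, tpr

        def auc(fpr, tpr):
            """Area under the ROC curve by the trapezoid rule."""
            return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
                       for i in range(len(fpr) - 1))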

    3. If your only option was to predict based on an ORF length threshold, what is the maximum threshold that would achieve a true positive rate of at least 80%, and how many true positives and false positives would you see using that threshold? Optionally, plot this point on the ROC curve above.

    4. If your only option was to predict based on a log Markov model score threshold, what is the maximum threshold that would achieve a true positive rate of at least 80%, and how many true positives and false positives would you see using that threshold? Optionally, plot this point on the ROC curve above.
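
      For both this item and the previous one, the requested threshold can be read directly off the sorted scores of the true genes. A small sketch, assuming the prediction rule is "score >= threshold" and using hypothetical variable names:

        import math

        def max_threshold_for_tpr(scores, labels, target_tpr=0.80):
            """Largest threshold t such that predicting 'gene' whenever score >= t
            still achieves a true positive rate of at least target_tpr.
            Returns (t, true_positive_count, false_positive_count) at that t."""
            gene_scores = sorted((s for s, g in zip(scores, labels) if g), reverse=True)
            need = math.ceil(target_tpr * len(gene_scores))   # genes that must be recovered
            t = gene_scores[need - 1]                          # the need-th highest gene score
            tp = sum(1 for s, g in zip(scores, labels) if g and s >= t)
            fp = sum(1 for s, g in zip(scores, labels) if not g and s >= t)
            return t, tp, fp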

    5. Can we do better by combining length and MM score? Generate a scatter-plot of Markov model score (Y-axis) vs ORF length (X-axis) for each long and each short ORF. Color points according to their status wrt "simple plus strand CDSs" from GenBank (true protein ORFs: orange; non-proteins: blue). (The coloring is not data that would be available for a novel organism, but looking at it should give you a better idea of what's happening. Also, again note that I do not care that the plots be "elegantly" generated; use a plot package or dump to a text file that Excel can import; whatever.)

      Here's a pretty-much seat-of-the-pants idea that Prof. Flashbulb cooked up: Summarize the short ORFs by the single point that falls at the median x, median y of the ORFs of length < 50. Call this point A. Likewise, summarize the long ORFs by the single point that falls at the median x, median y of the ORFs of length > 1400; call this B. Overlay your plot with some visually distinct symbol at A and B, and connect them by a straight line segment. Also draw a straight line perpendicular to this line segment and crossing it at x = Ax + 0.20 * (Bx - Ax), i.e., 20% of the way from A to B. Calculate the equation of this line.

      As a reminder, the equation for a line through two points (x1,y1), (x2,y2) is y=mx+b where m=(y1-y2)/(x1-x2) and b=y1-mx1. A line perpendicular to that will have slope -1/m.

      Any such line in the plane can be used to define a classifier by declaring that every ORF whose coordinates (length, MM score) places it above/to the right of the line is "a predicted gene", whereas all falling below/to the left of the line are "predicted non-genes". (Flashbulb's contorted logic as to why the particular line defined above is a good classifier is that "the perpendicular bisector of the original line would have been a good decision boundary if the two training sets had had equal variance, etc., but since the negative set had much smaller variance, and since my birthday was last week on the 20th, let's place the decision boundary 20% of the way from A to B.")
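
      To make the geometry concrete: the perpendicular line at "20% of the way from A to B" classifies a point as a gene exactly when its projection onto the A-to-B direction, scaled so that A maps to 0 and B maps to 1, exceeds 0.20. A sketch under that view (the function names are mine, and the points are assumed to be (length, MM score) tuples):

        from statistics import median

        def median_point(points):
            """Coordinate-wise median of a list of (length, mm_score) points."""
            return (median(p[0] for p in points), median(p[1] for p in points))

        def ab_projection(point, A, B):
            """Projection of point onto the A->B direction, scaled so that A maps
            to 0 and B maps to 1. Flashbulb's rule predicts 'gene' when this value
            exceeds 0.20; sweeping the cutoff traces out the combined ROC curve
            asked for below."""
            (ax, ay), (bx, by) = A, B
            dx, dy = bx - ax, by - ay
            return ((point[0] - ax) * dx + (point[1] - ay) * dy) / (dx * dx + dy * dy)

        # Usage sketch (hypothetical variable names):
        # A = median_point(short_orf_points)   # ORFs of length < 50
        # B = median_point(long_orf_points)    # ORFs of length > 1400
        # predicted_gene = [ab_projection(p, A, B) > 0.20 for p in all_orf_points]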

      How well does this work? Find its associated True Positive and False Positive counts and rates (on the set of all ORFs, not just the short/long training set).

      Maybe Flashbulb's idea can be improved. In particular, that "20%" threshold was rather ad hoc, to say the least. Varying that "20%" threshold from minus infinity to plus infinity, i.e., sliding a line parallel to the original 20% line across the plane, will give different tradeoffs between false positives and false negatives. Add the corresponding ROC curve to the graph from step 2 (use a different color), calculate its AUC, and for an 80% true positive rate, calculate the number of false positives (as in steps 3-4) (and optionally plot this point). Again, this is based on all ORFs, not just training ORFs; the values calculated above define one point on this curve.

      To (hopefully) clarify what is going on, make another scatter plot, like the one requested at the start of this step, including the A-B line segment and perpendicular line at 20% (as previously calculated, i.e., just based on the training set), but this time plot points for all ORFs, not just the training ORFs. Add thin vertical lines at x=50 and x=1400, just to clarify what is "real" but unknown, vs what was "trustworthy" based on the training setup. In your overall writeup, discuss what you learned from this and/or what might be done to improve the predictions in a scenario like this.

      (There are many ways to improve this; see e.g. EC for some suggestions. You may have better ideas; you may explore them as Extra Credit. Be sure to clearly identify them in your writeup.)

    Extra Credit: There are many directions you could pursue if you want some extra credit. In no particular order:

    1. As noted earlier, the long ORF of a (plus-strand) protein coding gene will tend to be broken into many short ORFs when viewed in either of the wrong (plus-strand) reading frames, but they will actually tend to score reasonably well against our Markov model and hence tend to contribute to our false positive rate. A (conceptually) simple step to eliminate this problem would be to discard any ORF that is wholly contained within another ORF, under the assumption that "longer is better". This would probably reduce the number of false positives (although perhaps adding some false negatives). Explore this. Short ORFs overlapping either end of a long ORF may also be problematic, but it's less clear how to handle those cases. What are your thoughts? Explore this too if you'd like.
    2. Most compellingly, are there better ways than Flashbulb's suggestion to combine ORF length and Markov model score to increase accuracy, especially for shorter ORFs? (Keep in mind that you should not assume that GenBank annotations are available for prediction; you're trying to invent ways to create such annotations...)
    3. Explore alternative ways of handling the initial k-mer and/or pseudocounting, as mentioned above.
    4. Handle the reverse strand, too.
    5. I think you'll find a few very long ORFs that don't exactly correspond to the Genbank annotations, and other anomalies; what's up with that?
    6. Maybe a 3-class model would be better: gene/reverse-complement of a gene/other.
    7. Explore how performance depends on larger and/or smaller k.
    8. Would an inhomogeneous Markov model be better, i.e., one that learned/used different parameters for each of the three offsets within a codon?
    9. Could this approach be extended to make better predictions of the start codon, too? (Most or all genes in M. jannaschii start with ATG, but there can be multiple ATG's in an ORF; which is the right one?)
    10. The 1400 nucleotide ORF length threshold I suggested for training the Markov model is somewhat arbitrary, designed to get a high-confidence training set. Plausibly, we could iterate this, relearning the Markov model based on additional shorter, but still high-confidence, hits from the first pass. Does this change things enough to be valuable?
    11. Continuing that, is there a big danger that you'd actually worsen the model due to inclusion of more false positives? Explore these ideas, keeping in mind that in the "real" application, we wouldn't have a "gold standard" .gff file available for consultation.
    12. Would it be better to train P and Q based on genes from a completely different but well-annotated genome, say that of E. coli; i.e., use all genes and non-genes from its .gff file as training data?
    13. I'm open to other ideas, too; send me email.

    Code: As before, you may use C, C++, C#, Java, Julia, Python, R. It's ok to make your code relatively special-purpose, e.g., array sizes and initial parameters as compiled-in constants, minimal error checking, etc. You may want to use a more graphics-friendly language for the plots, e.g. R or Matlab, but Excel is probably also sufficient (via exporting a tab-separated data file or such).

    Turn in: Turn in your (commented) source code, output logs of any data requested above and a brief write-up (.pdf, .doc, .rtf or plain text) describing what you did/what you observed. If you did any of the extra credit pieces, clearly indicate this at the front of your writeup.

    Bundle everything into a single .zip or .tgz file and UPLOAD IT TO CANVAS. Make sure your name, email, and student number are included in the writeup and in comments in your code, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial. Please do not include easily rebuilt object or binary files, core dumps, etc.; they are bulky...

    Please do NOT turn in the large genome files you've downloaded (.fna[.gz], .gff[.gz], etc.).

