University of Washington Computer Science & Engineering
  CSE 427Wi '21:  Assignment #2, Due: Monday, 2/8/21

Homework:

    The Goal: The goal of this assignment is to investigate a MEME-like algorithm for sequence motif discovery. In brief, you will be given a list of (unaligned) sequences of varying lengths, most or all of which contain one instance of a "motif" (i.e., a short, approximately repeated subsequence), and the goal is to discover it, to build a weight matrix model (WMM, aka PWM or PSSM) capturing the pattern, and to use it to search for other instances of the motif in additional sequences. For scale, think of your input as consisting of at most a few hundred sequences, each at most a few hundred nucleotides in length, and the motif is 10-20 nucleotides long.

    Terminology: to be precise, in this assignment I will use the following terminology.

    A background frequency vector is a vector containing four non-negative real numbers summing to one. This is to be interpreted as the average or "background" frequencies of the 4 nucleotides in the genome(s) of interest. Indices 1, 2, 3, 4 should correspond to nucleotides A, C, G, T in that order. In this assignment, we will use the vector (0.25, 0.25, 0.25, 0.25), but, for example, the vector (0.2, 0.3, 0.3, 0.2) might be an appropriate background vector for motif search in the M. jannaschii genome mentioned in lecture.

    A pseudocount vector is also a length 4 vector of non-negative reals (with unconstrained sum), again ordered A, C, G, T. (Examples in lecture assumed the same pseudocount, typically 1, was used everywhere, but allowing the pseudocounts to differ depending on nucleotide allows a little more flexibility, especially appropriate when the background vector is not uniform.)

    Given a list M containing n sequences, each of length k that are all presumed to be instances of one motif, the count matrix representing it is a 4 by k table whose i, j entry (1 ≤ i ≤ 4, 1 ≤ j ≤ k) is the count of the number of times nucleotide i appears in position j in a string in M. Again, rows 1, 2, 3, 4 should correspond to nucleotides A, C, G, T in that order. Note that each column of such a count table should sum to the same number; before pseudocounting, this number will be n, the number of sequences in M; after pseudocounting, it will be n+p, where p is the sum of the pseudocount vector.

    A frequency matrix is another 4 by k table, giving the fraction of each nucleotide at each position, i.e., the result of dividing each entry in a count matrix by the sum of the corresponding column. (Depending on context, "counts" might be before or after adding pseudocounts. Column sums in the frequency matrix should be 1.0 in either case.)

    The corresponding weight matrix (WMM) is another 4 by k table whose i,j entry is the base 2 logarithm of the ratio of the frequency matrix entry to the background frequency of that nucleotide. (For simplicity in my slides, I scaled these log ratios by a factor of 10 and rounded them to integers. Do not do this in your program.)
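    The count → frequency → weight-matrix pipeline defined above can be sketched as follows. This is only an illustration in Python, not a required interface; the helper names and list-of-lists representation are my own choices:

    ```python
    import math

    NUCS = "ACGT"  # row order: A, C, G, T, as specified above

    def make_count_matrix(seqs):
        """4 x k counts of each nucleotide at each position."""
        k = len(seqs[0])
        counts = [[0] * k for _ in range(4)]
        for s in seqs:
            for j, c in enumerate(s.upper()):
                counts[NUCS.index(c)][j] += 1
        return counts

    def add_pseudo(counts, pseudo):
        """Add pseudocount pseudo[i] to every entry in row i."""
        return [[c + pseudo[i] for c in row] for i, row in enumerate(counts)]

    def make_frequency_matrix(counts):
        """Divide each entry by its column sum; columns then sum to 1.0."""
        k = len(counts[0])
        col_sums = [sum(counts[i][j] for i in range(4)) for j in range(k)]
        return [[counts[i][j] / col_sums[j] for j in range(k)]
                for i in range(4)]

    def make_wmm(freqs, background):
        """Entry (i, j) is log2(frequency / background); -inf for zero cells."""
        return [[math.log2(f / background[i]) if f > 0 else -math.inf
                 for f in row] for i, row in enumerate(freqs)]

    def entropy(freqs, wmm):
        """Relative entropy (bits) vs background: sum of f * log2(f/b)."""
        return sum(f * w for frow, wrow in zip(freqs, wmm)
                   for f, w in zip(frow, wrow) if f > 0)
    ```

    Note that after pseudocounting no cell is zero, so the -inf guard in make_wmm only matters if you build a WMM from raw counts.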

    Modules: Your code should contain (at least) the following 8 subroutines.

    1. makeCountMatrix: build a count matrix corresponding to a given list of length k sequences.
    2. addPseudo: given a count matrix, and a pseudocount vector, build a new count matrix by adding them.
    3. makeFrequencyMatrix: make a frequency matrix from a count matrix.
    4. entropy: calculate the entropy of a frequency matrix relative to background.
    5. makeWMM: make a weight matrix from a frequency matrix and background vector. (It may be convenient to save the entropy with the WMM, too.)
    6. scanWMM: given a WMM of width k and one or more sequences of varying lengths ≥ k, scan/score each position of each sequence (excluding those < k from the rightmost end).
    7. Estep: given an (estimated) WMM and a set of sequences, run the E-step of MEME's EM algorithm; i.e., what is E[zij], where zij is the zero-one variable indicating whether the motif instance in sequence i begins in position j. (scanWMM does most of the required work.)
    8. Mstep: given the Estep result, re-estimate the WMM. (This is similar to makeCountMatrix / addPseudo / makeFrequencyMatrix / makeWMM, with the additional wrinkle that the k-mer inputs to makeCountMatrix each have weights, and it may be convenient to generalize makeCountMatrix so that its k-mer inputs are successive, overlapping subsequences of some longer strings, rather than having them explicitly presented as distinct, non-overlapping inputs.)
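    To make modules 6 and 7 concrete, here is one way scanWMM and the E-step could look. This is a sketch under my own assumptions (log2 scores from the WMM, a uniform prior over start positions, and the 4 x k list-of-lists WMM layout from the definitions above), not the required interface:

    ```python
    NUCS = "ACGT"

    def scan_wmm(wmm, seq):
        """Score every length-k window of seq (left ends 0..len(seq)-k)."""
        k = len(wmm[0])
        return [sum(wmm[NUCS.index(seq[j + p].upper())][p] for p in range(k))
                for j in range(len(seq) - k + 1)]

    def e_step(wmm, seqs):
        """E[z_ij]: within each sequence, normalize 2^score over all
        candidate start positions (uniform prior on the start)."""
        z = []
        for s in seqs:
            weights = [2.0 ** sc for sc in scan_wmm(wmm, s)]
            total = sum(weights)
            z.append([w / total for w in weights])
        return z
    ```

    The M-step then uses these z values as fractional weights when re-counting the overlapping k-mers, exactly as item 8 describes.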

    Inputs: Your key inputs will be FASTA files, each containing several sequences. File format is as in assignment 1, except that it's DNA, not protein. Both upper- and lower-case ACGT should be allowed, and treated equivalently. (Other non-whitespace should be replaced by T. This is not what I'd do in production code, but simple.) I will provide four specific .fasta files ( hw2-debug-train.fasta, hw2-debug-eval.fasta, hw2-train.fasta, hw2-eval.fasta ) for your use. I strongly suggest that you do your initial debugging and testing on files that you make up yourself. "Plant" a motif of your own choosing amidst random flanking sequence of 50-100 bases in a total of 2-10 sequences. Start with short ones with near-identical motifs, and gradually make it more difficult. Train (see below) on that and evaluate (see below) on the same or an additional set of synthetic sequences. Once you are satisfied with your performance on that, try it on my "debug" data sets, which will have about 10 sequences. To promote "crowd-sourced debugging," you may post results on my "debug" sets on the course discussion board. (Just for fun, include your choice of programming language, approximate code length, and runtime on this small example.) Finally, train your motif model (i.e., run your motif discovery algorithm) on my "training" data set, and evaluate the result on my "eval" data set, as described below.

    Assume a motif width of k=10, and use (0.25, 0.25, 0.25, 0.25) as your background and pseudocount vectors (except as noted under "Initialization").

    EM Initialization: "Seed" your algorithm by creating a count matrix that counts only the k-mer in the k-base window beginning at the first position of the first sequence of your input (i.e., your own input set, or my training sets). Add pseudocounts so that the seed sequence accounts for 85% of the count in each column of the count matrix. Build frequency and weight matrices from this. Repeat this after sliding the seed window k/2 positions to the right, then another k/2 positions, etc., stopping when the window abuts the end of the first input sequence (a total of S = ⌈(length(sequence1) - k + k/2) / (k/2)⌉ seed WMMs, if I did the math right). E.g., if k=10 and the first input has length 23, you will consider the S=4 seeds at positions 1..10, 6..15, 11..20, and 14..23.
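    The seed-window start positions described above can be computed with a few lines; this sketch (my own helper, 1-based positions) reproduces the k=10, length-23 example, giving starts 1, 6, 11, 14:

    ```python
    def seed_starts(seq_len, k):
        """1-based seed-window starts: slide by k/2, then add one final
        window abutting the end of the sequence if not already reached."""
        step = k // 2
        starts = list(range(1, seq_len - k + 2, step))
        last = seq_len - k + 1
        if starts[-1] != last:
            starts.append(last)
        return starts
    ```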

    EM Iteration: For each of the S seed WMMs defined in "Initialization," do three E-step/M-step pairs. Among the resulting S WMMs, select three WMMs: the ones attaining the highest entropy (relative to background), the median entropy, and the lowest entropy. Call these WMMs A, B and C, respectively. Run an additional 7 E-step/M-step pairs on all S of the third-round WMMs (a total of 10 E-M rounds on each of the S seed WMMs). Select as your final "motif" the best (highest entropy) of the S candidates after round 10; call this WMM D.

    As a simple descriptive summary of this process, print in a tidy-ish S row by 11 column table the entropies of each seed WMM and its 10 successive E-M-refined iterates. Also print the frequency matrices for WMMs A, B, C, and D.

    Evaluation: You will evaluate each of the WMMs A, B, C, and D in the following way. (Loosely, the goal is to see whether higher entropy and/or more EM iterations correlates with better motif quality.) My debug-eval dataset contains 10 sequences of length 113, each of which contains a motif instance of length 13 starting in position 51. The full eval data set is similar, but with more sequences. (Both training data sets also contain length 13 motifs, but are more variable as to sequence length and motif position.) To use the eval sets as a "gold standard" against which to evaluate a motif, you scan each sequence with your motif WMM, scoring each of the 113-k+1 potential start positions. Ideally, you would want the exact motif position (51) to get a great score, and all other positions to score poorly. There are two issues with this idealization. First, especially since I chose k ≠ 13, the motif your algorithm chose cannot exactly align with the gold standard 13 base motif instance in the eval file; e.g., its best match to the eval sequences might start a few bases earlier or later than position 51, and similarly it may not end at position 63. Second, we need to define an explicit threshold τ such that scores ≥ τ are declared to be "positives" and those scoring < τ are declared to be "negatives".

    To tackle the first issue above, we will adopt the following simple strategy. Let c(j) be the count of the number of eval sequences whose highest WMM score (leftmost highest, if ties) occurred at position j . Plot this histogram. Additionally, calculate and print m = arg max { c(j) | 1 ≤ j ≤ 113-k+1 }. I.e., m is the most common location of the best motif hit in each sequence; equivalently, it is the peak in the c(j) histogram. If things are working correctly, most of the best motif hits should overlap the gold-standard location by at least k/2 positions, i.e., (51-k/2) ≤ m ≤ (63-k/2). (If you do not find this to be true, then one of us has a bug...)
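    The histogram c(j) and its peak m can be computed from the per-sequence score lists produced by scanWMM. A sketch (my own helper name; positions are 1-based, and ties broken leftmost as specified above):

    ```python
    from collections import Counter

    def peak_position(score_lists):
        """Given one score list per eval sequence, return (c, m):
        c(j) counts sequences whose leftmost-best score is at position j
        (1-based); m is the peak of c, leftmost on ties."""
        c = Counter()
        for scores in score_lists:
            # list.index returns the leftmost occurrence of the maximum
            c[scores.index(max(scores)) + 1] += 1
        # iterate positions in ascending order; max keeps the first
        # (i.e., leftmost) position attaining the largest count
        m = max(sorted(c), key=lambda j: c[j])
        return c, m
    ```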

    To tackle (somewhat obliquely) the second issue above: given a fixed threshold τ, a motif match to an eval (sub-)sequence will be declared to be a "positive" if it scores ≥ τ; furthermore, consider it to be a "true positive" if it starts at position m, otherwise it is a "false positive." Correspondingly, all scores less than τ will be considered to be "negative", with those at m being "false negatives" and all others being "true negatives." There typically is no single "best" answer for choosing a score threshold; as τ varies, you should expect to see a tradeoff between false positives and false negatives. This is often depicted by drawing a so-called Receiver Operating Characteristic (ROC) curve, which plots True Positive Rate (TPR) vs False Positive Rate (FPR). See https://en.wikipedia.org/wiki/Receiver_operating_characteristic for definitions and examples. Also read its discussion of the Area Under the Curve (AUC), a common figure of merit for a binary classifier. (This blog post and its predecessor also give good explanations and some concrete code suggestions--in R, as it happens, but you can easily understand the bulk of it even if you don't know R. They are especially nice at explaining why tied scores generate diagonal lines.)

    Generate an ROC plot for your motif, and calculate AUC. To do this for a test set containing n sequences, you will have n * (113 - k + 1) WMM scores, n of them labeled True, and the rest False. After sorting this by score, it is easy to calculate TPR(τ) and FPR(τ) for τ equal to any observed score (which is all you need; if τ is between two distinct observed scores, then TPR and FPR are the same as they are when τ equals one of the adjacent observed scores; i.e., ignoring tied scores, the observed ROC curve is the step function traced out as τ varies across the observed scores). Additionally, as one concrete (but quite arbitrary) point on the curve for WMM C, what is the largest τ recovering all True Positives, how many False Positives, True Negatives and False Negatives are found with the same τ and what are the corresponding TPR and FPR values?
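    The sort-then-sweep computation just described can be sketched as follows. Assumptions of mine: input is a list of (score, label) pairs with one True per sequence (at position m); equal scores are processed as one group, which is what produces the diagonal segments mentioned above; and there is at least one True and one False label:

    ```python
    def roc_auc(scored):
        """scored: list of (score, is_true) pairs. Returns (points, auc),
        where points are (FPR, TPR) pairs as τ sweeps from high to low and
        auc is the trapezoidal area under that curve."""
        scored = sorted(scored, key=lambda t: -t[0])
        P = sum(1 for _, lab in scored if lab)
        N = len(scored) - P
        points, tp, fp, auc = [(0.0, 0.0)], 0, 0, 0.0
        i = 0
        while i < len(scored):
            j = i
            # advance through the whole run of tied scores at once
            while j < len(scored) and scored[j][0] == scored[i][0]:
                tp += scored[j][1]
                fp += not scored[j][1]
                j += 1
            prev_fpr, prev_tpr = points[-1]
            fpr, tpr = fp / N, tp / P
            auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid rule
            points.append((fpr, tpr))
            i = j
        return points, auc
    ```

    Plotting the returned points (with your plotting tool of choice) gives the ROC curve directly; a perfect separator yields AUC 1.0 and an all-tied score list yields the chance diagonal, AUC 0.5.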

    To summarize, for the train/eval data set pair you should generate 4 histograms, 4 AUC numbers, and 1 ROC plot with 4 curves on it, and the specific values for the one τ point for WMM C. (Use 4 colors or line styles to distinguish the 4 curves.) You do not need to turn in the corresponding plots for the "debug" data sets, but do turn in the plots for the larger eval data sets.

    Extra Credit: There are many additional (largely independent) directions for extra credit. As usual, pursue those that interest you, and talk to the TAs or me if you have other ideas. In all cases, summarize your comparison(s) and describe your evaluation methodology, including finding/generating appropriate test data.

    1. If you rerun the basic assignment assuming an inappropriate background model, e.g., the M. jannaschii-like (0.2, 0.3, 0.3, 0.2) background vector, I expect results to worsen. Do they? How much? (Note that the E-Step formulas given on the lecture slides assumed uniform background and need to be modified slightly to correctly handle the non-uniform case; pay special attention to the effect of non-motif sequence "flanking" the motif. Also reconsider conversion of scores to zij probabilities.)

    2. For our mouse data, is there a more realistic background than the uniform 25% model? How would you find it? Does it work better?

    3. How much effect does the pseudocount vector have? Does it matter more or less on small training sets than large ones?

    4. Find a program or web tool (or implement your own) for generating "Sequence Logos", and use it to visualize each of the frequency matrices you report.

    5. The "slide your seed window 5 positions" is basically a performance hack -- it will be 5 times faster than "slide 1 position", but it might miss things. Try "slide 1" instead. Is it better? In what way?

    6. In any case, but especially with something like the "slide 5" approach, you might settle on a motif that is offset a few positions from the optimum, and/or is too wide or too narrow. Investigate approaches to ameliorate these problems.

    7. Another performance hack (similar to the real MEME package) that you might consider -- select only a few, say 3, of the most promising candidates at step 3, then run them for another 7 E-M rounds (or until some more formal convergence criterion is satisfied), as opposed to iterating on all S candidates. How much faster is this? Does it consistently find a good final motif?

    8. I have mentioned before that "replace non-ACGT by T", while simple, is less than ideal. Here is a better idea: In the IUPAC nucleotide code (google it) the letter "R", for example, stands for "puRine", i.e., either A or G, while "N" stands for "aNy" nucleotide. When the input contains one of these letters, replace it by a randomly selected letter from the allowed set. Even better, in applications where this is clearly definable, treat it as a mixture of the possibilities; e.g., when frequency counting an R, add 0.5 to both A and G, and when scoring a WMM vs a sequence containing R, average the A and G scores. You could explore these options by replacing, say, 10% of letters in this assignment's input file by R (or a mixture of IUPAC codes), then rerun your plain algorithm where they all become T, and compare that to a version of your algorithm modified to use the idea given here.
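    The fractional-counting idea in item 8 could be sketched like this (a hypothetical helper of mine, with the IUPAC map abbreviated to a few codes; the full code has 15 letters):

    ```python
    IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "AG", "Y": "CT", "N": "ACGT"}  # abbreviated map

    def add_fractional_counts(counts, kmer):
        """Add one k-mer into a 4 x k count matrix (rows A, C, G, T),
        splitting each ambiguity code evenly among its allowed bases."""
        nucs = "ACGT"
        for j, c in enumerate(kmer.upper()):
            # unknown letters fall back to T, as in the base assignment
            options = IUPAC.get(c, "T")
            for o in options:
                counts[nucs.index(o)][j] += 1.0 / len(options)
    ```

    Scoring against a WMM would average analogously: the score of an R at position j is the mean of the A and G entries in column j.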

    9. In my solution to this assignment, several starting seeds yielded motifs that were (by eye) "similar" to what I believe to be the "best" answer. How might you quantify "similarity"/"dis-similarity" between WMMs? Especially if one motif is offset a few positions with respect to the other? Can you automate such comparisons?

    10. The assignment basically ignores termination of the EM algorithm for this problem. Investigate approaches to automating this. Are the solutions found by your algorithm (as defined by the basic assignment) "converged," or will they change, perhaps slowly but substantially, if more formal convergence criteria are added?

    11. Consider a short input sequence, vs a much longer one, especially when neither contains an especially strong match to the consensus. E.g., in the extreme where the short one is exactly as long as the presumed motif length, the whole sequence would be taken as the motif instance with probability 1, no matter how poorly it matches--perhaps a bad idea. Are there better ways to handle this?

    12. Modify the learning algorithm to allow some prior probability, perhaps length-dependent (see above), that the sequence does not contain a motif instance. How would the algorithm change? How might you set this prior probability? Does it perform better? How would you evaluate that?

    Code: As before, you may use C, C++, C#, Java, Julia, Python, R.

    Deliverables: Your code, and a writeup (.pdf, .doc, or .rtf) containing descriptions of any extra credit steps you tried, and the results requested above (when running on my train/eval datasets, not the debug sets).

    Turn-in: Bundle everything into a single .zip or .tgz file and UPLOAD IT TO CANVAS. Make sure your name, email, and student number are included in the writeup and in comments in your code, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial. Please do not include easily rebuilt object or binary files, core dumps, etc.; they are bulky...

    Keep your turn ins to less than about 1 megabyte. In particular, do NOT turn in the large FASTA files you've downloaded or built, but do tell us what your program needs to run.
