University of Washington Computer Science & Engineering
  CSE 427Au '21:  Assignment #2 (rev. a), Due: Friday, 11/5/21

Homework:

    The Goal: The goal of this assignment is to investigate a MEME-like algorithm for sequence motif discovery. In brief, you will be given a list of (unaligned) sequences, most or all of which contain one instance of a "motif" (i.e., a short, approximately repeated subsequence), and the goal is to discover it, to build a weight matrix model (WMM, aka PWM or PSSM) capturing the pattern, and to use it to search for other instances of the motif in additional sequences. For scale, think of your input as consisting of 10-1000 sequences, each about 100 nucleotides in length, with the motif 10-20 nucleotides long.

    Terminology: To be precise, in this assignment I will use the following terms.

    The background frequency of a nucleotide is its assumed average frequency outside of the motif. For simplicity, in this assignment, we will use 0.25 as the background frequency of each nucleotide.

    A pseudocount vector is a length 4 vector of non-negative reals (with unconstrained sum), ordered A, C, G, T. (Examples in lecture assumed the same pseudocount, typically 1, was used everywhere, but allowing the pseudocounts to differ depending on nucleotide allows more flexibility.)

    Given a list M containing n sequences, each of length k, that are all presumed to be instances of one motif, the count matrix representing it is a 4 by k table whose i, j entry (1 ≤ i ≤ 4, 1 ≤ j ≤ k) is the number of times nucleotide i appears in position j across the strings in M. Again, rows 1, 2, 3, 4 should correspond to nucleotides A, C, G, T in that order. Note that each column of such a count matrix should sum to the same number; before pseudocounting, this number will be n, the number of sequences in M; after pseudocounting, it will be n+p, where p is the sum of the pseudocount vector.

    A frequency matrix is another 4 by k table, giving the fraction of each nucleotide at each position, i.e., the result of dividing each entry in a count matrix by the sum of the corresponding column. (Depending on context, "counts" might be before or after adding pseudocounts. Column sums in the frequency matrix should be 1.0 in either case.)

    The corresponding weight matrix (WMM) is another 4 by k table whose i,j entry is the base 2 logarithm of the ratio of the frequency matrix entry to the background frequency of that nucleotide. (For simplicity in my slides, I scaled these log ratios by a factor of 10 and rounded them to integers. Do not do this in your program.)
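
    To make these definitions concrete, here is a minimal Python sketch of the count / frequency / weight pipeline (the numpy 4 x k array representation is my own illustration, not a requirement):

        import numpy as np

        NUC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}   # row order A, C, G, T
        BG = np.full(4, 0.25)                     # uniform background frequencies

        def makeCountMatrix(M):
            """4 x k counts of each nucleotide at each motif position."""
            k = len(M[0])
            counts = np.zeros((4, k))
            for seq in M:
                for j, base in enumerate(seq):
                    counts[NUC[base], j] += 1
            return counts

        def addPseudo(counts, pseudo):
            """Add the length-4 pseudocount vector to every column."""
            return counts + np.asarray(pseudo, dtype=float)[:, None]

        def makeFrequencyMatrix(counts):
            """Divide each column by its sum; columns then sum to 1.0."""
            return counts / counts.sum(axis=0)

        def makeWMM(freq):
            """Entrywise log2(frequency / background); no scaling or rounding."""
            return np.log2(freq / BG[:, None])

        def entropy(freq):
            """Relative entropy (bits) of freq vs. background, summed over all
            positions, treating 0 * log(0) as 0."""
            with np.errstate(divide='ignore', invalid='ignore'):
                terms = np.where(freq > 0, freq * np.log2(freq / BG[:, None]), 0.0)
            return float(terms.sum())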

    Modules: Your code should contain (at least) the following 8 subroutines.

    1. makeCountMatrix: build a count matrix corresponding to a given list M of length k sequences.
    2. addPseudo: given a count matrix, and a pseudocount vector, build a new count matrix by adding the pseudocount vector to each column.
    3. makeFrequencyMatrix: make a frequency matrix from a count matrix.
    4. entropy: calculate the relative entropy of a frequency matrix relative to background.
    5. makeWMM: make a weight matrix from a frequency matrix. (It may be convenient to save the entropy with the WMM, too.)
    6. scanWMM: given a WMM of width k and one or more sequences of length ≥ k, scan/score each start position of each sequence (excluding start positions with fewer than k characters remaining, where a full window does not fit).
    7. Estep: given an (estimated) WMM and a set of sequences, run the E-step of MEME's EM algorithm; i.e., what is E[zij], where zij is the zero-one variable indicating whether the motif instance in sequence i begins in position j. (scanWMM does most of the required work.)
    8. Mstep: given the Estep result, re-estimate the WMM. (This is similar to makeCountMatrix / addPseudo / makeFrequencyMatrix / makeWMM, with the additional wrinkle that the k-mer inputs to makeCountMatrix each have weights, and it may be convenient to generalize makeCountMatrix so that its k-mer inputs are successive, overlapping subsequences of some longer strings, rather than having them explicitly presented as distinct, non-overlapping inputs. A sketch of subroutines 6-8 follows this list.)
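
    Continuing the sketch above, here is one hedged reading of subroutines 6-8 (illustrative, not the required implementation; the E-step normalization assumes the uniform background defined earlier):

        def scanWMM(wmm, seqs):
            """scores[i][j] = WMM score of sequence i's window starting at j
            (0-based); start positions lacking a full k-wide window are skipped."""
            k = wmm.shape[1]
            scores = []
            for seq in seqs:
                row = [sum(wmm[NUC[base], p] for p, base in enumerate(seq[j:j + k]))
                       for j in range(len(seq) - k + 1)]
                scores.append(row)
            return scores

        def Estep(wmm, seqs):
            """E[zij]: since scores are log2 odds, 2^score is proportional to
            P(sequence | motif starts at j); normalize within each sequence."""
            z = []
            for row in scanWMM(wmm, seqs):
                odds = np.exp2(np.array(row))
                z.append(odds / odds.sum())
            return z

        def Mstep(z, seqs, k, pseudo):
            """Re-estimate the WMM from fractional, weighted window counts."""
            counts = np.zeros((4, k))
            for zi, seq in zip(z, seqs):
                for j, w in enumerate(zi):            # w = E[zij]
                    for p, base in enumerate(seq[j:j + k]):
                        counts[NUC[base], p] += w
            freq = makeFrequencyMatrix(addPseudo(counts, pseudo))
            return makeWMM(freq), entropy(freq)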

    Inputs: Your key inputs will be FASTA files, each containing several sequences. File format is as in assignment 1, except that it's DNA, not protein. Both upper- and lower-case ACGT should be allowed, and treated equivalently. (Other non-whitespace should be replaced by T. This is not what I'd do in production code, but it is simple; see the sketch below.) I will provide four specific .fasta files ( hw2-debug-train.fasta, hw2-debug-eval.fasta, hw2-final-train.fasta, hw2-final-eval.fasta ) for your use. I suggest that you do your initial debugging and testing on files that you make up yourself. "Plant" a motif of your own choosing amidst random flanking sequence of up to 100 bases in a total of 2-10 sequences. Start with short ones with near-identical motifs, and gradually make it more difficult. Train (see below) on that and evaluate (see below) on the same or an additional set of synthetic sequences. Once you are satisfied with your performance on that, try it on my "debug" data sets, which will have 10-20 sequences. To promote "crowd-sourced debugging," you may post results on my "debug" sets on the course discussion board. (Just for fun, include your choice of programming language, approximate code length, and runtime on this small example.) Finally, train your motif model (i.e., run your motif discovery algorithm) on my "final-train" data set, and evaluate the result on my "final-eval" data set, as described below.
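
    A sketch of the normalization just described (readFasta is a hypothetical reader, structured like assignment 1's):

        import re

        def readFasta(path):
            """Return the list of sequences in a FASTA file, uppercased, with
            every non-ACGT character replaced by T (as specified above)."""
            seqs, cur = [], []
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith('>'):
                        if cur:
                            seqs.append(''.join(cur))
                            cur = []
                    elif line:
                        cur.append(re.sub('[^ACGT]', 'T', line.upper()))
            if cur:
                seqs.append(''.join(cur))
            return seqs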

    Assume a motif width of k=13, and use (0.25, 0.25, 0.25, 0.25) as your pseudocount vector (except as noted under "Initialization"). All sequences in my inputs will be of length L=100+k.

    EM Initialization: Any length k substring of the input might be a motif instance, and given our assumption that most input sequences contain a motif instance, we will do a MEME-like brute force initialization by using every length k subsequence of the first input sequence as "seeds." One or more of these should be a good starting point for the EM iteration. But of course each of them may be a bit "atypical," so we shouldn't demand an exact match to the seed sequence. Instead, for each 1 ≤ j ≤ L-k+1, create a count matrix that counts only the single k-mer in the k-base window beginning at the j-th position of the first sequence of your input (i.e., your own input set, or one of my training sets). Add pseudocounts so that the seed sequence accounts for 85% of the count in each column of the count matrix. Build frequency and weight matrices from this. This will give you S = L-k+1 = 101 seed WMMs.
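
    One consistent reading of the 85% rule, assuming the pseudocount is spread uniformly over all four nucleotides: each column of the seed's count matrix sums to 1, so a per-nucleotide pseudocount q gives the seed base a share of (1+q)/(1+4q); setting this to 0.85 yields q = 0.0625. A sketch (makeSeedWMMs is my own name):

        def makeSeedWMMs(seq, k=13):
            """One WMM per length-k window of the given (first) sequence.
            Assumes a uniform per-nucleotide pseudocount q chosen so that
            (1 + q) / (1 + 4q) = 0.85, i.e., q = 0.0625."""
            q = 0.0625
            seeds = []
            for j in range(len(seq) - k + 1):
                counts = addPseudo(makeCountMatrix([seq[j:j + k]]), [q] * 4)
                seeds.append(makeWMM(makeFrequencyMatrix(counts)))
            return seeds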

    EM Iteration: For each of the S seed WMMs defined in "Initialization," do ten E-step/M-step pairs. Select as your final "motif" the best (highest entropy) of the S candidates after round 10; call this WMM D. For comparison purposes, I also want to single out the WMM built by 10 E-M rounds on the seed sequence at position 51; call this WMM E.
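
    Put together, the training loop might look like the following (a sketch under the same assumptions and hypothetical names as above):

        def runEM(wmm, seqs, k=13, pseudo=(0.25,) * 4, rounds=10):
            """Run E-step/M-step rounds from a seed WMM; return the final WMM
            and the entropies of the seed plus each of its refined iterates."""
            ents = [entropy(BG[:, None] * 2.0 ** wmm)]   # recover freq from log-odds
            for _ in range(rounds):
                z = Estep(wmm, seqs)
                wmm, ent = Mstep(z, seqs, k, pseudo)
                ents.append(ent)
            return wmm, ents

        train = readFasta('hw2-final-train.fasta')
        results = [runEM(w, train) for w in makeSeedWMMs(train[0])]
        wmm_D, _ = max(results, key=lambda r: r[1][-1])   # best final entropy
        wmm_E, _ = results[50]                            # seed at position 51 (1-based)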

    As a simple descriptive summary of this process, print, in a tidy-ish S-row by 11-column table, the entropies of those seed WMMs and their 10 successive E-M-refined iterates (one column for the seed, plus one per round). Also print the frequency matrices for WMMs D and E.

    Evaluation: You will evaluate each of the S WMMs built by 10 E-M iterations on the seeds in the following way.

    My debug-eval dataset contains 20 sequences of length 113, each of which contains a motif instance of length 13 starting in position 51. The full eval data set is similar, but with more sequences. (Both training data sets also contain length 13 motifs, but are more variable as to motif position.)

    Visually, in my files, the motif instance is marked by use of uppercase letters, whereas the (~50-base) flanking sequences are all lower case. This may aid your debugging, but your code should not exploit this cheat; in real data you'd never know this.

    To use the eval sets as a "gold standard" against which to evaluate a motif, scan each sequence with your motif WMM, scoring each of the 113-k+1 potential start positions, and determine the position xi having the highest score in sequence i. Ideally, xi=51 would get the highest score, but it is not unlikely that your motif might start a few positions early or late. As a simple summary, report the mean and standard deviation of yi = (xi-51) for the subset of positions where abs(yi) ≤ 5, and report the number of eval sequences for which abs(yi) > 5. (E.g., the latter quantity counts the number of "bad" predictions, where, simplistically, "bad" means that it overlaps the "correct" instance by ≤ 7 of the 13 positions.) Do this separately for each of the S WMMs, and print these three values in three additional columns appended to the S x 11 table requested above. Your turn-in should only include the table for the "final" data, not the "debug" data.
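
    A sketch of this per-WMM summary (positions here are 1-based, matching the text; note the good-subset statistics are undefined if every prediction is bad):

        def evaluateWMM(wmm, eval_seqs):
            """Mean and standard deviation of yi = xi - 51 over the subset with
            abs(yi) <= 5, plus the count of sequences with abs(yi) > 5."""
            ys = []
            for row in scanWMM(wmm, eval_seqs):
                xi = int(np.argmax(row)) + 1      # best-scoring start, 1-based
                ys.append(xi - 51)
            good = [y for y in ys if abs(y) <= 5]
            bad = len(ys) - len(good)
            return np.mean(good), np.std(good), bad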

    Also answer the following questions, which may help you interpret your results. (A few sentences will suffice for 3-5; I'm not expecting a deep essay.)

    1. Suppose X is uniformly distributed over the integers -5, -4, ..., +5. What are the mean, variance, and standard deviation of X?
    2. Suppose you sample (x1, x2, ..., xn) and observe x1 = -5, x2 = +5 and the rest are zero. What are the approximate mean, variance and standard deviation of this sample when n = 16? When n = 900?
    3. What can you say about the number of "bad" predictions being made by this method on the "debug" data? On the "final" data? E.g., do these numbers seem to suggest that the E-M algorithm is succeeding or failing?
    4. What do your answers to the first two questions suggest about the distribution of your predicted motif start sites relative to the "correct" start sites?
    5. Looking at the frequency tables you printed for the highlighted WMMs D and E in the "final" data, give us your thoughts as to why the predicted start positions are sometimes a bit off, and as to why the algorithm lands on WMMs as different as D and E.

    Note: This assignment is a bit compute-heavy; my (clumsy) implementation takes several minutes on my (aged) laptop. If speed degrades your coding/debugging cycle, you might try using fewer E-M iterations (perhaps 5 instead of 10), and/or fewer seeds (maybe every m-th starting position for some m > 1, rather than every position) for your debugging/early experiments. Revert to the requested 10 iterations and all S seeds for your final runs.

    Extra Credit: There are many additional (largely independent) directions for extra credit. As usual, pursue those that interest you, and talk to the TAs or me if you have other ideas. In all cases, summarize in your write-up your comparison(s) and describe your evaluation methodology, including finding/generating appropriate test data.

    1. Generate histograms of xi or yi, which will show more detail about the algorithm's performance than the requested mean/variance summary statistics.

    2. If you rerun the basic assignment assuming an inappropriate background model, e.g., the M. jannaschii-like (0.2, 0.3, 0.3, 0.2) background vector, I expect results to worsen. Do they? How much? (Note that the E-Step formulas given on the lecture slides assumed uniform background and need to be modified slightly to correctly handle the non-uniform case; pay special attention to the effect of non-motif sequence "flanking" the motif. Also reconsider conversion of scores to zij probabilities.)
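
    One hedged way to see the needed change (my own sketch, not necessarily the slides' derivation): P(Xi | zij = 1) is a product of background probabilities over the flanking positions times motif frequencies over the k motif positions. Dividing by the j-independent constant obtained by multiplying the background probability over all positions of Xi leaves E[zij] proportional to 2^Sij, where Sij is the log-odds score from scanWMM. So normalizing 2^score within each sequence remains valid for any background, provided the WMM entries are genuine log-odds against that same background; the flanking sequence then contributes only the constant that cancels.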

    3. For our mouse data, is there a more realistic background than the uniform 25% model? How would you find it? Does it work better?

    4. How much effect does the pseudocount vector have? Does it matter more or less on small training sets than large ones?

    5. Find a program or web tool (or implement your own) for generating "Sequence Logos", and use it to visualize each of the frequency matrices you report.

    6. You might settle on a motif that is offset a few positions from the optimum, and/or is too wide or too narrow. Investigate approaches to ameliorate these problems.

    7. Another performance hack (similar to one in the real MEME package) that you might consider: run the initially selected seeds for only a few E-M rounds, then select a smaller subset of the most promising candidates to run until some more formal convergence criterion is satisfied, as opposed to iterating on all S candidates. How much faster is this? Does it consistently find a good final motif?

    8. I have mentioned before that "replace non-ACGT by T", while simple, is less than ideal. Here is a better idea: In the IUPAC nucleotide code (google it) the letter "R", for example, stands for "puRine", i.e., either A or G, while "N" stands for "aNy" nucleotide. When the input contains one of these letters, replace it by a randomly selected letter from the allowed set. Even better, in applications where this is clearly definable, treat it as a mixture of the possibilities; e.g., when frequency counting an R, add 0.5 to both A and G, and when scoring a WMM vs a sequence containing R, average the A and G scores. You could explore these options by replacing, say, 10% of the letters in this assignment's input file by R (or a mixture of IUPAC codes), then rerunning your plain algorithm (where they all become T), and comparing that to a version of your algorithm modified to use the idea given here.
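
    If you try the mixture version, the standard IUPAC ambiguity sets and a fractional-counting helper might look like this (a sketch; addFractionalCount is a hypothetical helper, and the scoring analogue would average the WMM scores over the allowed set):

        # Standard IUPAC nucleotide codes mapped to the bases they allow.
        IUPAC = {
            'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
            'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
            'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
        }

        def addFractionalCount(counts, base, j):
            """Spread one count for column j evenly over the bases an ambiguity
            code allows; e.g., an R adds 0.5 to both the A and G rows."""
            allowed = IUPAC[base]
            for b in allowed:
                counts[NUC[b], j] += 1.0 / len(allowed)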

    9. In my solution to this assignment, several starting seeds yielded motifs that were (by eye) "similar" to what I believe to be the "best" answer. How might you quantify "similarity"/"dis-similarity" between WMMs? Especially if one motif is offset a few positions with respect to the other? Can you automate such comparisons?

    10. The assignment basically ignores termination of the EM algorithm for this problem. Investigate approaches to automating this. Are the solutions found by your algorithm (as defined by the basic assignment) "converged," or will they change, perhaps slowly but substantially, if more formal convergence criteria are added?

    11. Consider a short input sequence, vs a much longer one, especially when neither contains an especially strong match to the consensus. E.g., in the extreme where the short one is exactly as long as the presumed motif length, the whole sequence would be taken as the motif instance with probability 1, no matter how poorly it matches--perhaps a bad idea. Are there better ways to handle this?

    12. Modify the learning algorithm to allow some prior probability, perhaps length-dependent (see above), that the sequence does not contain a motif instance. How would the algorithm change? How might you set this prior probability? Does it perform better? How would you evaluate that?

    Code: As before, you may use C, C++, C#, Java, Python, or R; ask if you prefer something else.

    Deliverables: Your code, and a writeup (.pdf, .doc, or .rtf) containing descriptions of any extra credit steps you tried and the results requested above (when running on my "final" datasets, not the "debug" sets).

    Turn-in: We are switching to Gradescope (in place of Canvas) for homework turn-ins. I think you need to log in with the email that you use for Canvas, e.g., log in using SSO by selecting the "School Credentials" option and then selecting "University of Washington NetID". On the Gradescope dashboard, you will find an assignment called "HW2". Click on the assignment, upload your code file(s) and your writeup (or a Zip archive thereof), and then Submit. Make sure your name, email, and student number are included in the writeup and in comments in your code, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial. Please do not include easily rebuilt object or binary files, core dumps, large downloaded FASTA files, etc.; they are bulky.

    Change History:

    1. 10/27: Added turn-in instructions.

