CSE P590A, Su '06: Assignment #2, Due: 7/13/06

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Reading: See Schedule page.

Homework:

Implement the Smith-Waterman local sequence alignment algorithm described in last week's lecture (and text section 2.3; not the fancy linear space version in section 2.6). (Yes, I know you can find code on the web to do this, but don't.) You may use C, C++, C#, Java, Perl, Python or Ruby; ask if you prefer something else.
Also implement the "trace-back" procedure to display (one of) the optimal alignment(s).
Use the simple linear gap cost of -4 (an arbitrary value, not a recommended default) together with the BLOSUM62 scoring matrix from the online supplement to:
- SR Eddy, "Where did the BLOSUM62 alignment score matrix come from?" Nat. Biotechnol., 22, #8 (2004) 1035-6. [offcampus]
(You may ignore the B, Z, X and * rows/columns; they are various wild-cards. Also ignore any sequence characters other than the codes for the 20 standard amino acids.) This 2 page tutorial is also well worth reading, as are his other papers in this series, especially:
- SR Eddy, "What is dynamic programming?" Nat. Biotechnol., 22, #7 (2004) 909-10. [offcampus]
- SR Eddy, "What is Bayesian statistics?" Nat. Biotechnol., 22, #9 (2004) 1177-8. [offcampus]
and (in a week or three):
- SR Eddy, "What is a hidden Markov model?" Nat. Biotechnol., 22, #10 (2004) 1315-6. [offcampus]
- SR Eddy, "How do RNA folding algorithms work?" Nat. Biotechnol., 22, #11 (2004) 1457-8. [offcampus]
Run your algorithm on the pair of strings "deadly" And "ddgearlyk", printing the full alignment matrix, optimal score, the alignment produced by the trace-back, and p-value resulting from permuting "ddgearlyk". (Don't print the matrix for the larger examples below.)

Align and score protein sequence 1 below against each of the other six.

	Species	Protein	Name	Accession
1	Homo sapiens (Human)	Hemoglobin subunit beta	HBB_HUMAN	P68871
2	Pan troglodytes (Chimpanzee)	Hemoglobin subunit beta	HBB_PANTR	P68873
3	Mus musculus (Mouse)	Hemoglobin subunit beta 1	HBB1_MOUSE	P02088
4	Gallus gallus (Chicken)	Hemoglobin subunit beta	HBB_CHICK	P02112
5	Fugu rubripes (Japanese pufferfish)	Hemoglobin beta subunit	Q802A3_FUGRU	Q802A3
6	Vigna unguiculata (Cowpea)	Leghaemoglobin	Q540F0_VIGUN	Q540F0
7	Homo sapiens (Human)	Insulin-like 3 precursor	INSL3_HUMAN	P51460

You can download the amino acid sequence for each from SwissProt. Go to the primary page for each and look for the "FASTA Format" link under "Sequence Information" near the bottom of the page. But do stop to browse the primary page a bit before you click through. What does this protein do? Do you expect it to be similar to HBB_HUMAN?

Calculate p-values for the significance of each of these alignments using the permutation method sketched in lecture 3. Do at least 1000 random permutations for each, preferably 10,000.
Extra Credit Options:
1. Implement the global (Needleman-Wunsch) alignment algorithm, run it on the same examples, and compare/contrast results.
2. Implement the affine gap version of S-W and/or N-W, and compare the alignments it generates to those you got above. Try gap initiation cost -5 and gap extension cost -2. These are likely not good choices; experiment with others, tell me what you picked and why. The proteins above may not be the best choices; feel free to pick others, e.g. perhaps more remote globins in bacteria, or a different family altoghther.
3. Find code on the web for calculating (gapless) EVD K and lambda parameters from your score matrix. Estimate significance this way and compare to the permutation p-values above.
4. Use of EVD for gapless alignment is theoretically justified, but this is not known to be true for gapped alignments (at least not yet; it's a hard problem to analyze). However, empirically, it seems to work reasonably well. A complication, however, is that there isn't a handy theory available to calculate the lamdba and K parameters needed in the EVD formulae. Instead, one estimates them by scoring some random sequences, and (roughly) doing a curve-fitting procedure. The advantages of this over permutation p-values are that (1) you only have to do the randomization once for a given score table, rather than repeat it for each sequence for which you want a p-value, and (2) you can extrapolate the EVD well beyond the random data you started with, e.g., estimate p-values of 10^-9 without doing 10⁹ random trials. Eddy lays this out very nicely in this TR. Try it, and compare results to the above approaches. Eddy's "histogram.c" in his HMMer package does most of the work for you. There's also a Perl wrapper for it in BioPerl, if you find Perl more convenient (and possibly also in BioJava, BioPython, etc., I haven't looked.)
5. Permutation p-values would be more reliable if your permutations conserved the observed frequencies of adjacent amino acids ("di-residue" statistics) as well as conserving the overall frequency of each in isolation, as the simple permutation approach does. Dig up the method and/or code for doing this and try it.
6. Do a similar exploration of related sequences in some other protein family over large evolutionary distances. Proteins involved in very fundamental processes (replication, ribosome, chromatin structure, ...) are likely to be highly conserved; those more specific to "modern" developments, neuronal proteins for example, may be more rapidly evolving. What do you find?
Post a note on the class message board addressing the following theme. Suppose you were interested in studying hemoglobin. Are steps 1-6 above a reasonable part of such a study? What's good? What's bad? How would you change it? (Again, see course FAQ for general info on our expectations for and grading of these posts.)
Turn in your (commented) source code together with a brief write-up (.pdf, .doc, .rtf or plain text) summarizing your results, including alignments, scores and p-values for the example protein pairs. Say how many random trials you did for each. If there's anything the least bit tricky about building your program, explain it; you're dealing with an ivory-tower pinhead here, so don't assume I'm competent...

If you did any of the extra credit steps, say which and describe your findings in your writeup.

Bundle everything into a single .zip or .tgz file and UPLOAD IT HERE.

I prefer that it be turned in before class on 7/13, but in any case, turn it in by 11:59PM 7/15.
After you finish the weekly homework, please respond to this simple Non-Anonymous Poll

If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX [comments to csep590a-webmaster at cs.washington.edu]