|
CSE Home | About Us | Search | Contact Info |
Reading: See Schedule page.
Homework:
Implement the Smith-Waterman local sequence alignment algorithm described in last week's lecture (and text section 2.3; not the fancy linear space version in section 2.6). (Yes, I know you can find code on the web to do this, but don't.) You may use any programming language we can run (at least C, C++, C#, Java, Perl, Python, Ruby, Matlab, R; ask if you prefer something else).
Also implement the "trace-back" procedure to display (one of) the optimal alignment(s).
Use the simple linear gap cost of -4 (an arbitrary value, not a recommended default) together with the BLOSUM62 scoring matrix from the online supplement to:
(You may ignore the B, Z, X and * rows/columns; they are various wild-cards. Also ignore any sequence characters other than the codes for the 20 standard amino acids.) This 2 page tutorial is also well worth reading (as are his other papers, found in the course reading list.)
It's fine to have the BLOSUM matrix as a built-in constant in your code, which is probably easier than writing general code to read in an arbitrary score matrix. Similarly, you don't need elaborate dynamic storage allocation if it's easier to allocate arrays of size N for some suitably large N fixed in your program. More generally, I'm not expecting elaborately engineered, highly flexible, bulletproof software for any of the programming assignments. I just want the basics, so you can focus on the core algorithm, i.e., Smith-Waterman this week.
Run your algorithm on the pair of strings "deadly" And "ddgearlyk", printing the full alignment matrix, optimal score, the alignment produced by the trace-back, and p-value resulting from permuting "ddgearlyk". (Don't print the matrix for the larger examples below.)
Align and score protein sequence 1 below against each of the other six.
Species | Protein | Name | Accession | |
---|---|---|---|---|
1 | Homo sapiens (Human) | Hemoglobin subunit beta | HBB_HUMAN | P68871 |
2 | Pan troglodytes (Chimpanzee) | Hemoglobin subunit beta | HBB_PANTR | P68873 |
3 | Mus musculus (Mouse) | Hemoglobin subunit beta 1 | HBB1_MOUSE | P02088 |
4 | Gallus gallus (Chicken) | Hemoglobin subunit beta | HBB_CHICK | P02112 |
5 | Fugu rubripes (Japanese pufferfish) | Hemoglobin beta subunit | Q802A3_FUGRU | Q802A3 |
6 | Vigna unguiculata (Cowpea) | Leghaemoglobin | Q540F0_VIGUN | Q540F0 |
7 | Homo sapiens (Human) | Insulin-like 3 precursor | INSL3_HUMAN | P51460 |
You can download the amino acid sequence for each from SwissProt. Go to the primary page for each and look for the "FASTA Format" link under "Sequence Information" near the bottom of the page. But do stop to browse the primary page a bit before you click through. What does this protein do? Do you expect it to be similar to HBB_HUMAN?
Calculate p-values for the significance of each of these alignments using the permutation method sketched in lecture (approx. 10/8 or 10/10). Do at least 1000 random permutations for each, preferably 10,000.
Extra Credit Options:
An alternate way to give p-value estimates is to align HBB_HUMAN to a large number of non-hemoglobin genes, e.g. by picking them at random from SwissProt (and hoping they aren't homologs). Try it.
Implement the global (Needleman-Wunsch) alignment algorithm, run it on the same examples, and compare/contrast results.
Implement the affine gap version of S-W and/or N-W, and compare the alignments it generates to those you got above. Try gap initiation cost -5 and gap extension cost -2. These are likely not good choices; experiment with others, tell me what you picked and why. The proteins above may not be the best choices; feel free to pick others, e.g. perhaps more remote globins in bacteria, or a different family altoghther.
Find code on the web for calculating (gapless) EVD K and lambda parameters from your score matrix. Estimate significance this way and compare to the permutation p-values above.
Use of EVD for gapless alignment is theoretically justified, but this is not known to be true for gapped alignments (at least not yet; it's a hard problem to analyze). However, empirically, it seems to work reasonably well. A complication, however, is that there isn't a handy theory available to calculate the lamdba and K parameters needed in the EVD formulae. Instead, one estimates them by scoring some random sequences, and (roughly) doing a curve-fitting procedure. The advantages of this over permutation p-values are that (1) you only have to do the randomization once for a given score table, rather than repeat it for each sequence for which you want a p-value, and (2) you can extrapolate the EVD well beyond the random data you started with, e.g., estimate p-values of 10-9 without doing 109 random trials. Eddy lays this out very nicely in this TR. Try it, and compare results to the above approaches. Eddy's "histogram.c" in his HMMer package does most of the work for you. There's also a Perl wrapper for it in BioPerl, if you find Perl more convenient (and possibly also in BioJava, BioPython, etc., I haven't looked.)
Permutation p-values would be more reliable if your permutations conserved the observed frequencies of adjacent amino acids ("di-residue" statistics) as well as conserving the overall frequency of each in isolation, as the simple permutation approach does. Dig up the method and/or code for doing this and try it.
Do a similar exploration of related sequences in some other protein family over large evolutionary distances. Proteins involved in very fundamental processes (replication, ribosome, chromatin structure, ...) are likely to be highly conserved; those more specific to "modern" developments, neuronal proteins for example, may be more rapidly evolving. What do you find?
Turn in your (commented) source code together with a brief write-up (.pdf, .doc, .rtf or plain text) summarizing your results, including alignments, scores and p-values for the example protein pairs. Say how many random trials you did for each. If there's anything the least bit tricky about building your program, explain it; you're dealing with an ivory-tower pinhead here, so don't assume I'm competent...
If you did any of the extra credit steps, say which and describe your findings in your writeup.
Bundle everything into a single .zip or .tgz file and email it to me.
If desired, you may work in groups of 2 or 3, but if so (a) you must do at least one of the extra credit steps, (b) you really need to collaborate on all steps; you won't learn anything by calling someone else's black box, and (c) clearly identify all group members in each file you turn in.
Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX |