CSE 527, Au '09: Assignment #1, Due: 10/19/2009

University of Washington Computer Science & Engineering

CSE Home

About Us

Contact Info

Reading: See Schedule page.

Homework:

Implement the Needleman-Wunsch global sequence alignment algorithm described in lecture (and text section 2.3; not the fancy linear space version in section 2.6). (Yes, I know you can find code on the web to do this, but don't.) For programming language, you may use any of C, C++, C#, Java, Perl, Python, Ruby, R; ask if you prefer something else. Please avoid proprietary or nonstandard frameworks, libraries or packages; I want to be able to run your program. (Libraries like BioPerl are not needed, but allowed. Obviously, avoid their alignment code, but if you want to use them, e.g. to download the relevant sequence files, that's fine.)
Also implement the "trace-back" procedure to display (one of) the optimal alignment(s).
Use the simple linear gap cost of -4 (an arbitrary value, not a recommended default) together with the BLOSUM62 scoring matrix from the online supplement to:
- SR Eddy, "Where did the BLOSUM62 alignment score matrix come from?" Nat. Biotechnol., 22, #8 (2004) 1035-6. [offcampus]
(You may ignore the B, Z, X and * rows/columns; they are various wild-cards. Also ignore any sequence characters other than the codes for the 20 standard amino acids.) This 2 page tutorial is required reading. (Eddy's other "What is..." papers, found in the course reading list, are also highly recommended.)
Caution: I haven't checked lately, but for several consecutive years when I did, the top Google hit for "BLOSUM62" was a nice little incorrect table. Maybe Bing will be better... Until you're sure, use only the trusted source above.

Deliverables:

Your (commented) source code.
Run your algorithm on the pair of strings "deadly" and "ddgearlyk", printing the full alignment matrix, optimal score, the alignment produced by the trace-back, and p-value resulting from permuting "ddgearlyk". (Don't print the matrix for the larger examples below.)

Align and score protein sequence 1 below against each of the other six.

	Species	Protein	Name	Accession
1	Homo sapiens (Human)	Hemoglobin subunit beta	HBB_HUMAN	P68871
2	Pan troglodytes (Chimpanzee)	Hemoglobin subunit beta	HBB_PANTR	P68873
3	Mus musculus (Mouse)	Hemoglobin subunit beta 1	HBB1_MOUSE	P02088
4	Gallus gallus (Chicken)	Hemoglobin subunit beta	HBB_CHICK	P02112
5	Fugu rubripes (Japanese pufferfish)	Hemoglobin beta subunit	Q802A3_FUGRU	Q802A3
6	Vigna unguiculata (Cowpea)	Leghaemoglobin	Q540F0_VIGUN	Q540F0
7	Homo sapiens (Human)	Insulin-like 3 precursor	INSL3_HUMAN	P51460

You can download the amino acid sequence for each from SwissProt. Select "UniProtKB" in the search pulldown and paste in the accession number. Look for the "FASTA" link under "Sequences" near the middle of the page. But do stop to browse the primary page a bit before you click through. What does this protein do? Do you expect it to be similar to HBB_HUMAN?

Pick an eighth sequence by using NCBI BLAST at http://blast.ncbi.nlm.nih.gov/Blast.cgi to look for a remote relative of HBB_HUMAN. Tell me what it is and score its alignment to HBB_HUMAN as above.
Calculate and report p-values for the significance of each of these alignments using the permutation method sketched in lecture. Do at least 1000 random permutations for each, preferably 10,000 or more.

Extra Credit Options:
1. An alternate way to give p-value estimates is to align HBB_HUMAN to a large number of non-hemoglobin genes, e.g. by picking them at random from SwissProt (and hoping they aren't homologs). Try it. Perhaps give a score histogram for random permutations and another for random SwissProt sequences. Does there seem to be a bias in scores from one versus the other?
2. Implement the local (Smith-Waterman) alignment algorithm, run it on the same examples, and compare/contrast results.
3. Implement the affine gap version of S-W and/or N-W, and compare the alignments it generates to those you got above. Try gap initiation cost -5 and gap extension cost -2. These are likely not good choices; experiment with others, tell me what you picked and why. The proteins above may not be the best choices; feel free to pick others, e.g. perhaps more remote globins in bacteria, or a different family altoghther.
4. Find code on the web for calculating (gapless) EVD K and lambda parameters from your score matrix. Estimate significance this way and compare to the permutation p-values above.
5. Use of EVD for gapless alignment is theoretically justified, but this is not known to be true for gapped alignments (at least not yet; it's a hard problem to analyze). However, empirically, it seems to work reasonably well. A complication, however, is that there isn't a handy theory available to calculate the lamdba and K parameters needed in the EVD formulae. Instead, one estimates them by scoring some random sequences, and (roughly) doing a curve-fitting procedure. The advantages of this over permutation p-values are that (1) you only have to do the randomization once for a given score table, rather than repeat it for each sequence for which you want a p-value, and (2) you can extrapolate the EVD well beyond the random data you started with, e.g., estimate p-values of 10^-9 without doing 10⁹ random trials. Eddy lays this out very nicely in this TR. Try it, and compare results to the above approaches. Eddy's "histogram.c" in his HMMer package does most of the work for you. There's also a Perl wrapper for it in BioPerl, if you find Perl more convenient (and possibly also in BioJava, BioPython, etc., I haven't looked.)
6. Permutation p-values would be more reliable if your permutations conserved the observed frequencies of adjacent amino acids ("di-residue" statistics) as well as conserving the overall frequency of each in isolation, as the simple permutation approach does. Dig up the method and/or code for doing this, try it and compare.
7. Do a similar exploration of related sequences in some other protein family over large evolutionary distances. Proteins involved in very fundamental processes (replication, ribosome, chromatin structure, ...) are likely to be highly conserved; those more specific to "modern" developments, neuronal proteins for example, may be more rapidly evolving. What do you find?
Notes: It's fine to have the BLOSUM matrix as a built-in constant in your code, which is probably easier than writing general code to read in an arbitrary score matrix. Similarly, you don't need elaborate dynamic storage allocation if it's easier to allocate arrays of size N for some suitably large N fixed in your program. More generally, I'm not expecting elaborately engineered, highly flexible, bulletproof software or fancy GUIs for any of the programming assignments. That's fine if you want, but I'll just look at the basics, so you can focus on the core algorithm, i.e., sequence alignment this week.
Turn in: your source code together with a brief write-up (.pdf, .doc, .rtf or plain text) summarizing your results, including alignments, scores and p-values for the example protein pairs as outlined above. Say how many random trials you did for each.

If you did any of the extra credit steps, say which and describe your findings in your writeup.

Bundle everything into a single .zip or .tgz file and UPLOAD IT HERE. Make sure your name is included in the writeup, and clearly label any extra credit components. Include build/run instructions unless they are completely trivial.


	Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX