Homework #2: Protein Similarity Search

CSE 427: Computational Biology
January 20, 2009
due: Friday, January 30, 2009
Last update: 1/20

Write a program that takes a protein sequence Q as input and finds the protein S in your very own personal prokaryote (from Homework #0) that has the highest scoring local alignment with Q. Use the algorithm from Section 5.1 of the CSE 527 lecture notes for each local alignment.

Use the BLOSUM62 scoring matrix to score any pair of amino acids. You can ignore the last two rows and columns, unless that character happens to come up in one of your proteins. Use a score of -4 for each insertion or deletion (i.e., the alignment of the gap character with an amino acid).

You can find a list of all your prokaryote's protein sequences in the .faa file on the prokaryote's NCBI Microbial FTP page. For the query protein Q, use the 260 amino acid sequence

MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSMTLLM
LGLLTNTLTMYQWWRDVTRESTYQGHHTPPVQKGLRYGMILFITSEVFFFAGFFWAFY
HSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMI
QALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTIC
FIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWG

This is actually a human protein sequence; one of the goals of this homework is to see how close a match there is to certain human proteins in your prokaryote, despite how distant humans are evolutionarily from prokaryotes.

It will actually be somewhat time-consuming to run the pairwise alignment algorithm for each of your prokaryote's protein sequences. Since your prokaryote has thousands of genes, I estimate that the total number of dynamic programming matrix entries you will need to compute is in the hundreds of millions, so be careful about efficiency. One thing you should do to save time is only compute the score of the optimal local alignment, not reconstruct the alignment itself except once at the very end for the single best-scoring protein S that you found.

Extra credit:

Devise a way to preprocess all the protein sequences of your prokaryote so that it is then much faster, given a new query protein Q, to find the protein S most similar to Q among all your prokaryote's protein sequences.
Your ideas here.

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in a separate file.

You will actually run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Shewanella oneidensis. For each of these individually, you will report the protein S that has the best local alignment with Q. Turn in the following items:

The approximate elapsed time (in seconds) it took to run your program on all the prokaryote's protein sequences.

The protein identifier and name for the protein S given in the .faa file, e.g.,

>gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]

A very brief description of the function of this protein. A good place to start is the PROSITE database, entering the protein name in PROSITE's Search box.
The score of the optimal local alignment of Q and S.
An optimal local alignment of Q and S. Present your alignment in the format of the following example:
```
Q  50   YETIKVTAYRYGWIFEY------------------PNGSKSL----NTLYIKAGKLYRLE  87
        Y T+K   +++ W +EY                  PNG+  L    + + + A    R+ 
S  95   YVTVKAMGHQWYWNYEYTDGTNVSFDSYMIQTQDLPNGAPRLLEVDHRMVMPANLQTRVV  154
```
If the two aligned amino acids are identical, the middle line repeats that amino acid. If the BLOSUM62 scoring matrix gives a positive score to two nonidentical aligned amino acids, the middle line shows a "+". The indices at the beginning and end of each sequence show the start and end indices of the (locally) aligned sequences within their full protein sequences. If the alignment is too long to fit on one line, break it into multiple lines.

Your turn-in will consist of the following files, named as shown.

match.txt:

[your name]

Shewanella oneidensis
[running time on Shewanella oneidensis]
[protein ID and protein name for S]
[brief description of protein S]
[score of optimal local alignment of Q and S]
[optimal local alignment of Q and S]

[name of your personal prokaryote]
[running time on your personal prokaryote]
[protein ID and protein name for S]
[brief description of protein S]
[score of optimal local alignment of Q and S]
[optimal local alignment of Q and S]

Source files for your program, with appropriate filenames.
README: a short text file explaining how to compile and run your program.
any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalysttools.washington.edu/collectit/dropbox/lachesis/4587.