image University of Washington Computer Science & Engineering
  CSE 427Wi '08:  Assignment #3, Due: 3/6/2008
  CSE Home   About Us    Search    Contact Info 

Reading: See Schedule page.

Homework: Protein Similarity Search

Write a program that takes a protein sequence Q as input and finds the protein S in your very own personal prokaryote (from Homework #1) that has the highest scoring local alignment with Q. Use the Smith-Waterman algorithm (from lecture slides or, e.g., Section 5.1 of the CSE 527 lecture notes) for each local alignment.

Use the BLOSUM62 scoring matrix to score any pair of amino acids. You can ignore the last two rows and columns, unless that character happens to come up in one of your proteins. Use a score of -4 for each insertion or deletion (i.e., the alignment of the gap character with an amino acid).

You can find a list of all your prokaryote's protein sequences in the .faa file on the prokaryote's NCBI Microbial FTP page. For the query protein Q, use the following amino acid sequence:

        1 mvgslnciva vsqnmgigkn gdlpwpplrn efryfqrmtt tssvegkqnl vimgkktwfs
       61 ipeknrplkg rinlvlsrel keppqgahfl srslddalkl teqpelankv dmlwivggss
      121 vykeamnhpg hlklfvtrim qdfesdtffp eidlekykll peypgvlsdv qeekgikykf
      181 evyeknd
This is actually a human protein sequence; one of the goals of this homework is to see how close a match there is to certain human proteins in your prokaryote, despite how distant humans are evolutionarily from prokaryotes.

It will actually be somewhat time-consuming to run the pairwise alignment algorithm for each of your prokaryote's protein sequences. Since your prokaryote has thousands of genes, I estimate that the total number of dynamic programming matrix entries you will need to compute is in the hundreds of millions, so be careful about efficiency. One thing you should do to save time is only compute the score of the optimal local alignment, not reconstruct the alignment itself (traceback) except once at the very end for the single best-scoring protein S that you found.

Turn-in Instructions

You will again run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Mycobacterium leprae. For each of these individually, you will report the protein S that has the best local alignment with Q. Turn in the following items:

  1. The approximate elapsed time it took to run your program on all the prokaryote's protein sequences.
  2. The protein identifier and name for the protein S given in the .faa file, e.g.,
    >gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]
    
  3. A very brief description of the function of this protein. A good place to start is the PROSITE database, entering the protein name in PROSITE's Search box.
  4. The score of the optimal local alignment of Q and S.
  5. An optimal local alignment of Q and S. Present your alignment in the format of the following example:
    Q  50   YETIKVTAYRYGWIFEY------------------PNGSKSL----NTLYIKAGKLYRLE  87
            Y T+K   +++ W +EY                  PNG+  L    + + + A    R+ 
    S  95   YVTVKAMGHQWYWNYEYTDGTNVSFDSYMIQTQDLPNGAPRLLEVDHRMVMPANLQTRVV  154
    
    If the two aligned amino acids are identical, the middle line repeats that amino acid. If the BLOSUM62 scoring matrix gives a positive score to two nonidentical aligned amino acids, the middle line shows a "+". The indices at the beginning and end of each sequence show the start and end indices of the (locally) aligned sequences within their full protein sequences. If the alignment is too long to fit on one line, it is preferable to break it into multiple lines, showing, say, about 50 positions per line.

Your turn-in will consist of the following files, named as shown.

  1. match.txt:
    [your name]
    
    Mycobacterium leprae
    [running time on Mycobacterium leprae (programming language you used)]
    [protein ID and protein name for S]
    [brief description of protein S]
    [score of optimal local alignment of Q and S]
    [optimal local alignment of Q and S]
    
    [name of your personal prokaryote]
    [running time on your personal prokaryote (programming language you used)]
    [protein ID and protein name for S]
    [brief description of protein S]
    [score of optimal local alignment of Q and S]
    [optimal local alignment of Q and S]
    
  2. Source files for your program, with appropriate filenames.
  3. README: a short text file explaining how to compile and run your program.
  4. any files describing extra credit work, whose filenames should begin with the prefix "extra".

Zip these files together and upload them via the UW Catalyst system, here. If there's anything the least bit tricky about building/running your program, explain it; don't assume I'm competent...


CSE logo Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX