Homework #4: Protein Homology Search Using a Profile HMM


CSE 427: Computational Biology
February 23, 2009
due: Wednesday, March 11, 2009
Last update: 3/4

Write a program that trains a profile hidden Markov model (HMM) M from a multiple protein alignment, and then finds the protein S in your very own personal prokaryote (from Homework #0) that has the highest scoring Viterbi score with M.

You will use the multiple alignment of 14 proteins from Homework #2 results to train the profile HMM M. (See Figure 5.2 of Durbin et al. for the topology of a profile HMM.) Every column of this alignment that has 6 or fewer gap characters will correspond to a Match state in the HMM, and those sequences with a gap in this column will take the path through the corresponding Delete state. Every column that has 7 or more gap characters will not correspond to any Match state, and those sequences with an amino acid in this column will take the path through the corresponding Insert state.

The Viterbi score you will be using to score each of your personal prokaryote's proteins S is actually the log likelihood ratio of the Viterbi probability of S and the background probability of S. (See Equations (5.1) of Durbin et al. for the recurrences to use. All your logarithms should be computed base 2.) As in Homework #2, you can find a list of all your prokaryote's protein sequences in the .faa file(s) on the prokaryote's NCBI Microbial FTP page. Compute the background frequency qb of each of the 20 amino acids b from these files. The word "frequency" here means the fraction of amino acids that are b in the file(s); that is, the 20 values of qb should sum to 1. (Throughout this homework, treat any letter in the .faa files that does not represent one of the 20 amino acids as though it were an A.) You will need these frequencies for the recurrences and for the adjustments described in the next paragraph.

Because there are only 14 characters in each column of the alignment, when training the HMM some emission probabilities would be set to 0, but this is actually an artifact of the small sample size of 14. Instead of using the actual counts Ek(b) of the number of times b was emitted in state k, use the adjusted counts Ek(b) + qb (in both numerator and denominator) when computing the emission probability ek(b). This way, ek(b) will be a small positive number even if Ek(b) = 0. Similarly, use Ajk + Δjk rather than Ajk (in both numerator and denominator) when computing the transition probability ajk, where Δjk = 0.9 if j and k are both the same type of state (that is, both Match states, or both Insert states, or both Delete states) and Δjk = 0.05 otherwise. If j is the Begin state, use Δjk = 0.1 if k is the Match state and Δjk = 0.45 otherwise. If j is the last Delete state, use Δjk = 0.9 if k is the End state and Δjk = 0.1 otherwise. If j is the last Match or last Insert state, use Δjk = 0.1 if k is the End state and Δjk = 0.9 otherwise.

Finally, for the protein S* that has the best log likelihood ratio score, determine its most probable path through the HMM M by doing the traceback of the Viterbi algorithm. From this path, align S* to the 14 other proteins in the input alignment.

Here is an outline of your entire program, to help you get started:
Training:

  1. Read the multiple sequence alignment A into an array.
  2. Mark the columns of A (the columns with 6 or fewer gap characters) that correspond to match states of M.
  3. For each sequence R in A, follow the path of R through M determined by step 2, incrementing appropriate counts of emissions and transitions.
  4. Compute background frequencies from the .faa file(s).
  5. Compute all emission and transition probabilities.

Testing:
  1. For each sequence S in the .faa file(s), use the recurrences of Equations (5.1) from Durbin et al. to compute score(S) = VEnd(|S|). (The basis for the recurrence is VBegin(0) = 0, Vk(0) = -∞ for k ≠ Begin.)
  2. Identify S* that maximizes score(S).
  3. Trace the Viterbi maximizations backwards to find the best alignment of S* with A.

Extra credit:

Turn-in Instructions

If you experiment with any extra credit enhancements to your program, do not include those enhancements in the basic version you run for the turn-in. Instead, describe your extra credit ideas and any results in a separate file.

You will actually run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Shewanella oneidensis. (Be sure to use the background frequencies from the appropriate prokaryote to train and test the HMM.) For each of these two prokaryotes, you will report the protein S that has the best log likelihood ratio score. Turn in the following items:

  1. The protein identifier and name for the protein S given in the .faa file, e.g.,
    >gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]
    
  2. A very brief description of the function of this protein. A good place to start is the PROSITE database, entering the protein name in PROSITE's Search box.
  3. The log likelihood ratio score of the best protein S.
  4. The protein identifier and name for the best-matching protein you turned in for Homework #2, in the same format as item 1 above.
  5. Your explanation for why S is the same protein or a different protein than you turned in for Homework #2.
  6. The multiple alignment of the original 14 proteins plus S as given by the Viterbi algorithm.

Your turn-in will consist of the following files, named as shown.

  1. match.txt:
    [your name]
    
    Shewanella oneidensis
    [protein ID and protein name for S]
    [brief description of protein S]
    [log likelihood ratio score of S]
    [protein ID and protein name turned in for Homework #2]
    [comparison to Homework #2 and explanation]
    [original 14-protein multiple alignment aligned with S]
    
    [name of your personal prokaryote]
    [protein ID and protein name for S]
    [brief description of protein S]
    [log likelihood ratio score of S]
    [protein ID and protein name turned in for Homework #2]
    [comparison to Homework #2 and explanation]
    [original 14-protein multiple alignment aligned with S]
    
  2. Source files for your program, with appropriate filenames.
  3. README: a short text file explaining how to compile and run your program.
  4. any files describing extra credit work, whose filenames should begin with the prefix "extra".

Submit these files to the homework drop box at https://catalysttools.washington.edu/collectit/dropbox/lachesis/4587.