Write a program that trains a profile hidden Markov model (HMM) M from a multiple protein alignment, and then finds the protein in a given prokaryote that has the highest scoring Viterbi score with respect to M.
You will use the multiple alignment A of 12 proteins from Homework #2 results to train the profile HMM M. (See Figure 5.2 of Durbin et al. for the topology of a profile HMM.) Rather than running the Viterbi algorithm on your own personal prokaryote and the community prokaryote (for which most of you already found a great match in Homework #2), you will all run it on the two personal prokaryotes for which the match scores were lowest in Homework #2, Oligotropha carboxidovorans OM4 and Polymorphum gilvum SL003B-26A1, to see if these prokaryotes have any better matches to this protein family. It will be exciting to see whether this HMM approach can find anything better than the Homework #2 approach.
Every column of this alignment A that has 6 or fewer gap characters will correspond to a Match state in the HMM, and those sequences with a gap in this column will take the path through the corresponding Delete state. Every column that has 7 or more gap characters will not correspond to any Match state, and those sequences with an amino acid in this column will take the path through the corresponding Insert state.
The Viterbi score you will be using to score each of your prokaryote's proteins S is the log likelihood ratio of the Viterbi probability of S and the background probability of S. See Equations (5.1) of Durbin et al. for the recurrences to use. You will need to modify this set of recurrences slightly: you need to add a recurrence with VEnd(i) on the left hand side (hint: the End state has no emissions, so this recurrence looks very much like the recurrence for the Delete states) and you need to modify the recurrences for the states I0, M1, and D1 (hint: these each have only two inputs, one of which is the Begin state). The basis for the recurrence is VBegin(0) = 0, Vk(0) = -∞ if k is any Match, Insert, or End state, and VBegin(i) = -∞ for i > 0. All your logarithms should be computed base 2.
As in Homework #2, you can find a list of all the prokaryote's protein sequences in the .faa file(s) on the prokaryote's NCBI Microbial FTP page. Compute the background frequency qb of each of the 20 amino acids b from these files. The word "frequency" here means the fraction of amino acids that are b in the file(s); that is, the 20 values of qb should sum to 1. (Throughout this homework, treat any letter in the .faa files that does not represent one of the 20 amino acids as though it were an A.) You will need these frequencies for the recurrences and for the adjustments described in the next paragraph.
Because there are only 12 characters in each column of the alignment, when training the HMM some emission probabilities would be set to 0, but this is actually an artifact of the small sample size of 12. Instead of using the actual counts Ek(b) of the number of times b was emitted in state k, use the adjusted counts Ek(b) + qb (in both numerator and denominator) when computing the emission probability ek(b). This way, ek(b) will be a small positive number even if Ek(b) = 0. Similarly, use Ajk + Δjk rather than Ajk (in both numerator and denominator) when computing the transition probability ajk, where Δjk = 0.9 if j and k are both the same type of state (that is, both Match states, or both Insert states, or both Delete states) and Δjk = 0.05 otherwise. If j is the Begin state, use Δjk = 0.1 if k is the Match state and Δjk = 0.45 otherwise. If j is the last Delete state, use Δjk = 0.9 if k is the End state and Δjk = 0.1 otherwise. If j is the last Match or last Insert state, use Δjk = 0.1 if k is the End state and Δjk = 0.9 otherwise.
Finally, for the protein S* that has the best log likelihood ratio score, determine its most probable path through the HMM M by doing the traceback of the Viterbi algorithm. From this path, align S* to the 12 other proteins in the input alignment.
Here is an outline of your entire program, to help you get started:
Training:
You will actually run your program as described above on two separate genomes, Oligotropha carboxidovorans OM4 and Polymorphum gilvum SL003B-26A1. (Be sure to use the background frequencies from the appropriate prokaryote to train and test the HMM.) For each of these two prokaryotes, you will report the protein that has the best log likelihood ratio score. Turn in the following items:
>gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]
M_0,M_1,I_1,I_1,I_1,M_2,D_3,D_4,M_5,...,M_601M_0 is the Begin state and M_601 is the End state if there are (in this example) 600 columns with 6 or fewer gap characters in the alignment. If the path goes through an Insert state I_j x times, then I_j should be repeated x times in your path. The path should be on a single line of your file.
Your turn-in will consist of the following files, named as shown.
[your name]
[one blank line]
[protein ID and protein name for S*]
[brief description of protein S*]
[log likelihood ratio score of S*]
[Viterbi path for S*]
[comparison to Homework #2 and explanation]
[original 12-protein multiple alignment aligned with S*]
Submit these files to the homework drop box at https://catalyst.uw.edu/collectit/dropbox/tompa/25138.