Write a program that takes a protein sequence Q as input and finds the protein S in your very own personal prokaryote (from Homework #0) that has the highest scoring local alignment with Q. Use the algorithm from Section 5.1 of the CSE 527 lecture notes for each local alignment.
Use the BLOSUM62 scoring matrix to score any pair of amino acids. You can ignore the last two rows and columns, unless that character happens to come up in one of your proteins. Use a score of -4 for each insertion or deletion (i.e., the alignment of the gap character with an amino acid).
You can find a list of all your prokaryote's protein sequences in the .faa file on the prokaryote's NCBI Microbial FTP page. For the query protein Q, use the 260 amino acid sequence
MTHQSHAYHMVKPSPWPLTGALSALLMTSGLAMWFHFHSMTLLM LGLLTNTLTMYQWWRDVTRESTYQGHHTPPVQKGLRYGMILFITSEVFFFAGFFWAFY HSSLAPTPQLGGHWPPTGITPLNPLEVPLLNTSVLLASGVSITWAHHSLMENNRNQMI QALLITILLGLYFTLLQASEYFESPFTISDGIYGSTFFVATGFHGLHVIIGSTFLTIC FIRQLMFHFTSKHHFGFEAAAWYWHFVDVVWLFLYVSIYWWGThis is actually a human protein sequence; one of the goals of this homework is to see how close a match there is to certain human proteins in your prokaryote, despite how distant humans are evolutionarily from prokaryotes.
It will actually be somewhat time-consuming to run the pairwise alignment algorithm for each of your prokaryote's protein sequences. Since your prokaryote has thousands of genes, I estimate that the total number of dynamic programming matrix entries you will need to compute is in the hundreds of millions, so be careful about efficiency. One thing you should do to save time is only compute the score of the optimal local alignment, not reconstruct the alignment itself except once at the very end for the single best-scoring protein S that you found.
Extra credit:
You will actually run your program as described above on two separate genomes: once on your personal prokaryote and once on the "community prokaryote" Shewanella oneidensis. For each of these individually, you will report the protein S that has the best local alignment with Q. Turn in the following items:
>gi|15826866|ref|NP_301129.1| chromosomal replication initiation protein [Mycobacterium leprae TN]
Q 50 YETIKVTAYRYGWIFEY------------------PNGSKSL----NTLYIKAGKLYRLE 87 Y T+K +++ W +EY PNG+ L + + + A R+ S 95 YVTVKAMGHQWYWNYEYTDGTNVSFDSYMIQTQDLPNGAPRLLEVDHRMVMPANLQTRVV 154If the two aligned amino acids are identical, the middle line repeats that amino acid. If the BLOSUM62 scoring matrix gives a positive score to two nonidentical aligned amino acids, the middle line shows a "+". The indices at the beginning and end of each sequence show the start and end indices of the (locally) aligned sequences within their full protein sequences. If the alignment is too long to fit on one line, break it into multiple lines.
Your turn-in will consist of the following files, named as shown.
[your name] Shewanella oneidensis [running time on Shewanella oneidensis] [protein ID and protein name for S] [brief description of protein S] [score of optimal local alignment of Q and S] [optimal local alignment of Q and S] [name of your personal prokaryote] [running time on your personal prokaryote] [protein ID and protein name for S] [brief description of protein S] [score of optimal local alignment of Q and S] [optimal local alignment of Q and S]
Submit these files to the homework drop box at https://catalysttools.washington.edu/collectit/dropbox/lachesis/4587.