University of Washington Computer Science & Engineering
 CSE 490MT: Homologous Genes

    Suppose you want to find a list of proteins whose amino acid sequences are similar to that of the protein encoded by the metK gene in the bacterium Bacillus subtilis. (We will return later to the question of why you might choose to start with this gene.) Here is how you might go about it by hand:
  1. Start with the list of completed microbial genomes. Find the line for Bacillus subtilis and click on the P link to get to the:
  2. Protein table for B. subtilis. Search for the gene named metK and click on its PID link (marked "16080107") to get a list of:
  3. Proteins similar to metK of B. subtilis.

We're going to pick some number of the top-scoring genes from this list to form our data set. For each such gene, we will need two things: (1) the amino acid sequence and (2) the DNA sequence upstream of the gene.

(Incidentally, no two of the upstream DNA sequences you choose should be too similar to each other. Therefore, you should never choose two strains of the same species. For example, #2 and #4 on the list are:

1771  8 AAS43815 42739890 S-adenosylmethionine synthetase [Bacillus cereus ATCC 10987]
1768  8 AAP11666 29898393 S-adenosylmethionine synthetase [Bacillus cereus ATCC 14579]
These are genes from two different strains of Bacillus cereus. Choose one or the other, but not both.)
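
The one-strain-per-species rule can also be applied mechanically. The following is a minimal Python sketch, not part of the assignment's required procedure. It assumes the hit list has been saved as plain text with one hit per line in the layout shown above (score, a second numeric column, accession, PID, description, organism name in square brackets), and it assumes that the first two words of the organism name identify the species. The function name `one_hit_per_species` is hypothetical.

```python
import re

# Assumed line layout: score, a second number, accession, PID,
# description, then the organism name in square brackets.
HIT_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(.*?)\s*\[(.+)\]\s*$")

def one_hit_per_species(lines):
    """Keep only the highest-scoring hit for each species."""
    best = {}
    for line in lines:
        m = HIT_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like hits
        score = int(m.group(1))
        organism = m.group(6)
        # Assumption: genus and species are the first two words;
        # anything after that is treated as a strain designation.
        species = " ".join(organism.split()[:2])
        if species not in best or score > best[species][0]:
            best[species] = (score, m.group(3), organism)
    # Return (score, accession, organism) tuples, highest score first.
    return sorted(best.values(), reverse=True)

hits = [
    "1771  8 AAS43815 42739890 S-adenosylmethionine synthetase [Bacillus cereus ATCC 10987]",
    "1768  8 AAP11666 29898393 S-adenosylmethionine synthetase [Bacillus cereus ATCC 14579]",
    "1570  4 AAM23768 20515476 S-adenosylmethionine synthetase [Thermoanaerobacter tengcongensis]",
]
selected = one_hit_per_species(hits)
for score, accession, organism in selected:
    print(score, accession, organism)
```

With the three example lines above, only one of the two Bacillus cereus strains survives the filter. The two-word species heuristic is a simplification and may need adjusting for organism names that do not follow the genus-species pattern.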

As an example of the required data collection, consider the S-adenosylmethionine synthetase gene in Thermoanaerobacter tengcongensis, which is about #11 in the list above, with a similarity score of 1570 to metK in B. subtilis:

1570  4 AAM23768 20515476 S-adenosylmethionine synthetase [Thermoanaerobacter tengcongensis]
(To get an idea of how similar this means the two proteins are, click on the 1570 score link on the page of proteins similar to metK of B. subtilis.)

Here is how you might find the two required items for the case of Thermoanaerobacter tengcongensis. Find the protein table for Thermoanaerobacter tengcongensis, as above. Find the line for S-adenosylmethionine synthetase in this table:
499626..500147    +   174  20806992                    TTE0487  Rubrerythrin
500405..501592    +   396  20806993    MetK            TTE0488  S-adenosylmethionine synthetase
From the line for MetK, note the following:
  1. The coding region of the MetK gene is on the + strand, and its first codon starts at genomic position 500405. The next gene upstream ends at position 500147. Therefore the upstream DNA sequence goes from 500148 to 500404. (There are additional slight complications if this gene is on the - strand, or if the upstream sequence is too long or too short.)
  2. Click on the red diamond in the line for S-adenosylmethionine synthetase in this table. This brings you to the amino acid sequence for the protein product.
  3. Return to the line for Thermoanaerobacter tengcongensis in the list of completed microbial genomes. Click on the F link to get to the FTP site for this bacterium, and from there select the .fna file to get the complete genomic sequence for T. tengcongensis. Extracting the substring from positions 500148 to 500404 of this sequence gives the upstream sequence for the S-adenosylmethionine synthetase gene.
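
The coordinate arithmetic in the steps above can be sketched in Python as follows. This is only an illustration for a gene on the + strand; it does not handle the - strand or the too-long and too-short cases mentioned in step 1. It assumes the protein table lines begin with `start..end` followed by the strand, as in the two lines shown, and that genomic positions are 1-based and inclusive. The function names are hypothetical, and the genome string in the last example is a toy sequence, not real data.

```python
def parse_table_line(line):
    """Return (start, end, strand) from a protein table line.

    Assumed layout: 'start..end  strand  ...' as in the lines above.
    """
    fields = line.split()
    start, end = (int(x) for x in fields[0].split(".."))
    return start, end, fields[1]

def upstream_interval(prev_gene_end, gene_start):
    """Interval strictly between the previous gene and this gene (+ strand only)."""
    return prev_gene_end + 1, gene_start - 1

def extract(genome, start, end):
    """Return positions start..end of genome, 1-based and inclusive."""
    # Python slices are 0-based and exclude the end index,
    # so shift the start by one and leave the end as is.
    return genome[start - 1:end]

def read_genome(path):
    """Read a single-sequence .fna file, skipping the '>' header line."""
    with open(path) as f:
        return "".join(line.strip() for line in f if not line.startswith(">"))

prev_gene = parse_table_line(
    "499626..500147    +   174  20806992                    TTE0487  Rubrerythrin")
metk = parse_table_line(
    "500405..501592    +   396  20806993    MetK            TTE0488  S-adenosylmethionine synthetase")

interval = upstream_interval(prev_gene[1], metk[0])
length = interval[1] - interval[0] + 1
print(interval, length)

# Toy genome, used only to show the 1-based slicing convention.
toy = extract("AACCGGTTAC", 3, 6)
print(toy)
```

For the real data you would pass the string returned by `read_genome` (given the path to the downloaded .fna file) and the computed interval to `extract`.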

Here are the sorts of compilations of amino acid sequences and upstream DNA sequences you might obtain from this process. These files are in "FASTA format", which you will also need to use.
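
As a rough illustration of FASTA format: each record is a header line beginning with ">", followed by one or more lines of sequence. The Python sketch below writes and reads such records. The record names and sequences are made-up toy values, the function names are hypothetical, and the 60-character line width is just one common convention, not a requirement stated here.

```python
def format_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        # Wrap the sequence into lines of at most `width` characters.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

def parse_fasta(text):
    """Parse FASTA text back into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy records: names and sequences are placeholders, not real data.
records = [
    ("toy_upstream_1", "ACGT" * 20),
    ("toy_upstream_2", "TTGACA"),
]
text = format_fasta(records)
print(text, end="")
```

Parsing the text back with `parse_fasta(text)` returns the original list of records, which is a quick way to check that a file you produce is well formed.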


[comments to tompa]