University of Washington Computer Science & Engineering
 CSE 490MT: Upstream Sequences

    Before reading this page, you should be comfortable with the ordinary case of retrieving the upstream sequence. On this page, you will learn how to handle:
  • genes on the negative strand,
  • upstream sequences that are too long, and
  • upstream sequences that are too short.

    Genes on the negative strand

    Suppose you want the upstream sequence for metK in B. subtilis. Start with the Protein table for B. subtilis. Search for the gene named metK to find the following lines:

    3126906..3128108  -  401  16080107  metK  COG0192  Bsu3050  S-adenosylmethionine synthetase
    3128611..3130194  +  528  16080108  pckA  COG1866  Bsu3051  phosphoenolpyruvate carboxykinase
    

    This time the coding region of the gene is on the - strand, so everything is reversed. Its first codon starts at genomic position 3128108, and the position decreases as you read the coding region from 5' to 3' on the - strand, until the end of the stop codon at position 3126906. The next gene upstream (with respect to the - strand) is pckA, which begins at position 3128611. Therefore the upstream DNA sequence runs from 3128610 down to 3128109 on the - strand, a total of 502 bp. If you fetch the genomic sequence 3128109-3128610, you will get:

    >gi|16077068:3128109-3128610 Bacillus subtilis, complete genome
    GATTTGCTTCCTCCTGCACAAGGCCTCCCGAAAGACCTTGTATATATGATACGGAACTCGCTCCCTCTTA
    TACAATGTACAGTTATATTAGAGAATGTTAATTGGCATATTTATGAAATAAAAAAACCTTTTCCATCGAG
    GAAAGGGTTTGGTCTTTGTGCCTTTCACTCTTATCGCTCAAGGAATCATACAACCTTGCAACAGGTTAGC
    ACCTTGGTTGTCTCACTCAGTTGAACATAATAAATAACAGAGAAACCGGTTGCTGGGCTTCATAGGGCCT
    GTCCCTCCGCCAGCTCGGGATAAGAGTATCCGCTCAATGAAATATCTTATCGTAAAAGGGTTTGCAATGT
    CAATATGATTCAGAAGAAATAGGCACCTATATTGAGGGAAAACAATGGAAATGCACACACAAAAAACAAT
    AAATAGTATAGACTATTTGAAAATATATGTTATACTAATTCACAATTAGCAAAACACAAAAAACGATAAA
    GGAAGGTTTCAT
    

    But the sequence retrieved from the genome is, by convention, always the sequence on the + strand, so you must compute its reverse complement in order to get the upstream sequence of metK on the - strand. "Reverse complement", of course, means reverse the string and then change every nucleotide to its complementary nucleotide. The resulting upstream sequence then would be:

    >Bacillus_subtilis upstream sequence for metK
    ATGAAACCTTCCTTTATCGTTTTTTGTGTTTTGCTAATTGTGAATTAGTATAACATATATTTTCAAATAGT
    CTATACTATTTATTGTTTTTTGTGTGTGCATTTCCATTGTTTTCCCTCAATATAGGTGCCTATTTCTTCTG
    AATCATATTGACATTGCAAACCCTTTTACGATAAGATATTTCATTGAGCGGATACTCTTATCCCGAGCTGG
    CGGAGGGACAGGCCCTATGAAGCCCAGCAACCGGTTTCTCTGTTATTTATTATGTTCAACTGAGTGAGACA
    ACCAAGGTGCTAACCTGTTGCAAGGTTGTATGATTCCTTGAGCGATAAGAGTGAAAGGCACAAAGACCAAA
    CCCTTTCCTCGATGGAAAAGGTTTTTTTATTTCATAAATATGCCAATTAACATTCTCTAATATAACTGTAC
    ATTGTATAAGAGGGAGCGAGTTCCGTATCATATATACAAGGTCTTTCGGGAGGCCTTGTGCAGGAGGAAGC
    AAATC
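
    The two steps above (finding the interval, then taking the reverse complement) can be sketched in Python. This is only an illustration: the function names are my own, and treating N as its own complement is an assumption, not something this page specifies.

```python
# Map each nucleotide to its complement. Keeping N as N is an assumption.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse the string, then complement every nucleotide."""
    return seq[::-1].translate(COMPLEMENT)


def minus_strand_upstream_interval(gene_high, next_gene_low):
    """Return the + strand interval (first, last) upstream of a - strand gene.

    gene_high     -- highest coordinate of the gene (where its first codon starts)
    next_gene_low -- lowest coordinate of the next gene at higher coordinates
    """
    return gene_high + 1, next_gene_low - 1


# metK ends at 3128108 on the + strand; pckA begins at 3128611.
first, last = minus_strand_upstream_interval(3128108, 3128611)
print(first, last, last - first + 1)  # 3128109 3128610 502

# The last 12 bases fetched from the + strand become the first 12 bases
# of the upstream sequence on the - strand.
print(reverse_complement("GGAAGGTTTCAT"))  # ATGAAACCTTCC
```

    Applying reverse_complement to the whole 502 bp sequence fetched above should give the metK upstream sequence shown.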
    

    Upstream sequences that are too long

    If the upstream sequence constructed as described so far is longer than 500 bp, use only the 500 bp closest to the start of the gene of interest. For example, the 502 bp metK sequence above would lose the 2 bp farthest from metK.
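
    A minimal sketch of this trimming rule, in Python. It assumes the upstream sequence is already oriented 5' to 3' on the gene's own strand, so that the gene begins right after the last character. The function names and the coordinate version are my own illustration.

```python
MAX_UPSTREAM = 500  # length limit from the rule above


def trim_upstream(upstream, limit=MAX_UPSTREAM):
    """Keep the `limit` bases closest to the gene start.

    Assumes `upstream` reads 5' to 3' on the gene's strand, so the
    bases closest to the gene are at the end of the string.
    """
    if limit <= 0:
        return ""
    return upstream[-limit:]


def trim_interval(first, last, strand, limit=MAX_UPSTREAM):
    """Same rule applied to + strand coordinates (first <= last).

    For a + strand gene the gene follows `last`; for a - strand gene
    the gene precedes `first`.
    """
    if strand == "+":
        return max(first, last - limit + 1), last
    return first, min(last, first + limit - 1)


print(len(trim_upstream("A" * 502)))             # 500
print(trim_interval(3128109, 3128610, "-"))      # (3128109, 3128608)
```

    For metK, trimming the interval before fetching means requesting 3128109-3128608 and then taking the reverse complement.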

    Upstream sequences that are too short: Operons

    Something that distinguishes prokaryotes (including the bacteria) from eukaryotes is the existence of operons. An operon is a set of 2 or more genes on the same DNA strand that are transcribed together into a single mRNA molecule. The regulatory region controlling the transcription of all the genes in an operon lies upstream of the first (5'-most) gene in the operon. This means that, if the gene of interest lies in the middle of an operon, the relevant upstream region may not lie immediately upstream of it.

    Unfortunately, there is no simple rule for predicting when genes form an operon. But a good rule of thumb is that, if there is a very short noncoding region (say, less than 100 bp) upstream of the gene of interest, and the next gene upstream is on the same strand, then you may well be in the middle of an operon. To hedge your bets, I recommend in this case that you keep moving upstream, concatenating all those short (less than 100 bp) intergenic regions with the first intergenic region longer than 100 bp that you encounter, provided all the genes are on the same strand.

    As an example, suppose you were looking for the upstream region of the ykrZ gene in B. subtilis. Here are some of the nearby entries in its protein table:

    1425107..1426303   +   399  16078422    ykrV  COG0436   Bsu1360  unknown
    1426500..1427744   +   415  16078423    ykrW  COG1850   Bsu1361  unknown
    1427741..1428448   +   236  16078424    ykrX  COG4359   Bsu1362  unknown
    1428406..1429035   +   210  16078425    ykrY  COG0235   Bsu1363  unknown
    1429050..1429586   +   179  16078426    ykrZ  COG1791   Bsu1364  unknown
    

    Note that there are only 14 bp of noncoding DNA upstream of ykrZ, none upstream of ykrY or ykrX (yes, these coding regions actually overlap according to the annotation), and 196 bp upstream of ykrW. Furthermore, all 4 of these genes lie on the same + strand. Therefore, for the upstream region of ykrZ, I would concatenate the DNA in the interval 1426304-1426499 with the DNA in the interval 1429036-1429049.

    Note that, if one of the genes ykrW, ykrX, or ykrY had been on the opposite strand, then it would not make sense to assume these 4 genes were all in the same operon, and you would have to use just 1429036-1429049 as the upstream region.
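
    The rule of thumb above can be sketched in Python for a gene of interest on the + strand. This is one possible reading of the rule, not a definitive implementation: the Gene type and function name are hypothetical, a region of exactly 100 bp is treated as long (an assumption), and genes on the - strand would need the mirror-image walk toward higher coordinates.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # lower + strand coordinate
    end: int     # higher + strand coordinate
    strand: str  # "+" or "-"
    name: str


def plus_strand_upstream_intervals(genes, name, short=100):
    """Collect the intergenic intervals to concatenate for a + strand gene.

    `genes` must be sorted by start coordinate. Walk upstream, keeping
    each nonempty intergenic interval, and stop after the first one that
    is not short or when the next gene upstream is on the other strand.
    """
    idx = next(i for i, g in enumerate(genes) if g.name == name)
    intervals = []
    while idx > 0:
        gene, prev = genes[idx], genes[idx - 1]
        first, last = prev.end + 1, gene.start - 1
        if first <= last:                # overlapping genes leave no gap
            intervals.append((first, last))
        gap = max(0, last - first + 1)
        if gap >= short or prev.strand != gene.strand:
            break
        idx -= 1
    intervals.reverse()                  # farthest upstream region first
    return intervals


ykr = [
    Gene(1425107, 1426303, "+", "ykrV"),
    Gene(1426500, 1427744, "+", "ykrW"),
    Gene(1427741, 1428448, "+", "ykrX"),
    Gene(1428406, 1429035, "+", "ykrY"),
    Gene(1429050, 1429586, "+", "ykrZ"),
]
print(plus_strand_upstream_intervals(ykr, "ykrZ"))
# [(1426304, 1426499), (1429036, 1429049)]
```

    Changing the strand of ykrY to "-" in this table makes the function return only (1429036, 1429049), matching the note above.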


Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to tompa]