upstream sequences that are too short.
Genes on the negative strand
Suppose you want the upstream sequence for metK in B. subtilis.
Start with the protein table for B. subtilis and search for the gene
named metK to find the following lines:
3126906..3128108 - 401 16080107 metK COG0192 Bsu3050 S-adenosylmethionine synthetase
3128611..3130194 + 528 16080108 pckA COG1866 Bsu3051 phosphoenolpyruvate carboxykinase
This time the coding region of the gene is on the - strand, so
everything is reversed: its first codon starts at genomic position
3128108, and the position decreases as you read the coding region from
5' to 3' on the - strand, until the end of the stop codon at position
3126906. The next gene upstream (with respect to the - strand) is
pckA, whose coding region begins at position 3128611. Therefore the
upstream DNA sequence runs from 3128610 down to 3128109 on the -
strand, a total of 502 bp. If you fetch the genomic sequence
3128109-3128610, you will get:
>gi|16077068:3128109-3128610 Bacillus subtilis, complete genome
GATTTGCTTCCTCCTGCACAAGGCCTCCCGAAAGACCTTGTATATATGATACGGAACTCGCTCCCTCTTA
TACAATGTACAGTTATATTAGAGAATGTTAATTGGCATATTTATGAAATAAAAAAACCTTTTCCATCGAG
GAAAGGGTTTGGTCTTTGTGCCTTTCACTCTTATCGCTCAAGGAATCATACAACCTTGCAACAGGTTAGC
ACCTTGGTTGTCTCACTCAGTTGAACATAATAAATAACAGAGAAACCGGTTGCTGGGCTTCATAGGGCCT
GTCCCTCCGCCAGCTCGGGATAAGAGTATCCGCTCAATGAAATATCTTATCGTAAAAGGGTTTGCAATGT
CAATATGATTCAGAAGAAATAGGCACCTATATTGAGGGAAAACAATGGAAATGCACACACAAAAAACAAT
AAATAGTATAGACTATTTGAAAATATATGTTATACTAATTCACAATTAGCAAAACACAAAAAACGATAAA
GGAAGGTTTCAT
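The coordinate arithmetic for a - strand gene can be sketched in
Python as follows. This is only an illustration: the function name
upstream_interval_minus and its parameter names are hypothetical, and
it assumes the 1-based inclusive coordinates used in the protein table.

```python
def upstream_interval_minus(gene_high, neighbor_low):
    """Return the (low, high) interval upstream of a - strand gene.

    gene_high:    highest coordinate of the gene of interest; on the
                  - strand this is where its first codon starts.
    neighbor_low: lowest coordinate of the next annotated gene at
                  higher genomic positions.
    Coordinates are 1-based and inclusive (an assumption matching the
    protein table). Returns None if the two genes abut or overlap.
    """
    low = gene_high + 1
    high = neighbor_low - 1
    if high < low:
        return None
    return low, high


# metK occupies 3126906..3128108 (- strand); pckA begins at 3128611.
interval = upstream_interval_minus(3128108, 3128611)
print(interval)                          # (3128109, 3128610)
print(interval[1] - interval[0] + 1)     # 502
```

The interval returned is expressed in + strand coordinates, which is
the form in which genomic sequence is fetched.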
Note, however, that the sequence retrieved from the genome is, by
convention, always the sequence on the + strand, so you must compute
its reverse complement to get the upstream sequence of metK on the -
strand. To take the reverse complement, reverse the string and then
replace every nucleotide with its complementary nucleotide (A with T,
C with G, and vice versa). The resulting upstream sequence is:
>Bacillus_subtilis upstream sequence for metK
ATGAAACCTTCCTTTATCGTTTTTTGTGTTTTGCTAATTGTGAATTAGTATAACATATATTTTCAAATAGT
CTATACTATTTATTGTTTTTTGTGTGTGCATTTCCATTGTTTTCCCTCAATATAGGTGCCTATTTCTTCTG
AATCATATTGACATTGCAAACCCTTTTACGATAAGATATTTCATTGAGCGGATACTCTTATCCCGAGCTGG
CGGAGGGACAGGCCCTATGAAGCCCAGCAACCGGTTTCTCTGTTATTTATTATGTTCAACTGAGTGAGACA
ACCAAGGTGCTAACCTGTTGCAAGGTTGTATGATTCCTTGAGCGATAAGAGTGAAAGGCACAAAGACCAAA
CCCTTTCCTCGATGGAAAAGGTTTTTTTATTTCATAAATATGCCAATTAACATTCTCTAATATAACTGTAC
ATTGTATAAGAGGGAGCGAGTTCCGTATCATATATACAAGGTCTTTCGGGAGGCCTTGTGCAGGAGGAAGC
AAATC
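The reverse complement step is easy to get wrong by hand, so here is a
minimal Python sketch of it. The function name reverse_complement is
my own choice, and the sketch assumes the sequence contains only A, C,
G, and T (in either case); any other character is left unchanged.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string.

    Complementing first and then reversing gives the same result as
    reversing first and then complementing.
    """
    return seq.translate(_COMPLEMENT)[::-1]


# The last 12 bases of the + strand sequence fetched above become the
# first 12 bases of the metK upstream sequence.
print(reverse_complement("GGAAGGTTTCAT"))   # ATGAAACCTTCC
```

Applying the function a second time returns the original string,
which is a quick sanity check on any hand-edited sequence.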
Upstream sequences that are too long
If the upstream sequence constructed as described so far is longer
than 500 bp, use only the 500 bp closest to the start of the gene of
interest. Because the upstream sequence is written 5' to 3' on the
gene's own strand, these are its last 500 bases.
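As a sketch, the trimming can be done with a string slice in Python.
The function name trim_upstream is hypothetical, and the code assumes
the sequence has already been oriented 5' to 3' on the gene's own
strand (that is, already reverse complemented for a - strand gene), so
that the bases nearest the gene are at the end of the string.

```python
def trim_upstream(upstream, max_len=500):
    """Keep at most max_len bases, taken from the end nearest the gene.

    Assumes `upstream` reads 5' to 3' on the gene's strand, so its
    last base is the one immediately before the start codon.
    """
    if max_len <= 0:
        return ""
    return upstream[-max_len:]


# A 502-base sequence loses its first (5'-most) 2 bases.
example = "AA" + "C" * 500
print(len(trim_upstream(example)))   # 500
```

A sequence of 500 bp or less is returned unchanged.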
Upstream sequences that are too short: Operons
One feature that distinguishes prokaryotes (including the bacteria)
from eukaryotes is the widespread use of operons. An operon is a set
of 2 or more genes on the same DNA strand that are transcribed
together into a single mRNA molecule. In an operon, the regulatory
region controlling the transcription of all the genes lies upstream of
the first (5'-most) gene. This means that, if the gene of interest
lies in the middle of an operon, the relevant upstream region may not
lie immediately upstream of it.
Unfortunately, there is no simple rule for predicting when genes form
an operon. A good rule of thumb, though, is that if the noncoding
region upstream of the gene of interest is very short (say, less than
100 bp) and the next gene upstream is on the same strand, then you may
well be in the middle of an operon. To hedge your bets in this case, I
would recommend concatenating all of those short (less than 100 bp)
intergenic regions with the first intergenic region longer than 100 bp
that you encounter, provided all the genes are on the same strand.
As an example, suppose you were looking for the upstream region of the
ykrZ gene in B. subtilis. Here are some of the nearby entries in its
protein table:
1425107..1426303 + 399 16078422 ykrV COG0436 Bsu1360 unknown
1426500..1427744 + 415 16078423 ykrW COG1850 Bsu1361 unknown
1427741..1428448 + 236 16078424 ykrX COG4359 Bsu1362 unknown
1428406..1429035 + 210 16078425 ykrY COG0235 Bsu1363 unknown
1429050..1429586 + 179 16078426 ykrZ COG1791 Bsu1364 unknown
Note that there are only 14 bp of noncoding DNA upstream of ykrZ, none
upstream of ykrY or ykrX (yes, these coding regions actually overlap
according to the annotation), and 196 bp upstream of ykrW.
Furthermore, all 4 of these genes lie on the + strand.
Therefore, for the upstream region of ykrZ, I would concatenate the
DNA in the interval 1426304-1426499 with the DNA in the
interval 1429036-1429049, in that order.
Note that, if one of the genes ykrW, ykrX, or ykrY had been on the opposite
strand, then it would not make sense to assume these 4 genes were
all in the same operon, and you would have to use just 1429036-1429049
as the upstream region.
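The rule of thumb above can be sketched in Python as follows. This is
one possible reading of the rule, not a definitive procedure: the
function name operon_upstream_intervals, the tuple layout of GENES,
and the choice to treat a gap of exactly 100 bp as "long" are all my
assumptions, and the sketch handles only genes on the + strand.

```python
# (start, end, strand, name), sorted by start; taken from the protein
# table rows shown above.
GENES = [
    (1425107, 1426303, "+", "ykrV"),
    (1426500, 1427744, "+", "ykrW"),
    (1427741, 1428448, "+", "ykrX"),
    (1428406, 1429035, "+", "ykrY"),
    (1429050, 1429586, "+", "ykrZ"),
]


def operon_upstream_intervals(genes, name, short=100):
    """Return the intergenic intervals to concatenate, in 5' to 3' order."""
    idx = next(i for i, g in enumerate(genes) if g[3] == name)
    if genes[idx][2] != "+":
        raise ValueError("this sketch handles + strand genes only")
    intervals = []
    i = idx
    while i > 0:
        _, up_end, up_strand, _ = genes[i - 1]
        lo, hi = up_end + 1, genes[i][0] - 1
        gap_len = hi - lo + 1   # zero or negative if the genes abut or overlap
        if gap_len > 0:
            intervals.append((lo, hi))
        # Stop after the first long gap, or when the upstream neighbor
        # is on the other strand and so cannot share the operon.
        if gap_len >= short or up_strand != "+":
            break
        i -= 1
    intervals.reverse()         # report in increasing genomic order
    return intervals


print(operon_upstream_intervals(GENES, "ykrZ"))
# [(1426304, 1426499), (1429036, 1429049)]
```

If the strand of ykrW, ykrX, or ykrY in GENES is changed to "-", the
function returns only the interval (1429036, 1429049), in agreement
with the note above.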