You will use, modify, and extend a program to compute the GC content of DNA data. The GC content of DNA is the percentage of nucleotides that are either G or C.

DNA is made up of a sequence of nucleotides. Each nucleotide is adenine, cytosine, guanine, or thymine. These are abbreviated as A, C, G, and T. A nucleotide is also called a nucleotide base, nitrogenous base, nucleobase, or just a base. This assignment specification primarily uses “nucleotide.”
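To make the definition of GC content concrete, here is a minimal sketch in Python. The function name gc_content is hypothetical and is not part of the starter code. The sketch assumes the sequence is a non-empty string containing only the uppercase letters A, C, G, and T.

```python
def gc_content(nucleotides):
    """Return the percentage (0 to 100) of letters in `nucleotides` that are G or C.

    Hypothetical helper for illustration only; assumes an uppercase string
    made up of the letters A, C, G, and T.
    """
    if not nucleotides:
        # A percentage is undefined when there are no nucleotides at all.
        raise ValueError("cannot compute GC content of an empty sequence")
    gc_count = 0
    for nucleotide in nucleotides:
        if nucleotide == "G" or nucleotide == "C":
            gc_count += 1
    return 100 * gc_count / len(nucleotides)
```

For example, gc_content("ATGC") counts two of four letters as G or C and so returns 50.0.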

Biologists have multiple reasons to be interested in GC content:

  • GC content can identify the location of genes within DNA, and can identify types of genes. Genes tend to have higher GC content than other parts of the DNA. Genes with longer coding regions have even higher GC content.
  • Regions of DNA with higher GC content require higher temperatures for certain chemical reactions, such as when copying/duplicating the DNA.
  • GC content can be used in determining classification of species.

Your program will read data files produced by a high-throughput sequencer, a machine that takes some DNA as input and produces as output a file containing a sequence of nucleotides.

Here are the first eight lines of output from a sequencer:

@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG
+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

The nucleotide data is on the second line, the sixth line, the tenth line, and so on: every fourth line, starting with the second. To calculate GC content, you will find the percentage of the letters on these lines that are G or C. Your program will not use the rest of the file, which provides information about the sequencer and the sequencing process that created the nucleotide data. The homework’s starter code provides the function that reads the appropriate lines of the file into a string for your program to process. Still, when you check your work by inspecting the files yourself, it helps to know which parts of the file are being used.
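The line pattern described above can be sketched in Python as follows. This is only an aid for checking your work by hand, not the starter code's reading function. The names nucleotide_lines and sample_lines are hypothetical, and the sketch assumes the file's lines have already been read into a list.

```python
def nucleotide_lines(lines):
    """Return the lines that hold nucleotide data: the 2nd, 6th, 10th, and so on.

    Hypothetical helper for illustration only. `lines` is a list of strings,
    one per line of sequencer output.
    """
    # Index 1 is the second line; a step of 4 then selects every fourth line.
    return [line.strip() for line in lines[1::4]]


# The first eight lines of sequencer output shown earlier.
sample_lines = [
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
]

reads = nucleotide_lines(sample_lines)
# Join the selected lines into one string of nucleotides.
all_nucleotides = "".join(reads)
```

Here reads holds only the two lines made up of A, C, G, and T. The lines beginning with @ or + and the lines of h characters are skipped.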