Homework 2: DNA analysis

Due: at 11pm on June 28. Submit via Catalyst CollectIt (a.k.a. Dropbox).

You will use, modify, and extend a program to compute the GC content of DNA data.

DNA can be thought of as a sequence of nucleotides. Each nucleotide is adenine, cytosine, guanine, or thymine. These are abbreviated as A, C, G, and T. The GC content of DNA is the percentage of nucleotides that are either G or C. A nucleotide is also called a nucleotide base, nitrogenous base, nucleobase, or just a base.
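
As a small illustration of the definition above, here is a minimal sketch that computes the GC content of a short made-up sequence. The function name gc_content is an assumption chosen for this example and is not taken from dna_analysis.py.

```python
def gc_content(seq):
    """Return the fraction of nucleotides in seq that are G or C.

    Hypothetical helper for illustration; assumes seq is non-empty.
    """
    gc_count = 0
    for base in seq:
        if base == "G" or base == "C":
            gc_count += 1
    # float() keeps the division exact under both Python 2 and Python 3.
    return float(gc_count) / len(seq)


# "ACGT" contains one G and one C among four nucleotides.
print(gc_content("ACGT") * 100)  # GC content as a percentage
```

Multiplying the fraction by 100 gives the percentage form used in the rest of this assignment.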

Biologists have multiple reasons to be interested in GC content.

GC content can identify genes within the DNA, and can identify types of genes. Genes tend to have higher GC content than other parts of the DNA. Genes with longer coding regions have even higher GC content.
Regions of DNA with higher GC content require higher temperatures for certain chemical reactions, such as when copying/duplicating the DNA.
GC content can be used in determining classification of species.

If you are curious, Wikipedia has more information about GC content. That reading is optional and is not required to complete this assignment.

Your program will read files produced by a high-throughput sequencer — a machine that takes as input some DNA, and produces as output a file containing a sequence of nucleotides.

Here are the first 8 lines of output from a particular sequencer:

@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG
+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

The nucleotide data is in the second line, the sixth line, the tenth line, etc. (every fourth line, starting with the second). Your program will not use the other lines, which hold an identifier for each read and a quality score for each nucleotide.
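
The line pattern described above can be sketched as follows, using the eight sample lines shown earlier. This is only one way to pick out the nucleotide lines and may differ from how dna_analysis.py does it; the names fastq_lines and nucleotide_lines are assumptions made for this example.

```python
# The eight sample lines from above, stored as a list of strings.
fastq_lines = [
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
]


def nucleotide_lines(lines):
    """Return lines 2, 6, 10, ... (counting from 1) of a FASTQ-style list."""
    selected = []
    linenum = 0
    for line in lines:
        linenum += 1
        # Line numbers 2, 6, 10, ... all leave remainder 2 when divided by 4.
        if linenum % 4 == 2:
            selected.append(line.strip())
    return selected


for line in nucleotide_lines(fastq_lines):
    print(line)
```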

Problem 1: Obtain the files, add your name

Obtain the files you need by downloading the homework2.zip file and unzipping it to create a homework2 directory/folder. (This is a large download — be patient.)

You will do your work by modifying two files — dna_analysis.py and answers.txt — and then submitting the modified versions. Add your name to the top of each of these files.

You will submit answers.txt as a text file. Plain text is the standard for communicating information among programmers, because it can be read on any computer without installing proprietary software. You can edit text files using IDLE. If you choose to edit it using a word processor such as Microsoft Word, be sure to save it as text.

Problem 2: Run the program

It is a good idea to check the correctness of your program by comparing it to a computation done in some other way, such as by hand or by a different program. We have provided the test-small.fastq file for this purpose. First, examine the file by hand to determine the correct answer. Then, run your program to verify that it provides the correct answer for this file.

Run your program by opening a shell or command prompt, then typing the following command. Recall that the “shell” or “command prompt” or “command line” is a service that is provided by your operating system, and the Computing Resources webpage tells you how to start one. You will not type the following command to the Python interpreter that is provided by IDLE.

python dna_analysis.py test-small.fastq

You have to execute this command in the directory/folder that contains the file test-small.fastq. You can find out what directory your shell or command prompt is in by typing pwd (for “print working directory”), and you can change the directory by typing cd (for “change directory”).

After you have confirmed that your program runs correctly on test-small.fastq, run your program on each of the 6 real sample_N.fastq files provided, by executing 6 commands such as

python dna_analysis.py sample_1.fastq

(but you will have to change sample_1.fastq to a different file name in the subsequent commands). Be patient — you are processing a lot of data, and it might take a minute or so to run.

(sample_3.fastq and sample_4.fastq are from Streptococcus pneumoniae TIGR4, and sample_5.fastq is from Human herpesvirus 5.)

Cut and paste the 6 lines of output (one for each run) into the answers.txt file.

Problem 3: Remove some lines

In your program, comment out these lines

  seq = ""
  linenum = 0

by prefixing each of them with the # character. Run the program, then explain in answers.txt what happened and why it happened. Now, restore the lines to their original state by removing the # characters that you added.

What would happen if you commented out this line?

  gc_count = 0

Explain (in answers.txt).

Problem 4: Compute AT ratio

Augment your program so that, in addition to computing and printing the GC ratio, it also computes and prints the AT ratio. The AT ratio is the percentage of nucleotides that are A or T.

Two ways to compute the AT ratio are:

Copy the existing loop that examines each nucleotide. You will now have two loops, one of which computes the GC count and one of which computes the AT count.
Add more statements into the existing loop, so that one loop computes both the GC count and the AT count.

You may use whichever approach you prefer.
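
The second approach can be sketched on a short made-up sequence as follows. This is an illustration of the pattern only, not the provided program; the function name count_gc_and_at and the variable at_count are assumptions made for this example.

```python
def count_gc_and_at(seq):
    """Count G/C and A/T nucleotides in a single pass over seq.

    Hypothetical helper for illustration; returns (gc_count, at_count).
    """
    gc_count = 0
    at_count = 0
    for base in seq:
        if base == "G" or base == "C":
            gc_count += 1
        if base == "A" or base == "T":
            at_count += 1
    return gc_count, at_count


gc_count, at_count = count_gc_and_at("GATTACA")
print(gc_count)
print(at_count)
```

With two loops, each loop would hold one of the two if statements; the counts come out the same either way.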

Check your work by manually computing the AT ratio for file test-small.fastq, then comparing it to the output of running your program on test-small.fastq.

Run your program on the 6 provided sample_N.fastq files, and cut-and-paste the output into answers.txt.

Problem 5: Count nucleotides

Augment your program so that it also computes and prints the number of A nucleotides, the number of T nucleotides, the number of G nucleotides, and the number of C nucleotides.

When doing this, add at most one extra loop to your program. You can solve this part without adding any new loops at all, by reusing an existing loop.

Check your work by manually computing the results for file test-small.fastq, then comparing them to the output of running your program on test-small.fastq.

Cut-and-paste the relevant parts of the output of your program into answers.txt, just as you did for the previous parts of this assignment.

Problem 6: Sanity-check the data

For each .fastq file, compare the following quantities:

the sum of the A count, the C count, the G count, and the T count
the total_count variable
the length of the seq variable. You can compute this with len(seq).

For at least one file, at least two of these metrics will differ. Which file(s)? Which metrics? Write a short paragraph that explains why. (Explaining why might require you to do some detective work.)

Problem 7: Compute the AT/GC ratio

Sometimes biologists use the AT/GC ratio, defined as (A+T)/(G+C), rather than the GC content, which is defined as (G+C)/(A+C+G+T).
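
To make the two definitions concrete, here is a minimal sketch that evaluates both formulas from four made-up nucleotide counts. The function names and the counts are assumptions made for this example; the sketch also assumes that the denominators are not zero.

```python
def gc_content(a, c, g, t):
    """GC content: (G+C)/(A+C+G+T), returned as a fraction."""
    return float(g + c) / (a + c + g + t)


def at_gc_ratio(a, c, g, t):
    """AT/GC ratio: (A+T)/(G+C). Fails if there are no G or C nucleotides."""
    return float(a + t) / (g + c)


# Made-up counts: 3 A, 1 C, 1 G, 3 T.
print(gc_content(3, 1, 1, 3))
print(at_gc_ratio(3, 1, 1, 3))
```

Note that the two quantities are on different scales: GC content is always between 0 and 1, while the AT/GC ratio can be any non-negative number.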

Given your answer to problem 6 (“Sanity-check the data”), describe (in answers.txt) both a correct and an incorrect way to compute the AT/GC ratio.

Then, modify your program so that it uses the correct way. Don't remove any of your program's existing functionality — just augment it to also print the AT/GC ratio.

Check your work by manually computing the results for file test-small.fastq, then comparing them to the output of running your program on test-small.fastq.

Cut-and-paste the relevant parts of the output of your program into answers.txt, just as you did for the previous parts of this assignment.

Problem 8: Categorize organisms

The GC content can be used to categorize organisms.
The GC content of Streptomyces coelicolor A3(2) is 72%.
The GC content of Yeast (Saccharomyces cerevisiae) is 38%.
The GC content of Thale Cress (Arabidopsis thaliana) is 36%.
The GC content of Plasmodium falciparum is 20%.

Modify your program to print out a classification of the organism in the file.
If the GC content is above 60%, the organism is considered “high GC content”.
If the GC content is below 40%, the organism is considered “low GC content”.
Otherwise, the organism is considered “moderate GC content”.
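
The three rules above can be sketched as a single if/elif/else chain. The function name classify_gc and its parameter are assumptions made for this example, and the sketch assumes the GC content is given as a percentage between 0 and 100.

```python
def classify_gc(gc_percent):
    """Return the GC classification for a GC content given as a percentage.

    Hypothetical helper for illustration. Values of exactly 40 or 60
    fall into the "moderate" category, following the rules above.
    """
    if gc_percent > 60:
        return "high GC content"
    elif gc_percent < 40:
        return "low GC content"
    else:
        return "moderate GC content"


# Percentages taken from the organisms listed above.
print(classify_gc(72))  # Streptomyces coelicolor A3(2)
print(classify_gc(38))  # Saccharomyces cerevisiae
print(classify_gc(20))  # Plasmodium falciparum
```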

Again, test your work. The test-small.fastq file has low GC content. We have provided four other test files, whose names explain their GC content: test-moderate-gc-1.fastq, test-moderate-gc-2.fastq, test-high-gc-1.fastq, test-high-gc-2.fastq.

After your program works for all the test files, run it on each of the 6 sample_N.fastq files. For each of the 6 files, cut-and-paste just the relevant line of output from your program into answers.txt.

Submit your work

You are almost done!

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

At the bottom of your answers.txt file, in the “Reflection” part, state how many hours you spent on this assignment. Also state what you or the staff could have done to help you with the assignment.

Submit the following files via Catalyst CollectIt (a.k.a. Dropbox):

dna_analysis.py
answers.txt