In this homework, we will write a program to analyze data in a real-world scenario.

By the end of this assignment, students will feel more comfortable:

  1. Writing Python code using loops, conditionals, functions, and string manipulation.

  2. Running Python programs with command line arguments from the JupyterHub console.

  3. Interpreting program specifications involving a complicated real-world scenario.

Background

DNA is made up of a sequence of nucleotides. Each nucleotide is adenine (A), cytosine (C), guanine (G), or thymine (T). You will use, modify, and extend a program to compute the GC content of DNA data: the percentage of nucleotides that are either G or C. What can we do with GC content?

Here are the first eight lines of one of our sample DNA data files:

sample_6.fastq
@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG
+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

The nucleotide data is in the second line, the sixth line, the tenth line, etc. To calculate GC content, you will be looking for the percentage of letters appearing on these lines that are G or C. Your program will not use the rest of the file, which provides information about the sequencer and the sequencing process that created the nucleotide data.
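The "second line, sixth line, tenth line" pattern can be expressed as a slice over a list of lines. The sketch below is for illustration only: the helper name nucleotide_lines is hypothetical, and the provided dna_analysis.py may read the file differently.

```python
def nucleotide_lines(lines):
    """Return the lines that hold nucleotide data: the 2nd, 6th, 10th, ..."""
    # Index 1 is the second line; a step of 4 skips the other three
    # lines of each four-line record.
    return lines[1::4]


# The first eight lines of sample_6.fastq, as shown above.
sample = [
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
    "@SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG",
    "+SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745",
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh",
]

for line in nucleotide_lines(sample):
    print(line)
```

Running this prints only the two nucleotide lines and skips the sequencer information.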

Command line arguments

For this homework assignment, we will specify a command line argument when we run our Python file. This acts as an input to our Python program, allowing us to specify the name of the data file we want to read from.
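As a rough sketch of how a program can receive a command line argument: Python stores the arguments in the list sys.argv, with the program name first. The helper file_name_from_args below is hypothetical and takes the list as a parameter so that it can be tried without running from the command line. The path data/sample_6.fastq is only an example value.

```python
def file_name_from_args(argv):
    """Return the data file name given as the first command line argument."""
    # argv[0] is the program name, so the data file name is argv[1].
    if len(argv) < 2:
        raise ValueError("expected a data file name as an argument")
    return argv[1]


# In a real program, the list passed in would be sys.argv.
example_argv = ["dna_analysis.py", "data/sample_6.fastq"]
print(file_name_from_args(example_argv))  # prints data/sample_6.fastq
```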

When writing code that analyzes data, it is important to test your program so that you can be confident the output is correct. One way to do this is by comparing the output of your code to output produced in some other way, such as by hand or by a different program. We have provided a small test file for this purpose: test-small.fastq. This file is small enough that you can easily read it and calculate the GC content by hand. Then, you can use this file as input to your program to verify that it provides the correct answer for this file.

test-small.fastq
ignore this line
ATCAGAACTA
ignore this line
ignore this line

From an attached Python console, run dna_analysis.py on this sample data file by entering the following command, which specifies the relative file path to test-small.fastq:

%run dna_analysis.py data/test-small.fastq

The program should print:

GC-content: 0.3
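The hand calculation for test-small.fastq can be reproduced in a few lines. This is a minimal sketch and not the provided program: it hard-codes the nucleotide string instead of reading the file, and it divides by the string length, which works here only because every character in this string is A, C, G, or T.

```python
nucleotides = "ATCAGAACTA"  # the second line of test-small.fastq

gc_count = 0
for base in nucleotides:
    if base == "G" or base == "C":
        gc_count += 1

# 1 G and 2 C out of 10 nucleotides
gc_content = gc_count / len(nucleotides)
print("GC-content:", gc_content)  # prints GC-content: 0.3
```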

Running test files

Now, try running the DNA analysis on each of the 6 other sample data files provided by executing 6 commands such as:

%run dna_analysis.py data/sample_1.fastq

Run your program on different data files by changing sample_1.fastq to a different sample data file name in the command above. It might take a minute or so to run since these are large data files.

We have provided expected output files for the other test-*.fastq files in the expected_output directory.

Formatting

By the end of the assignment, dna_analysis.py must produce output of the exact form:

GC-content: ____
AT-content: ____
G count: ____
C count: ____
A count: ____
T count: ____
Sum of G+C+A+T counts: ____
Total count: ____
Length of nucleotides: ____
AT/GC Ratio: ____
GC Classification: ____
Is suitable for nanotech: ____

Problem 1: Remove some lines

Run dna_analysis.py on test-small.fastq like you did for Problem 0. Be sure to take note of what output appears in the terminal. Then comment out the line gc_count = 0 by putting a # at the start of the line. Save the file and then run it again in the terminal. In answers.txt, explain what happened, and why it happened.

Now, restore the line to its original state by removing the # that you added. What would happen if you commented out the line nucleotides = filename_to_string(file_name) instead? Explain what happens and why in answers.txt.

Problem 2: Compute AT content

Modify your program so that, in addition to computing and printing the GC content, it also computes and prints the AT content: the percentage of nucleotides that are A or T. There are two ways to compute the AT content:

  1. Copy the existing loop that examines each nucleotide and modify it. You will now have two loops, one of which computes the GC count and one of which computes the AT count.

  2. Add more statements into the existing loop, so that one loop computes both the GC count and the AT count.

You may use whichever approach you prefer. Check your work by manually computing the AT content for test-small.fastq before comparing it to the output of running your program on test-small.fastq. Run your program on sample_1.fastq. Copy and paste the relevant line of output into answers.txt.
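The second approach (one loop that updates both counts) might look like the sketch below. It uses the test-small.fastq nucleotides as a hard-coded string. The variable names are assumptions and may not match the ones in the provided dna_analysis.py.

```python
nucleotides = "ATCAGAACTA"  # the second line of test-small.fastq

gc_count = 0
at_count = 0
for base in nucleotides:
    if base == "G" or base == "C":
        gc_count += 1
    elif base == "A" or base == "T":
        at_count += 1

# Every character in this string is A, C, G, or T, so the length
# of the string is a safe denominator for this example.
at_content = at_count / len(nucleotides)
print("AT-content:", at_content)  # prints AT-content: 0.7
```

The first approach would produce the same numbers with a second, separate loop that only updates at_count.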

Problem 3: Count nucleotides

Modify your program so that it also computes and prints the number of A nucleotides, the number of T nucleotides, the number of G nucleotides, and the number of C nucleotides. When doing this, add at most one extra loop to your program. You can solve this part without adding any new loops at all by reusing an existing loop.

Check your work by manually computing the results for file test-small.fastq before comparing them to the output of running your program on test-small.fastq. Run your program on sample_1.fastq. Copy and paste the relevant lines of output into answers.txt (the lines that indicate the G count, C count, A count, and T count).
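One way to double-check a manual count is to compare loop-based counters against Python's built-in str.count method. The sketch below assumes four counter variables with illustrative names and runs on the test-small.fastq nucleotides only.

```python
nucleotides = "ATCAGAACTA"  # the second line of test-small.fastq

g_count = 0
c_count = 0
a_count = 0
t_count = 0
for base in nucleotides:
    if base == "G":
        g_count += 1
    elif base == "C":
        c_count += 1
    elif base == "A":
        a_count += 1
    elif base == "T":
        t_count += 1

loop_counts = {"G": g_count, "C": c_count, "A": a_count, "T": t_count}

# Independent check: str.count returns how many times a substring occurs.
check_counts = {base: nucleotides.count(base) for base in "GCAT"}

print(loop_counts)  # prints {'G': 1, 'C': 2, 'A': 5, 'T': 2}
print(loop_counts == check_counts)  # prints True
```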

Problem 4: Check the data

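Modify dna_analysis.py so that it also calculates and prints three quantities, matching the Sum of G+C+A+T counts, Total count, and Length of nucleotides lines of the required output.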

Then run dna_analysis.py on each of the 11 .fastq files provided. As you run these files, you’ll notice that at least one of these quantities will be different from the other two for at least one .fastq file. In answers.txt, state which .fastq file(s) and which quantities produce different results. Also write a short paragraph that explains why these differ.

If all three quantities you measured were the same, it would not matter which one you used in the denominator when computing the GC content. However, you saw that the three quantities are not all the same. In answers.txt, state which of these quantities should be used in the denominator and which should not, and why.

If your program incorrectly computed the GC content, which should be equal to $\frac{G+C}{A+C+G+T}$, then state that fact in answers.txt. Go back and correct your program, and also update any incorrect answers elsewhere in answers.txt. It is fine to change the code we provided you if needed.
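The GC content formula above can be written as a small function of the four counts. This is a sketch only: the function name gc_content_from_counts is hypothetical, and it does not handle the case where all four counts are zero.

```python
def gc_content_from_counts(g_count, c_count, a_count, t_count):
    """Return (G + C) / (A + C + G + T) for the given nucleotide counts."""
    total = a_count + c_count + g_count + t_count
    return (g_count + c_count) / total


# Counts for test-small.fastq: 1 G, 2 C, 5 A, 2 T.
print(gc_content_from_counts(1, 2, 5, 2))  # prints 0.3
```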

Problem 5: Compute the AT/GC ratio

Sometimes, biologists use the AT/GC ratio, which is defined as $\frac{A+T}{G+C}$. Modify your program so that it also computes the AT/GC ratio. Check your work by manually computing the results for file test-small.fastq. Compare them to the output of running your program on test-small.fastq.

Run your program on sample_1.fastq. Copy and paste the relevant line(s) of output into answers.txt on the line that indicates the AT/GC ratio.
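For the manual check on test-small.fastq, the ratio works out to (5 + 2) / (1 + 2) = 7 / 3. A minimal sketch, assuming a hypothetical helper named at_gc_ratio:

```python
def at_gc_ratio(a_count, t_count, g_count, c_count):
    """Return (A + T) / (G + C) for the given nucleotide counts."""
    return (a_count + t_count) / (g_count + c_count)


# Counts for test-small.fastq: 5 A, 2 T, 1 G, 2 C.
ratio = at_gc_ratio(5, 2, 1, 2)
# Rounding here is only to keep the example output short.
print(round(ratio, 3))  # prints 2.333
```

The sketch does not guard against a file with no G or C nucleotides, where the denominator would be zero.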

Problem 6: Categorize organisms

GC content can be used to categorize microorganisms. Complete the classify function, which should return a string "high", "medium", or "low" based on the organism’s GC content:

Biologists can use GC content for classifying species, for determining the melting temperature of DNA, for identifying suitability for DNA nanotechnology, etc. Here are some examples:

Test that your program works on some data files with known outputs. The test-small.fastq file has low GC content. We have provided four other test files whose names explain their GC content: test-medium-gc-1.fastq, test-medium-gc-2.fastq, test-high-gc-1.fastq, test-high-gc-2.fastq.

The classify function appears near the top of dna_analysis.py, just before where the main program begins. Replace the assignment statement at the top of the function.

dna_analysis.py
def classify(gc_content):
    """
    Returns a string representing GC content classification: "low", "moderate", or "high".

    gc_content: a number representing the GC content
    """

    # This statement is a placeholder. Replace it with your code (more than one line) that sets
    # classification to the correct value based on gc_content. Then, delete this comment.
    classification = "high"

    # YOUR CODE GOES HERE

    return classification

Once you have filled in the body of the classify function, call the function from your main program in the appropriate place and use the string it returns to print out a message that matches what is expected. For example, ensure your output for test-medium-gc-1.fastq matches test-medium-gc-1-expected.txt exactly using Diffchecker.
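As an alternative to pasting text into Diffchecker, an exact comparison can also be done in Python with the standard difflib module. The sketch below is an assumption about one possible workflow, not part of the assignment. The helper name show_differences and the two example strings are made up for illustration.

```python
import difflib


def show_differences(actual, expected):
    """Return unified diff lines between expected and actual output text."""
    expected_lines = expected.splitlines(keepends=True)
    actual_lines = actual.splitlines(keepends=True)
    return list(difflib.unified_diff(
        expected_lines, actual_lines, fromfile="expected", tofile="actual"))


# Made-up strings: the second differs by one extra character.
same = show_differences("GC-content: 0.3\n", "GC-content: 0.3\n")
different = show_differences("GC-content: 0.30\n", "GC-content: 0.3\n")

print(len(same))  # prints 0, meaning the texts match exactly
print("".join(different))
```

An empty result means the two texts match character for character. Any other result lists the lines that differ.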

After your program works for all the test files, run it on sample_1.fastq. Copy and paste just the relevant line of output from your program into answers.txt (the line that indicates the GC classification).

Problem 6a: Nanotech

Modify the nano_suitable function to return whether or not the GC content is suitable for use in DNA nanotechnology. The decision should be made as follows:

The nano_suitable function appears underneath the classify function, just before where the main program begins.

dna_analysis.py
def nano_suitable(gc_content):
    """
    Returns a boolean representing if the given GC content is suitable for DNA nanotechnology.

    gc_content: a number representing the GC content
    """
    # YOUR CODE GOES HERE
    return True

As with classify, check the expected_output files with Diffchecker.

Code quality

Run our linter (automated code style checker) in the Python console with the expression !flake8. Address every reported issue by editing the file and saving your changes, then run the linter again. When there are no linting issues left to report, !flake8 prints nothing.

!flake8

Then, review our style guide, paying particular attention to:

Collaboration

If you discuss an assignment with one or more classmates, you must specify with whom you collaborated in a comment at the bottom of your submission. You may discuss with as many classmates as you like, but you must cite all of them in your work. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration.

At the bottom of both your dna_analysis.py and answers.txt files, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.

Submission

Submit dna_analysis.py and answers.txt on Gradescope under the assignment Homework: DNA Analysis.