Important

Do not remove changes from earlier problems when you work on later problems. Your final program should solve all the problems.

For this assignment, you will fill in the body of one function and call it. You may add as many variables as you need to solve the problems, but you do not need to write additional functions for this assignment. For all of our assignments you should NOT use parts of Python not yet discussed in class or the course readings.

Note

You are encouraged to examine the provided function filename_to_string (in dna_analysis.py), which opens and reads data files, but you do not need to understand anything about opening and reading files in order to do this assignment.

Tip

If you are running into any issues, please take a look at the troubleshooting page. If that doesn’t solve all your issues, post on Ed or come to Office Hours.

Formatting

Warning

Not having this exact output may lead to points being deducted on your submission.

By the end of the assignment, dna_analysis.py must produce output of the exact form:

GC-content: ___
AT-content: ___
G count: ___
C count: ___
A count: ___
T count: ___
Sum of G+C+A+T counts: ___
Total count: ___
Length of nucleotides: ___
AT/GC Ratio: ___
GC Classification: ___
Is suitable for nanotech: ___

Problem 1: Remove Some Lines

Run dna_analysis.py in the terminal on test-small.fastq like you did for Problem 0. Be sure to take note of what output appears in the terminal. Then comment out the line gc_count = 0 by putting a # at the start of the line. Save the file and then run it again in the terminal.

In answers.txt, explain what happened, and why it happened.

Now, restore the line to its original state by removing the # that you added. What would happen if you commented out this line instead? nucleotides = filename_to_string(file_name)

Explain what happens and why in answers.txt.

Problem 2: Compute AT Content

Modify your program so that, in addition to computing and printing the GC ratio, it also computes and prints the AT content. The AT content is the percentage of nucleotides that are A or T.

There are two ways to compute the AT content:

  • Copy the existing loop that examines each nucleotide and modify it. You will now have two loops, one of which computes the GC count and one of which computes the AT count.
  • Add more statements into the existing loop, so that one loop computes both the GC count and the AT count.

You may use whichever approach you prefer. Check your work by manually computing the AT content for test-small.fastq, then comparing it to the output of running your program on test-small.fastq.

Run your program on sample_1.fastq. Copy and paste the relevant line of output into answers.txt.

Problem 3: Count Nucleotides

Info

At this point you should also feel free to modify the code we have given you if another structure of if statements makes more sense to you. We just caution you against looping through the data more times than you need to as this could cause your code to run very slowly.

Modify your program so that it also computes and prints the number of A nucleotides, the number of T nucleotides, the number of G nucleotides, and the number of C nucleotides.

When doing this, add at most one extra loop to your program. You can solve this part without adding any new loops at all, by reusing an existing loop.

Check your work by manually computing the results for file test-small.fastq, then comparing them to the output of running your program on test-small.fastq.

Tip

You can use the up-arrow key in the terminal window to retrieve commands you ran previously.

Run your program on sample_1.fastq. Copy and paste the relevant lines of output into answers.txt (the lines that indicate the G count, C count, A count, and T count).

Problem 4: Sanity-Check the Data

Debugging

This problem simulates a scenario that a lot of programmers encounter: when your program does not produce the results you would expect. Some reasons why this might occur is that your code contains a bug or you have an incorrect assumption about the contents of the data file. It’s hard to tell which is the reason without taking a closer look at your files. Usually this can be done by creating small snippets of code to isolate where the problem is occurring.

Modify dna_analysis.py so that it will calculate and print the following variables:

  • sum_counts: the sum of the A count, the C count, the G count, and the T count
  • total_count: the total number of nucleotides
  • len_nuc: the length of the nucleotides string variable using len(nucleotides).

Then run dna_analysis.py on each of the 11 .fastq files provided. As you run these files, you’ll notice that at least one of these quantities will be different from the other two for at least one .fastq file.

In answers.txt, state which .fastq file(s) and which quantities produce different results. Also write a short paragraph that explains why these differ.

If all the three quantities you measured are the same, then it would not matter which one you used in the denominator when computing the GC content. However, you saw that the three quantities are not all the same. In answers.txt, state which of these quantities should be used in the denominator and which should not, and why.

If your program incorrectly computed the GC content, which should be equal to

G+CA+C+G+T \frac{G+C}{A+C+G+T}

then state that fact in answers.txt. Go back and correct your program, **and also update any incorrect answers elsewhere in answers.txt. It is fine to change the code we provided you if needed.

Note

If you are unsure if you are calculating things correctly, now would be a good time to validate your dna_analysis.py program’s output using the Diff Checker (be sure that you select the entire line as it will show differences in trailing spaces as a difference). You can compare your output to the files given in the expected_output directory of the homework2 files. You have not yet completed the assignment, so your output will not be identical. But things like GC-content, AT-content, and individual counts should be identical. You will produce the last two lines of output in the expected_output files in Problem 7 and Problem 8 below.

Problem 5: Compute the AT/GC ratio

Sometimes biologists use the AT/GC ratio, defined as,

A+TG+C \frac{A+T}{G+C}

rather than the GC-content which is defined as,

G+CA+C+G+T \frac{G+C}{A+C+G+T}

Modify your program so that it also computes the AT/GC ratio. This should be calculated as: (A count + T count) / (G count + C count).

Check your work by manually computing the results for file test-small.fastq. Compare them to the output of running your program on test-small.fastq.

Run your program on sample_1.fastq. Copy and paste the relevant line(s) of output into answers.txt on the line that indicates the AT/GC ratio.

Problem 6: Categorize Organisms

The GC content can be used to categorize microorganisms. Fill in the body of the classify function to return the classification of the organism (“high”, “moderate” or “low”) described in the data file given using these classifications:

  • If the GC content is above 58%, the organism is considered “high GC content”.
  • If the GC content is below 35%, the organism is considered “low GC content”.
  • Otherwise, (the GC content is between 35% and 58% inclusively), the organism is considered “moderate GC content”.

Info

Even though the classification message that should be printed to the screen (as seen in the expected output files) will be “high GC content”, “low GC content”, or “moderate GC content”, your function should only return the values “high”, “moderate” or “low”.

Biologists can use GC content for classifying species, for determining the melting temperature of the DNA (useful for both ecology and experimentation, for example PCR is more difficult on organisms with high GC content), identifying its suitability for DNA nanotechnology, and for other purposes. Here are some examples:

  • The GC content of Streptomyces coelicolor A3(2) is 72%.
  • The GC content of Yeast (Saccharomyces cerevisiae) is 38%.
  • The GC content of Thale Cress (Arabidopsis thaliana) is 35%.
  • The GC content of Plasmodium falciparum is 20%.

Again, test that your program works on some data files with known outputs. The test-small.fastq file has low GC content. We have provided four other test files, whose names explain their GC content: test-moderate-gc-1.fastq, test-moderate-gc-2.fastq, test-high-gc-1.fastq, test-high-gc-2.fastq.

You will find starter code for the classify() function near the top of dna_analysis.py, just before where the main program begins. There is an assignment statement inside the function body that is there only as a placeholder.

# Function to return GC Classification
def classify(gc_content):
    '''
    gc_content - a number representing the GC content
    Returns a string representing GC Classification. Must return one of
    these: "low", "moderate", or "high" based on the cutoffs in the spec
    '''

    # This statement is a placeholder. For Problem 8, replace it with your
    # code (will be more than one line) that sets classification to the
    # correct value based on gc_content. (And then delete this comment.)
    classification = "high"

    return classification

The function classify takes the gc_content as an input parameter and returns an appropriate string indicating the GC Classification. Once you have filled in the body of the classify function you should call the function from your main program in the appropriate place and use the string it returns to print out a message that matches what is expected. For example, be sure that your output for test-moderate-gc-1.fastq matches test-moderate-gc-1-expected.txt exactly using Diff Checker.

After your program works for all the test files, run it on sample_1.fastq. Copy and paste just the relevant line of output from your program into answers.txt (the line that indicates the GC classification).

Problem 6a: Nanotech

Additionally, modify the nano_suitable function to return whether or not the DNA strand is suitable for use in DNA technology. The decision should be made as follows:

  • If the GC content is greater than 50% and less than 60%, nano_suitable should return True.
  • Otherwise, nano_suitable should return false.

The starter code for the nano_suitable function looks like this:

def nano_suitable(gc_content):
    '''
    gc_content - a number representing the GC content
    Returns a boolean representing if the given GC Content is suitable for
    DNA nanotechnology
    '''
    return True

Like with the classify, the expected output is shown in the expected output files.

Quality

Info

For the documentation portion of the quality guide, you should place comments for every problem.

Your assignment should pass two checks: flake8 and our code quality guidelines. The code quality guidelines are very thorough. For this assignment, the most relevant rules can be found in these sections:

Collaboration

Warning

If you discuss this assignment with one or more classmates, you must specify with whom you collaborated with in your submission. Otherwise, you may be guilty of academic misconduct and face consequences. Note that you may not collaborate in a way that is prohibited, even if you cite the collaboration. Please see the course syllabus for examples on prohibited collaboration.

At the bottom of your answers.txt file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did. Please refer to the course syllabus for more on the course collaboration policy.

Submission

Warning

Your files must be named dna_analysis.py and answers.txt in order to be graded correctly. Please remove any assert statements before submitting to gradescope.

Submit dna_analysis.py and answers.txt on Gradescope under the assignment Homework 2.

HW2 - Homework 2

Initial Submission by Tuesday 10/21 at 11:59 pm.

Submit on Gradescope