Setup - CSE 160

Problem 0: Running the Test Files

Command Line Arguments

For this homework assignment, we will specify a command line argument when we run our Python file. This acts as an input to our Python program, allowing us to specify the name of the data file we want to read from.

By providing the file name as an argument, we can easily run our code on different data files. The code to perform this task is written for you in utils.py and then called on lines 51 and 54 of the dna_analysis.py file, which you’ll be working on. To make sure you can keep using the file input code, do not modify any code in the utils.py file or lines 16, 51, or 54 of the dna_analysis.py file. Instead, specify the data file you want to input as a command line argument (more details about this below).

When writing code that analyzes data, it is important to test your program so that you can be confident the output is correct. One way to do this is to compare the output of your code to output produced in some other way, such as by hand or by a different program. We have provided a small test file for this purpose: test-small.fastq. This file is small enough that you can easily read it and calculate the GC content by hand. You can then use it as input to your program to verify that the program gives the correct answer. You can open test-small.fastq and other input and output files in Jupyter by clicking on them in the Explorer window.

You can run dna_analysis.py on sample data files by entering the following command in the terminal:

python dna_analysis.py data/test-small.fastq

The first two parts (python dna_analysis.py) are familiar to us, while the last part (data/test-small.fastq) specifies the file path to the data file, pointing Python to the location of the file it is supposed to read.
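To illustrate how a Python program can receive a file path like the one in the command above, here is a minimal sketch using the standard `sys.argv` list. It is not the code in utils.py (which you should leave unmodified); `get_data_path` is a hypothetical helper introduced only for this example, and the usage message is an assumption.

```python
import sys


def get_data_path(argv):
    """Return the data file path from a list shaped like sys.argv.

    Hypothetical helper for illustration; the assignment's real
    file-input code lives in utils.py and may work differently.
    """
    # argv[0] is the script name; argv[1] is the first command line argument.
    if len(argv) != 2:
        raise SystemExit("usage: python dna_analysis.py <path-to-fastq-file>")
    return argv[1]


# Simulate the argument list Python builds for:
#   python dna_analysis.py data/test-small.fastq
example_argv = ["dna_analysis.py", "data/test-small.fastq"]
print(get_data_path(example_argv))  # prints data/test-small.fastq
```

In a real script you would pass `sys.argv` itself instead of `example_argv`.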

When you run the above command (make sure your terminal is in the homework2 folder), you should see the output “GC-content: 0.3”, as in this screenshot:

[Screenshot: Test Output]

You should confirm that dna_analysis.py is outputting the correct GC content for test-small.fastq by looking at test-small.fastq in Jupyter and calculating the GC content by hand. Remember that GC content is the proportion of nucleotides that are either G or C. The program reports it as a fraction, so an output of 0.3 corresponds to 30%.

[Screenshot: Test Fastq]
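The by-hand calculation can be mirrored in a few lines of Python. The sketch below assumes that every character in the sequence counts as one nucleotide, and it uses a made-up sequence, not the contents of test-small.fastq. `gc_fraction` is a hypothetical name and is not part of dna_analysis.py.

```python
def gc_fraction(sequence):
    """Return the fraction of characters in `sequence` that are G or C.

    Assumption: every character counts as one nucleotide.
    """
    if not sequence:
        raise ValueError("sequence is empty")
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


# Made-up 10-nucleotide sequence, NOT taken from test-small.fastq.
example = "ATGCATATTA"

# One G and one C out of 10 nucleotides gives 2 / 10.
print(gc_fraction(example))  # prints 0.2
```

Counting the G and C characters yourself and dividing by the sequence length should give the same number as the function.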

Using Different Test Files

After you have confirmed that dna_analysis.py works on test-small.fastq, run it on each of the 6 real data files provided. These are named sample_N.fastq, where N is a number from 1 to 6, so you will execute 6 commands such as:

python dna_analysis.py data/sample_1.fastq

Run your program on different data files by changing sample_1.fastq to a different file name in the command above. Be patient — you are processing a lot of data, and it might take a minute or so to run. Many of these .fastq files are large enough that they are not easily opened in Jupyter or another text editor.
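If you would rather not type all six commands, one option is a small helper script. The sketch below is only a convenience and not part of the assignment: `sample_commands` and `run_all` are hypothetical names, and it assumes that `python` is on your PATH and that the script is run from the homework2 folder. As written it only prints the six commands.

```python
import subprocess


def sample_commands(count=6):
    """Build one command (as an argument list) per sample_N.fastq file."""
    return [
        ["python", "dna_analysis.py", f"data/sample_{n}.fastq"]
        for n in range(1, count + 1)
    ]


def run_all(commands):
    """Run each command in turn, stopping if one exits with an error."""
    for command in commands:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)


# Print the commands so they can be checked before running anything.
for command in sample_commands():
    print(" ".join(command))
```

To actually run the analyses, call `run_all(sample_commands())` from the homework2 folder.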

If you are interested, sample_3.fastq and sample_4.fastq are from Streptococcus pneumoniae TIGR4, and sample_5.fastq is from Human herpesvirus 5.

We have provided expected output files for the other test-*.fastq files in the expected_output directory:

[Screenshot: Expected output]

If your GC-content output does not match the expected output exactly, that’s OK for now. You won’t get the exact answer until you finish Problem 4 in the programming part.

Now you’re ready to continue to the Programming section!