Learning objective: Apply control structures and data structures to solve problems mimicking different types of data cleanup and processing. Later, we’ll learn some real-world library functions to complete some of these tasks more efficiently!

Setting Up

You can find the starter code for the technical component of the THA here. Make sure to extract (unzip) the contents anywhere on your computer. Alternatively, you may download the files from JupyterHub, in the take-home-assessments/THA1 folder. If you are working in VS Code, navigate to File | Open or Open Folder, then select your downloaded folder.

  • hw1.py is the file where you put your implementation of each problem. hw1.py is not a runnable program, so we don’t use the main-method pattern here.

  • hw1_test.py is the file where you put your own tests. The Run button executes this program.

  • answers.txt is a text file for you to write your solutions to the debugging problems in the technical component.

  • hw1_creative.ipynb is the Jupyter notebook where you will write your solutions for the creative component.

Useful CSE 163 Resources

Expectations: In hw1.py and hw1_creative.ipynb, you should not use any import statements or features in Python we have not yet discussed in class, section, or homework specs. All of these problems should be solved using the fundamental constructs we’ve learned in class so far. hw1_test.py requires the import hw1 statement so that you can write assert statements. Don’t remove this line!

Technical Component

The technical component of this take-home assessment will be completed in hw1.py and hw1_test.py.

total and test_total

The total function should take an int n and return the sum of the integers from 0 (inclusive) to n (inclusive). If n is negative, the function should return the value None instead. A preliminary solution has been provided for you, but it has a few bugs!

📝 Task: Identify and correct the bugs, and then explain in answers.txt the bugs you encountered and what drew you to your specific fixes. For this task, you are already given one test case in hw1_test.py. Write at least 3 new test cases to help you debug total.

Documentation is already given to you for this task. Do not modify either docstring!
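When choosing expected values for your new test cases, it can help to cross-check them independently of the code you are debugging. Below is a minimal sketch that uses the closed form n(n + 1)/2 for the sum 0 + 1 + ... + n. The name expected_total is hypothetical and not part of the starter code; it is meant only as a way to double-check expected values, not as the fix for total.

```python
def expected_total(n):
    """
    Hypothetical helper (not part of the starter code). Returns the sum of
    the integers from 0 to n inclusive using the closed form n * (n + 1) / 2,
    or None if n is negative.
    """
    if n < 0:
        return None
    return n * (n + 1) // 2


print(expected_total(0))   # 0
print(expected_total(5))   # 15, since 0 + 1 + 2 + 3 + 4 + 5 = 15
print(expected_total(-3))  # None
```

Good test cases tend to cover the boundaries in the description: n equal to 0, a small positive n, and a negative n.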

five_number_summary and median

The function five_number_summary is already defined for you. This function takes a sorted list of at least 5 numbers and returns a tuple containing the five-number summary of the input list: (minimum, first_quartile, median, third_quartile, maximum).

For reference, here is how each value is defined:

  • The minimum is the smallest number in the list.

  • The maximum is the largest number in the list.

  • The median is the middle value in the list. If there is an even number of values in the list, take the average of the middle two values.

  • The first_quartile is the median of the lower half of the data (including the minimum), and the third_quartile is the median of the upper half of the data (including the maximum).

  • The median should be excluded from the calculations of the first and third quartiles: when the list has an odd number of values, the middle value belongs to neither half.

Here are some examples of inputs and expected outputs, which you are expected to turn into assert tests in hw1_test.py.

  • five_number_summary([1, 2, 3, 4, 5]) should return (1, 1.5, 3, 4.5, 5)

  • five_number_summary([1, 1, 1, 1, 1]) should return (1, 1, 1, 1, 1)

  • five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53]) should return (30, 31, 36, 45, 53)

  • five_number_summary([5, 13, 14, 15, 16, 17, 25]) should return (5, 13, 15, 17, 25)

  • five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30]) should return (5, 12.5, 15.5, 27.5, 30)

  • five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29]) should return (12, 13, 15.5, 26, 29)
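The definitions and examples above can be sketched as follows. This is an illustrative stand-in, not the starter code: the name summary_sketch is hypothetical, and it relies on statistics.median from the standard library, which you should not import in hw1.py. It is here only to show how the lower and upper halves are chosen.

```python
import statistics


def summary_sketch(data):
    """
    Illustrative stand-in (an assumption, not the starter code). Returns
    (minimum, first_quartile, median, third_quartile, maximum) for a sorted
    list of at least 5 numbers.
    """
    half = len(data) // 2
    lower = data[:half]
    # For an odd-length list, skip the middle value so that the median is
    # excluded from both halves.
    if len(data) % 2 == 1:
        upper = data[half + 1:]
    else:
        upper = data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


assert summary_sketch([1, 2, 3, 4, 5]) == (1, 1.5, 3, 4.5, 5)
assert summary_sketch([5, 13, 14, 15, 16, 17, 25]) == (5, 13, 15, 17, 25)
```

For [1, 2, 3, 4, 5], the lower half is [1, 2] and the upper half is [4, 5]; the median 3 is in neither, which is why the quartiles are 1.5 and 4.5.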

Here are some examples of invalid function calls, which will not be tested.

  • five_number_summary([1]) since the input list does not have at least five numbers

  • five_number_summary([5, 4, 3, 2, 1]) since the input list is not sorted from least to greatest

📝 Task: The function five_number_summary is correct, but the helper function median is not. Write a test function in hw1_test.py that checks the output of median with assert statements so you can debug this function. Then, write a test function in hw1_test.py that checks for the correctness of five_number_summary, using assert statements. For the five_number_summary test function, include test cases for all of the valid examples above (there should be six of them) as well as 2 additional test cases (for a total of 8 tests).

num_outliers

An outlier is an extreme data point that can influence the shape and distribution of numeric data. $x$ is considered an outlier if either:

  • $x$ is less than the first quartile minus 1.5 times the interquartile range
  • $x$ is greater than the third quartile plus 1.5 times the interquartile range

The interquartile range is defined as the third quartile minus the first quartile.
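As a worked example of the two cutoffs, here is a small sketch. The function name outlier_fences and the word "fence" are assumptions made for illustration; they do not appear in the starter code. The quartiles 13 and 17 come from the list [5, 13, 14, 15, 16, 17, 25] in the five_number_summary examples.

```python
def outlier_fences(first_quartile, third_quartile):
    """
    Hypothetical helper for illustration. Returns a tuple
    (lower_fence, upper_fence): values strictly below lower_fence or
    strictly above upper_fence count as outliers.
    """
    iqr = third_quartile - first_quartile
    lower_fence = first_quartile - 1.5 * iqr
    upper_fence = third_quartile + 1.5 * iqr
    return (lower_fence, upper_fence)


low, high = outlier_fences(13, 17)
print(low, high)  # 7.0 23.0
```

Here the interquartile range is 17 - 13 = 4, so the cutoffs are 13 - 6 = 7 and 17 + 6 = 23. In that list, 5 is less than 7 and 25 is greater than 23, so both are outliers. A value exactly equal to a cutoff is not an outlier, since the comparisons are strict.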

📝 Task: Write a function num_outliers that takes a sorted list of at least five numbers and returns the number of data points that would be considered outliers. Use five_number_summary to calculate the first and third quartiles.

  • num_outliers([1, 2, 3, 4, 5]) should return 0

  • num_outliers([1, 99, 200, 500, 506, 507]) should return 0

  • num_outliers([5, 13, 14, 15, 16, 17, 25]) should return 2 (the outliers are 5 and 25)

  • num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101]) should return 2 (the outliers are 100 and 101)

  • num_outliers([8, 11, 11, 11, 11, 11]) should return 1 (the outlier is 8)

The following examples of invalid function calls should not be tested:

  • num_outliers([3, 3, 3]) since the input list does not have at least five numbers

  • num_outliers([3, 2, 1, 0, 5]) since the input list is not sorted from least to greatest

📝 Task: Write a test that calls the function with some inputs and compares the output of the program with the expected value using assert statements. Include test cases for all of the valid examples above (there should be five of them) as well as 2 additional test cases (for a total of 7 tests).

text_normalize

📝 Task: Write a function text_normalize that takes a string and returns a new string that keeps only the alphabetical characters (dropping whitespace, digits, punctuation, and any other non-alphabetic characters) and converts them all to lowercase.

  • text_normalize("Hello") should return "hello"

  • text_normalize("Hello!") should return "hello"

  • text_normalize("heLLo tHEr3!!!") should return "hellother"

We recommend using a variable or a data structure to store all alphabetic characters, and then using this to keep only alphabetic characters in the original string.
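To illustrate the recommended pattern without solving the task itself, here is the same idea applied to a different, hypothetical problem: keeping only the digit characters of a string. The names DIGITS and keep_digits are made up for this sketch.

```python
# All characters we want to keep, stored in one string.
DIGITS = "0123456789"


def keep_digits(text):
    """
    Hypothetical example. Takes a string text and returns a new string
    containing only the digit characters of text, in their original order.
    """
    result = ""
    for ch in text:
        # Membership test against the stored set of allowed characters.
        if ch in DIGITS:
            result += ch
    return result


print(keep_digits("tel: 555-0199"))  # 5550199
```

For text_normalize, the stored characters would be alphabetic instead, and you would also need to decide where the conversion to lowercase happens.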

📝 Task: Write a test function that calls the function with some inputs and compares the output of the program with the expected value using assert statements. Include test cases for all of the examples above as well as 2 additional test cases (for a total of 5 tests).

Creative Component: Even More Processing

The creative component will be completed in hw1_creative.ipynb.

The coding tasks in the technical component simulated the processes behind some common data processing problems, such as finding aggregate measures over numeric data or manipulating textual data. What are other tasks that you think might be useful for data processing?

Here are some general ideas; you may expand on them or come up with your own! Note that if you want to write a task and solution for any of these ideas, make sure that your task description is more specific than what we have here!

  1. Suppose we have an input string representing a date, like "January 1st, 2000". How would you convert this to a different date format, such as "1/1/2000" or "1st January, 2000"?
  2. Suppose we are given a poem in a .txt file. How might you find the average number of words per line in the poem?
  3. Suppose we have a list of x-coordinates, and a list of y-coordinates. How would you represent a list of xy coordinate pairs, such as (0, 0)?

Requirements

You may use the above ideas as inspiration, but they are written vaguely on purpose. Your coding tasks should be original problems that you write and design yourself. Across both of your coding task solutions, include at least 4 of the following topics:

  • String manipulation
  • File I/O
  • Lists
  • Conditionals
  • Loops (while or for)
  • set
  • tuple
  • dict

For example, maybe one of your tasks involves loops and sets, while the other involves string manipulation and file I/O. You may use the same topic in both functions, but it will only count once! So, if you used loops and sets in Task 1 and loops and tuples in Task 2, this would only count as 3 distinct topics.

You are free to choose which topics you incorporate into your coding tasks, but your solution must be presented as a fully documented and tested function. This means that you are also responsible for making sure your tests are appropriate for the function you write!
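As a rough picture of what "fully documented and tested" could look like, here is a minimal sketch built around a made-up task (counting vowels). The task, the function names, and the chosen test cases are all hypothetical; your own tasks should be original and should not reuse this one.

```python
def count_vowels(text):
    """
    Takes a string text and returns the number of vowels (a, e, i, o, u)
    it contains, ignoring case. Returns 0 for the empty string.
    """
    count = 0
    for ch in text.lower():
        if ch in "aeiou":
            count += 1
    return count


def test_count_vowels():
    """
    Tests the function count_vowels.
    """
    assert count_vowels("") == 0        # empty input
    assert count_vowels("xyz") == 0     # no vowels at all
    assert count_vowels("Data") == 2    # mixed case
    assert count_vowels("AEIOU") == 5   # all uppercase vowels


test_count_vowels()
```

The docstring states what the function takes, what it returns, and how it handles the edge case; the tests cover that edge case as well as typical inputs.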

Make sure to include at least 1 topic per function; avoid putting all 4 required topics in a single function!

Code Quality

Assessment submissions should pass these checks: flake8 and the code quality guidelines. The code quality guidelines are very thorough. For this assessment, the most relevant rules can be found in these sections:

Note

Make sure to provide a descriptive file header in docstring format, not something generic like “implements functions for Processing”. Consider: what do these functions have in common? What is the common theme of this assignment?

Submission

Submit your work by uploading the following files to Gradescope:

  • hw1.py
  • hw1_test.py
  • hw1_creative.ipynb
  • answers.txt
  • Any other .txt files that you may have used when writing or testing hw1_creative.ipynb

You will receive a warning from Gradescope if any file is missing! Submit as often as you want until the deadline for the initial submission. Note that we will only grade your most recent submission.

Please make sure you are familiar with the resources and policies outlined in the syllabus and the take-home assessments page.

THA 1 - Processing

Initial Submission by Thursday 01/15 at 11:59 pm.
