A1 - Processing

Learning objective: Apply control structures and data structures to solve problems mimicking different types of data cleanup and processing. Later, we’ll learn some real-world library functions to complete some of these tasks more efficiently!

You can find the starter code here. Make sure to extract (unzip) the contents anywhere on your computer. If you are working in VS Code, navigate to File | Open or Open Folder, then select the hw1 folder.

hw1.py is the file where you put your implementation of each problem. hw1.py is not a runnable program, so it does not use the main-method pattern.
hw1_test.py is the file where you put your own tests. The Run button executes this program.
cse163_utils.py is a helper file that has code to help you test your code.
answers.txt is a file in which you will write your answers to one of the tasks from hw1.py.
song.txt is a file which is used for testing one of the functions from hw1.py.

Useful CSE 163 Resources¶

Expectations: In hw1.py you should not use any import statements or features in Python we have not yet discussed in class, section, or HW specs. All of these problems should be solved using the fundamental constructs we’ve learned in class so far. For your testing program, you can use imports (especially to use the assert_equals function from cse163_utils).

`total` and `test_total`¶

The total function inside hw1.py and the test_total function inside hw1_test.py exist only for your reference. You aren’t required to modify either function but should use them as examples for how to structure your functions inside hw1.py and hw1_test.py, respectively.

Additionally, you should pay attention to how the comment for total mentions any special behavior that might not be obvious to someone who’s seeing the code for the first time. Keep this in mind as you document the other functions in this assessment.

`pair_up`¶

When creating and processing datasets, sometimes it’s useful to pair up identifiers with each data element. For this task, you are given some buggy code that is intended to take a set of identifiers and a set of elements and return a set of tuples in which every identifier is paired with every element. Since sets are unordered, there is no inherent ordering to the tuples in the result set.
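To make the intended result concrete, here is a minimal sketch written from the description above alone. It is not the starter code and says nothing about where that code goes wrong; the name pair_up_sketch and its parameter names are hypothetical.

```python
def pair_up_sketch(identifiers, elements):
    """
    Hypothetical illustration (not the starter code): returns a set holding
    one (identifier, element) tuple for every possible combination.
    """
    pairs = set()
    for identifier in identifiers:
        for element in elements:
            pairs.add((identifier, element))
    return pairs


# Sorting only makes the printed order predictable; the set itself is unordered.
print(sorted(pair_up_sketch({"a", "b"}, {1, 2})))
# [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

With 2 identifiers and 2 elements the result has 2 × 2 = 4 tuples, which is a quick size check to apply to any fix.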

📝 Task: Identify and correct any bugs, and then explain in answers.txt the bugs you encountered and what drew you to your specific fixes. For this task, you are already given two test cases in hw1_test.py. You do not need to write additional test cases.

Tests and documentation are already given to you for this task. Do not modify these.

`text_normalize`¶

📝 Task: Write a function text_normalize that takes a string and returns a new string that keeps only the alphabetical characters (dropping whitespace, digits, punctuation, and any other non-alphabetic characters) and converts them all to lowercase.

text_normalize("Hello") should return "hello"
text_normalize("Hello!") should return "hello"
text_normalize("heLLo tHEr3!!!") should return "hellother"

We recommend using a variable or a data structure to store all alphabetic characters, and then using this to keep only alphabetic characters in the original string.
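One way to follow this recommendation, as a sketch rather than the required solution: store the lowercase letters in a string and keep only the characters found in it. The variable names are assumptions, and the sketch treats "alphabetical" as the 26 English letters.

```python
def text_normalize(text):
    """
    Sketch: returns a lowercase copy of text containing only the letters a-z.
    Assumes "alphabetical" means the 26 English letters.
    """
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    result = ""
    for char in text.lower():
        # Keep the character only if it is one of the stored letters.
        if char in alphabet:
            result += char
    return result


print(text_normalize("heLLo tHEr3!!!"))  # hellother
```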

📝 Task: Write a test that calls the function with some inputs and compares the output of the program with the expected value using assert_equals. Include test cases for all of the examples above as well as 2 additional test cases.
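A test function for this task might be shaped like the sketch below. Two parts are stand-ins so that the snippet runs alone: assert_equals here is a local substitute for the cse163_utils helper (the expected-then-received argument order is an assumption), and text_normalize is a placeholder for your own function in hw1.py. The two extra cases are only illustrations.

```python
def assert_equals(expected, received):
    """Local stand-in for cse163_utils.assert_equals (argument order assumed)."""
    assert expected == received, f"Expected {expected!r}, got {received!r}"


def text_normalize(text):
    """Placeholder so this sketch runs alone; test your hw1.py version instead."""
    return "".join(char for char in text.lower() if "a" <= char <= "z")


def test_text_normalize():
    """Checks the spec examples plus two illustrative extra cases."""
    assert_equals("hello", text_normalize("Hello"))
    assert_equals("hello", text_normalize("Hello!"))
    assert_equals("hellother", text_normalize("heLLo tHEr3!!!"))
    # Extra cases chosen for illustration: empty input and input with no letters.
    assert_equals("", text_normalize(""))
    assert_equals("", text_normalize("123 !?"))


test_text_normalize()
```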

`average_tokens_per_line`¶

📝 Task: Write a function average_tokens_per_line that takes the name of a .txt file and returns the average number of tokens per line in the file, or 0 if the file is empty or has no tokens. (For this problem, a “token” is defined as any sequence of non-whitespace characters.) For example, if the file song.txt contains the text:

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream!

The first line has 5 tokens; the second has 4; the third has 4; and the fourth has 5. This gives an average of 18 / 4 = 4.5 tokens per line, so the call average_tokens_per_line('song.txt') should return 4.5.
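The calculation above can be sketched as follows. This is one possible approach under a stated assumption: every line in the file, including a blank one, counts toward the number of lines, which the spec does not settle. The variable names are hypothetical.

```python
def average_tokens_per_line(file_name):
    """
    Sketch: returns the average number of whitespace-separated tokens per
    line, or 0 when the file holds no tokens. Assumes blank lines still
    count as lines.
    """
    line_count = 0
    token_count = 0
    with open(file_name) as f:
        for line in f:
            line_count += 1
            # split() with no arguments splits on any run of whitespace.
            token_count += len(line.split())
    if token_count == 0:
        return 0
    return token_count / line_count
```

Checking token_count rather than line_count covers both the empty file and a file that contains only whitespace.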

📝 Task: Write a test that calls the function with some inputs and compares the output of the program with the expected value using assert_equals. Include test cases for the example above as well as 2 additional test cases.

Hint

Create new files in your workspace for each additional test case. When specifying file names, use absolute paths, such as /home/song.txt.

`reformat_date`¶

📝 Task: Write a function reformat_date which takes three strings as arguments representing a date, a current date format, and a target date format and returns a new string with the date formatted in the target format.

A date string will be some non-empty string of numbers separated by / (e.g., "3/6/1995"). Note that the numbers between the slashes can have any number of digits, but you may assume there is at least one digit for each part of the date provided.

The current and target formats will be some non-empty sequence of the characters 'D', 'M', 'Y' separated by /. You can assume the given date and the current format will match up (i.e., they will have the same number of slashes) and that any date symbol that appears in the target format also appears in the current format (i.e., if the target format contains a 'Y' the current format will also contain a 'Y').

For example, the function call reformat_date("12/31/1998", "M/D/Y", "D/M/Y") should return the string "31/12/1998". In this example, the first argument represents a date and the second argument specifies that this date is currently in the format Month/Day/Year (abbreviated "M/D/Y"). The third argument specifies that the function should return a new date in the format Day/Month/Year (abbreviated "D/M/Y").

Below are some valid/invalid examples of current and target formats. You do not need to handle invalid formats. You may assume we will never pass you an example that is invalid.

Valid: Date - "1/2/3", Current Format - "M/D/Y", Target Format - "Y/M/D". It should return "3/1/2".
Valid: Date - "0/200/4", Current Format - "Y/D/M", Target Format - "M/Y". It should return "4/0".
Valid: Date - "3/2", Current Format - "M/D", Target Format - "D". It should return "2".
Invalid: Date - "3/2", Current Format - "M/D/Y", Target Format - "Y/M/D"
- Reason: The date and the current format don’t have the same number of parts.
Invalid: Date - "3/2", Current Format - "M/D", Target Format - "Y/M/D"
- Reason: Target contains a part of a date ("Y") not in the current format ("M/D").
Invalid: Date - "1/2/3/4", Current Format - "M/D/Y/S", Target Format - "M/D"
- Reason: Current format contains a part ("S") that is not 'M', 'D', or 'Y'.
Invalid: Date - "", Current Format - "", Target Format - ""
- Reason: The date and both formats are empty strings, but each must be non-empty.

Again, you may assume we won’t pass you any of these invalid date formats. We wanted to provide this list and rationale for why they are invalid to give you a better sense of what assumptions you can make about your inputs.

Your function should be as general as possible and avoid hard-coding specific orderings of the date parts.
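One general approach, sketched here under the input assumptions stated above, is to map each format symbol to its part of the date and then look up the symbols of the target format in order. The dictionary and variable names are hypothetical, and this is only one of several reasonable designs.

```python
def reformat_date(date, current_format, target_format):
    """
    Sketch: returns date rearranged from current_format into target_format.
    Assumes valid inputs as described in the spec.
    """
    parts = date.split("/")
    symbols = current_format.split("/")
    # Map each symbol ('D', 'M', or 'Y') to the matching piece of the date.
    part_for_symbol = {}
    for i in range(len(symbols)):
        part_for_symbol[symbols[i]] = parts[i]
    # Emit the pieces in the order the target format asks for.
    result = []
    for symbol in target_format.split("/"):
        result.append(part_for_symbol[symbol])
    return "/".join(result)


print(reformat_date("0/200/4", "Y/D/M", "M/Y"))  # 4/0
```

Because nothing in the lookup depends on where 'D', 'M', or 'Y' sits, no specific ordering is hard-coded.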

`five_number_summary`¶

📝 Task: Write a function five_number_summary that takes a sorted list of at least 5 numbers and returns a tuple containing the five-number summary of the input: the input list’s (minimum, first_quartile, median, third_quartile, maximum).

Here are some tips for determining each value:

The minimum is the smallest number in the list.
The maximum is the largest number in the list.
The median is the middle value in the list. If there is an even number of values in the list, take the average of the middle two values.
The first_quartile is the median of the lower half of the data (including the minimum), and the third_quartile is the median of the upper half of the data (including the maximum).
If the list has an odd number of values, the median belongs to neither half and should be excluded from the calculations of the first and third quartiles.

Below are some examples of valid function calls that should be tested:

five_number_summary([1, 2, 3, 4, 5]) should return (1, 1.5, 3, 4.5, 5)
five_number_summary([1, 1, 1, 1, 1]) should return (1, 1, 1, 1, 1)
five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53]) should return (30, 31, 36, 45, 53)
five_number_summary([5, 13, 14, 15, 16, 17, 25]) should return (5, 13, 15, 17, 25)
five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30]) should return (5, 12.5, 15.5, 27.5, 30)
five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29]) should return (12, 13, 15.5, 26, 29)

The following examples of invalid function calls should not be tested:

five_number_summary([1]) since the input list does not have at least five numbers
five_number_summary([5, 4, 3, 2, 1]) since the input list is not sorted from least to greatest

Hint

We recommend defining a helper function to find the median of a given list!
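A sketch of how such a helper could fit in, assuming the input is sorted and has at least five numbers as the spec guarantees. The helper name median and the slicing approach are assumptions, not the required design.

```python
def median(values):
    """
    Hypothetical helper: returns the median of a non-empty sorted list,
    averaging the two middle values when the length is even.
    """
    middle = len(values) // 2
    if len(values) % 2 == 1:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2


def five_number_summary(values):
    """
    Sketch: returns (minimum, first_quartile, median, third_quartile, maximum)
    for a sorted list of at least five numbers.
    """
    half = len(values) // 2
    # For an odd length, these slices skip the middle value in both halves.
    lower = values[:half]
    upper = values[len(values) - half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


print(five_number_summary([1, 2, 3, 4, 5]))  # (1, 1.5, 3, 4.5, 5)
```

Taking the first and last len(values) // 2 items handles even and odd lengths with the same two slices.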

`num_outliers`¶

An outlier is an extreme data point that can influence the shape and distribution of numeric data. $x$ is considered an outlier if either:

$x$ is less than the first quartile minus 1.5 times the interquartile range
$x$ is greater than the third quartile plus 1.5 times the interquartile range

The interquartile range is defined as the third quartile minus the first quartile.
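As a worked illustration of these cutoffs, the sketch below takes the quartiles as arguments instead of computing them, so it shows only the comparison step. The function name count_outside_cutoffs is hypothetical, and the quartiles 13 and 17 come from the five_number_summary example for the same list.

```python
def count_outside_cutoffs(values, first_quartile, third_quartile):
    """
    Hypothetical helper: counts the values lying below
    first_quartile - 1.5 * IQR or above third_quartile + 1.5 * IQR.
    """
    iqr = third_quartile - first_quartile
    lower_cutoff = first_quartile - 1.5 * iqr
    upper_cutoff = third_quartile + 1.5 * iqr
    count = 0
    for x in values:
        if x < lower_cutoff or x > upper_cutoff:
            count += 1
    return count


# Quartiles 13 and 17 give an interquartile range of 4, so the cutoffs are
# 13 - 6 = 7 and 17 + 6 = 23; only 5 and 25 fall outside them.
print(count_outside_cutoffs([5, 13, 14, 15, 16, 17, 25], 13, 17))  # 2
```

Both comparisons are strict, so a value exactly equal to a cutoff is not counted.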

📝 Task: Write a function num_outliers that takes a sorted list of at least five numbers and returns the number of data points that would be considered outliers using your five_number_summary to calculate the first and third quartiles.

num_outliers([1, 2, 3, 4, 5]) should return 0
num_outliers([1, 99, 200, 500, 506, 507]) should return 0
num_outliers([5, 13, 14, 15, 16, 17, 25]) should return 2 (the outliers are 5 and 25)
num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101]) should return 2 (the outliers are 100 and 101)
num_outliers([8, 11, 11, 11, 11, 11]) should return 1 (the outlier is 8)

The following examples of invalid function calls should not be tested:

num_outliers([3, 3, 3]) since the input list does not have at least five numbers
num_outliers([3, 2, 1, 0, 5]) since the input list is not sorted from least to greatest

Code Quality¶

Assessment submissions should pass flake8 and follow the code quality guidelines. The code quality guidelines are very thorough. For this assessment, the most relevant rules can be found in these sections:

Naming
Documentation
Efficiency and Redundancy
- Boolean Zen
- Loop Zen
- Factoring
- Unnecessary Cases
Type Annotations

Note

Make sure to provide a descriptive file header in docstring format, not something generic like “implements functions for Processing”.

Submission¶

Submit your work by uploading the following files to Gradescope:

hw1.py
hw1_test.py
answers.txt
song.txt
Any other .txt files that you created and used when testing average_tokens_per_line.

Submit as often as you want until the deadline for the initial submission. Note that we will only grade your most recent submission.

Please make sure you are familiar with the resources and policies outlined in the syllabus and the take-home assessments page.

THA 1 - Processing

Initial Submission by Thursday 07/10 at 11:59 pm.

Submit on Gradescope
