CSE 163, Spring 2020: Homework 2: Processing CSV Data

Overview

In this assignment, you will use the foundational Python skills you've been developing and apply them to analyze a small dataset. Many of the datasets you'll work with are stored as CSV files or in some other tabular representation. This assignment is an introduction to reading, processing, and grouping rows and columns to calculate some interesting statistics. A strong foundation in these skills will be very useful when we work with much larger (and less complete) real-world datasets!

This assignment is broken into two main parts, where each part mostly does the same computations in different ways. This is to give you the opportunity to compare/contrast different approaches to solving problems.

Learning Objectives

After this homework, students will be able to:

  • Follow a Python development workflow for this course, including:
    • Writing a Python script from scratch and turning in the assignment.
    • Use the course infrastructure (flake8, test suites, course resources).
  • Use Python to review CS1 programming concepts and implement programs that follow a specification, including:
    • Use/manipulation of various data types including numbers and strings.
    • Control structures (conditional statements, loops, parameters, returns) to implement basic functions provided by a specification.
    • Basic text file processing.
    • Documenting code.
  • Write unit tests to test each function written including their edge cases.
  • Work with data structures (lists, sets, dictionaries) in Python
  • Process structured data in Python with CSV files as input with and without a library (Pandas)
    • Handle edge cases appropriately, including addressing missing values/data
    • Practice user-friendly error-handling
  • Apply programming to identify and investigate a question on a dataset using basic statistical concepts (e.g. mean, max)

Expectations

Here are some baseline expectations you should meet:

Files

If you are developing on Ed, all the files are there. If you are developing locally, you should download the starter code hw2.zip and open it as the project in Visual Studio Code. The files included are:

  • hw2_manual.py: The file for you to put solutions to Part 0.
  • hw2_pandas.py: The file for you to put solutions to Part 1.
  • hw2_test.py: The file for you to put your tests for Part 0 and Part 1.
  • cse163_utils.py: A file where we will store utility functions for helping you write tests.
  • run_hw2.py: A client program provided to call your functions. This is just for your convenience.
  • pokemon_box.csv: A CSV file that stores information about Pokemon. The columns of this file are explained below.
  • pokemon_test.csv: A very small CSV file that stores information about Pokemon. The columns of this file are explained below.

Data

For this assignment, you will be working with a dataset of Pokemon that you have caught on your Pokemon journey so far. The file pokemon_box.csv stores all the data about the captured Pokemon and has a format that looks like:

id   name       level  personality  type   weakness  atk  def  hp   stage
1    Bulbasaur  12     Jolly        Grass  Fire      45   50   112  1
...  ...        ...    ...          ...    ...       ...  ...  ...  ...

Note that because this is a CSV file, the file contents have these cells separated by commas.

Column Descriptions

  • id: Unique identification number corresponding to the species of a Pokemon. Note that if there are multiple Pokemon of the same species in the dataset, they all share the same id.
  • name: Name of the species of Pokemon. For example Pikachu.
  • level: The level of this Pokemon (an integer)
  • personality: A one-word string describing the personality of this Pokemon
  • type: A one-word string describing the type of the Pokemon (e.g. "Grass" for Bulbasaur)
  • weakness: What type this Pokemon is weak to. For example, Bulbasaur is considered weak to the fire type.
  • atk, def, hp: Pokemon stats that indicate how many hits a Pokemon can take (hp), how strong its attacks are (atk), and how much hits affect it (def)
  • stage: Indicates if this Pokemon has evolved into a new species. For example, Charmander (stage 1) evolves into Charmeleon (stage 2), which evolves into Charizard (stage 3).

Part 0

In this part of the homework, you will write code to perform various analytical operations on data parsed from a file into a list-of-dictionaries representation.

For this step of the assignment, you will be implementing various functions to answer questions about the dataset.

Each function should take the list returned by the cse163_utils.parse function (provided for you) as the first argument, along with any other arguments specified in each problem. For example, for the third function, we would call filter_range(data, 1, 10) where data was the list returned by cse163_utils.parse.

This data structure should not be modified by any function you write. Every problem that deals with strings should be case-sensitive (this means "chArIzard" is a different species than "Charizard"). You may make the following assumptions about the inputs:

  • You may assume the given list is non-empty for all functions you implement.
  • For each problem, you may assume we pass parameters of the expected types described for that problem and that those parameters are not None.
  • You should make no other assumptions about the parameters or the data.

For each of the problems, we will use the file pokemon_test.csv to show what should be returned.

id,name,level,personality,type,weakness,atk,def,hp,stage
59,Arcanine,35,impish,fire,water,50,55,90,2
59,Arcanine,35,gentle,fire,water,45,60,80,2
121,Starmie,67,sassy,water,electric,174,56,113,2
131,Lapras,72,lax,water,electric,107,113,29,1
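
To make the list-of-dictionaries representation concrete, here is a minimal sketch that builds one from the contents of pokemon_test.csv. The helper name parse_text and the choice to convert the numeric columns to int are assumptions made for this illustration; the provided cse163_utils.parse may behave differently. Note also that this sketch imports csv and io, which you may not do in hw2_manual.py.

```python
import csv
import io

# Contents of pokemon_test.csv, inlined so the example is self-contained.
CSV_TEXT = """id,name,level,personality,type,weakness,atk,def,hp,stage
59,Arcanine,35,impish,fire,water,50,55,90,2
59,Arcanine,35,gentle,fire,water,45,60,80,2
121,Starmie,67,sassy,water,electric,174,56,113,2
131,Lapras,72,lax,water,electric,107,113,29,1
"""

NUMERIC_COLUMNS = ('id', 'level', 'atk', 'def', 'hp', 'stage')


def parse_text(text):
    """Hypothetical stand-in for cse163_utils.parse: one dict per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        for column in NUMERIC_COLUMNS:
            # Assumption: numeric columns are stored as ints, not strings.
            row[column] = int(row[column])
        rows.append(row)
    return rows


data = parse_text(CSV_TEXT)
print(data[0])  # the dictionary for the first Arcanine row
```

Each row of the file becomes one dictionary whose keys are the column names, and the rows keep the order they have in the file.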

Part 0 Expectations

  • All functions for this part of the assignment should be written in hw2_manual.py
  • For this part of the assignment, you may import the math module, but you may not use any other imports to solve these problems.

Problem 1: species_count

Write a function species_count that returns the number of unique Pokemon species (determined by the name attribute) found in the dataset. You may assume that the data is well formatted in the sense that you don't have to transform any values in the name column.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

species_count(data)  # 3

Problem 2: max_level

Write a function max_level that finds the Pokemon with the max level and returns a tuple of length 2, where the first element is the name of the Pokemon and the second is its level. If there is a tie, the Pokemon that appears earlier in the file should be returned.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

max_level(data)  # ('Lapras', 72)
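
The tie-breaking rule is easy to get wrong, so here is a small sketch of one way to keep the earliest row among ties. The helper name first_max_by and the rows below are hypothetical (the second level-72 row is made up to force a tie); this is not the required implementation.

```python
def first_max_by(rows, key):
    """Returns the first row whose value for key is the largest."""
    best = rows[0]
    for row in rows[1:]:
        # Strict > means a later row that only ties never replaces best.
        if row[key] > best[key]:
            best = row
    return best


rows = [
    {'name': 'Arcanine', 'level': 35},
    {'name': 'Lapras', 'level': 72},
    {'name': 'Starmie', 'level': 72},  # hypothetical tie with Lapras
]
winner = first_max_by(rows, 'level')
print((winner['name'], winner['level']))  # ('Lapras', 72)
```

Using >= in the comparison instead would return the last of the tied rows, which is not what the problem asks for.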

Problem 3: filter_range

Write a function called filter_range that takes as arguments a smallest (inclusive) and largest (exclusive) level value and returns a list of Pokemon names having a level within that range. The list should return the species names in the same order that they appear in the provided list of dictionaries.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

filter_range(data, 30, 70)  # ['Arcanine', 'Arcanine', 'Starmie']
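
To illustrate what "inclusive" and "exclusive" mean for the bounds, here is a hedged sketch on a few hand-written rows. The helper name names_in_level_range is hypothetical and the rows only carry the two columns the example needs.

```python
def names_in_level_range(rows, low, high):
    """Returns names of rows with low <= level < high, in row order."""
    names = []
    for row in rows:
        # Chained comparison: low is included, high is excluded.
        if low <= row['level'] < high:
            names.append(row['name'])
    return names


rows = [
    {'name': 'Arcanine', 'level': 35},
    {'name': 'Arcanine', 'level': 35},
    {'name': 'Starmie', 'level': 67},
    {'name': 'Lapras', 'level': 72},
]
print(names_in_level_range(rows, 35, 72))
# ['Arcanine', 'Arcanine', 'Starmie']
```

Level 35 is kept because the lower bound is inclusive, while level 72 is dropped because the upper bound is exclusive.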

Problem 4: mean_attack_for_type

Write a function called mean_attack_for_type that takes a Pokemon type (string) as an argument and that returns the average attack stat for all the Pokemon in the dataset with that type.

If there are no Pokemon of the given type, this function should return None.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

mean_attack_for_type(data, 'fire')  # 47.5
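
The "no Pokemon of the given type" case deserves attention because dividing by a count of zero would crash. A minimal sketch of one way to guard against it, using a hypothetical helper mean_where and hand-written rows:

```python
def mean_where(rows, match_key, match_value, value_key):
    """Returns the mean of value_key over rows where match_key equals
    match_value, or None if no row matches."""
    total = 0
    count = 0
    for row in rows:
        if row[match_key] == match_value:  # case-sensitive comparison
            total += row[value_key]
            count += 1
    if count == 0:
        return None  # avoids dividing by zero
    return total / count


rows = [
    {'type': 'fire', 'atk': 50},
    {'type': 'fire', 'atk': 45},
    {'type': 'water', 'atk': 174},
]
print(mean_where(rows, 'type', 'fire', 'atk'))   # 47.5
print(mean_where(rows, 'type', 'grass', 'atk'))  # None
```

Because string comparisons are case-sensitive, asking for 'Fire' on these rows would also give None.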

Problem 5: count_types

Write a function called count_types that returns a dictionary with keys that are Pokemon types and values that are the number of times that type appears in the dataset.

The order of the keys in the returned dictionary does not matter. In terms of efficiency, your solution should NOT iterate over the whole dataset once for each type of Pokemon since that would be overly inefficient.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

count_types(data)  # {'water': 2, 'fire': 2}
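
The efficiency note above is about avoiding one full pass over the data per type. As a sketch of the single-pass idea (the helper name count_by is hypothetical, and this is only one of several reasonable approaches), a dictionary can be updated as each row is visited:

```python
def count_by(rows, key):
    """Counts how often each value of key appears, in a single pass."""
    counts = {}
    for row in rows:
        value = row[key]
        # get returns 0 the first time a value is seen.
        counts[value] = counts.get(value, 0) + 1
    return counts


rows = [
    {'name': 'Arcanine', 'type': 'fire'},
    {'name': 'Arcanine', 'type': 'fire'},
    {'name': 'Starmie', 'type': 'water'},
    {'name': 'Lapras', 'type': 'water'},
]
print(count_by(rows, 'type') == {'fire': 2, 'water': 2})  # True
```

The inefficient alternative would first collect the set of types and then loop over every row again for each type.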

Problem 6: highest_stage_per_type

Write a function called highest_stage_per_type that calculates the largest stage reached for each type of Pokemon in the dataset. This function should return a dictionary that has keys that are the Pokemon types and values that are the highest value of stage column for that type of Pokemon.

The order of the keys in the returned dictionary does not matter. In terms of efficiency, your solution should NOT iterate over the whole dataset once for each type of Pokemon since that would be overly inefficient.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

highest_stage_per_type(data)  # {'water': 2, 'fire': 2}

Problem 7: mean_attack_per_type

Write a function called mean_attack_per_type that calculates the average attack for every type of Pokemon in the dataset. This function should return a dictionary that has keys that are the Pokemon types and values that are the average attack for that Pokemon type.

The order of the keys in the returned dictionary does not matter. In terms of efficiency, your solution should NOT iterate over the whole dataset once for each type of Pokemon since that would be overly inefficient.

For example, assuming we have parsed pokemon_test.csv and stored it in a variable called data:

mean_attack_per_type(data)  # {'water': 140.5, 'fire': 47.5}
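
A per-group average can also be computed without re-scanning the dataset for each group. The sketch below, with the hypothetical helper mean_by_group, keeps a running total and a running count per group and only divides at the end; it is an illustration of the idea, not the required solution.

```python
def mean_by_group(rows, group_key, value_key):
    """Returns a dict from each group to the mean of value_key."""
    totals = {}
    counts = {}
    for row in rows:  # the only pass over the dataset
        group = row[group_key]
        totals[group] = totals.get(group, 0) + row[value_key]
        counts[group] = counts.get(group, 0) + 1
    means = {}
    for group in totals:  # loops over groups, not over rows
        means[group] = totals[group] / counts[group]
    return means


rows = [
    {'type': 'fire', 'atk': 50},
    {'type': 'fire', 'atk': 45},
    {'type': 'water', 'atk': 174},
    {'type': 'water', 'atk': 107},
]
print(mean_by_group(rows, 'type', 'atk') == {'fire': 47.5, 'water': 140.5})
# True
```

The second loop visits each group once, so the total work stays proportional to the number of rows plus the number of groups.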

Part 1

In this part of the homework, you will write code to do various analytical operations on CSV data. Sounds a bit like déjà vu! The functions you write for this part are exactly the same as in Part 0, but this time you will use the pandas library to implement them.

For this step of the assignment, you will be implementing various functions to answer questions about the dataset. All of the problems are the same as Part 0 save for the fact that the input is now a DataFrame; you can check back to that page to see any examples you might need for this part.

Each function should take the DataFrame returned by the pd.read_csv function as the first argument, along with any other arguments specified in each problem. For example, for the third function, we would call filter_range(data, 1, 10) where data was the DataFrame returned by pd.read_csv.

This data structure should not be modified by any function you write. Every problem that deals with strings should be case-sensitive (this means "chArIzard" is a different species than "Charizard"). You may make the following assumptions about the inputs:

  • You may assume the DataFrame is non-empty for all functions you implement.
  • For each problem, you may assume we pass parameters of the expected types described for that problem and that those parameters are not None.
  • You should make no other assumptions about the parameters or the data.

Because this is the first assignment where you are using pandas, we will provide a "wordbank" of pandas functions/features that you might want to use on these problems. A brief version of this list is shown below and you can find more background information about what these items mean by reading this Jupyter Notebook. At the top-right of this page in Ed is a "Fork" button (looks like a fork in the road). This will make your own copy of this Notebook so you can run the code and experiment with anything there!

Not every entry in the wordbank will necessarily be used, and you will probably use certain functions/features listed multiple times for this assignment.

  • Get a column of a DataFrame
  • Get a row of a DataFrame (loc)
  • Filtering
  • Loop over Series
  • groupby
  • min
  • max
  • idxmin
  • idxmax
  • count
  • mean
  • unique

When using data science libraries like pandas, it's extremely helpful to actually interact with the tools you're using so you can have a better idea about the shape of your data. The preferred practice by people in industry is to use a Jupyter Notebook, like we have been in lecture, to play around with the dataset to help figure out how to answer the questions you want to answer. This is incredibly helpful when you're first learning a tool as you can actually experiment and get real-time feedback on whether the code you wrote does what you want.

We recommend that you try figuring out how to solve these problems in a Jupyter Notebook so you can actually interact with the data. We have made a playground Jupyter Notebook for you to use that already has the data loaded. You will want to press the "Fork" button at the top-right (looks like a fork in the road) so you can get your own copy that you may edit.

Some of the functions below ask you to return a Python list or dict to keep it symmetric with Part 0. This will be difficult to do if you are working with pandas objects and are asked to not use any loops! You will want to use the following fact: you can use Python's casting to convert a Series into either a list or a dict.

For example, suppose I had the following CSV represented in a pandas DataFrame named data:

name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl

Then I could convert a Series derived from this DataFrame to a list or dict with the following syntax:

names = data['name']  # Series
list(names)  # ['Fido', 'Meowrty', 'Chester', 'Phil']
dict(names)  # {0: 'Fido', 1: 'Meowrty', 2: 'Chester', 3: 'Phil'}
row = data.loc[1]  # Series
list(row)  # ['Meowrty', 6, 'cat']
dict(row)  # {'name': 'Meowrty', 'age': 6, 'species': 'cat'}

This is not any sort of magic! For list, it just uses the values in the Series. For dict, it uses the index as keys and the values as values.
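
The same fact applies when the index of a Series is not just row numbers. As a hedged sketch on the pet data above (inlined through io.StringIO so it runs without a file; the variable names are only for illustration), the result of a groupby is a Series indexed by the group values, so dict uses those values as keys:

```python
import io

import pandas as pd

# The pet CSV from above, inlined so the example is self-contained.
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))

# Series whose index is the species and whose values are the max ages.
oldest = data.groupby('species')['age'].max()
oldest_dict = dict(oldest)
# Compares equal to {'axolotl': 1, 'cat': 6, 'dog': 4}

# count gives the number of non-missing names within each group.
pets_per_species = dict(data.groupby('species')['name'].count())
# Compares equal to {'axolotl': 1, 'cat': 1, 'dog': 2}
```

The values in these dictionaries are NumPy numbers rather than plain Python ints, but they compare equal to the plain values shown in the comments.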

Part 1 Expectations

  • All functions for this part of the assignment should be written in hw2_pandas.py
  • For this part of the assignment, you may import the math and pandas modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions. You should be able to solve all of the problems by only calling functions on pandas objects. The goal of this part of the assignment is to use pandas as a tool to help answer questions about your dataset.
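
To show what "only calling functions on pandas objects" can look like, here is a minimal sketch that exercises a few wordbank entries (filtering, unique, idxmin, loc) on the small pet dataset used earlier on this page. The variable names are hypothetical and the data is inlined so the example runs without a file.

```python
import io

import pandas as pd

CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""

data = pd.read_csv(io.StringIO(CSV_TEXT))

# Filtering: a boolean Series selects rows; & combines two conditions.
young = data[(data['age'] >= 1) & (data['age'] < 5)]
young_names = list(young['name'])  # ['Fido', 'Chester', 'Phil']

# unique returns the distinct values found in a column.
species_total = len(data['species'].unique())  # 3

# idxmin returns the index label of the first smallest value, so a
# tie goes to the row that appears earlier.
youngest = data.loc[data['age'].idxmin()]
youngest_name = youngest['name']  # 'Chester'
```

None of these steps needs a loop or a comprehension; each one is a single operation on a DataFrame or Series.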

Problem 1: species_count

Write a function species_count that returns the number of unique Pokemon species (determined by the name attribute) found in the dataset. You may assume that the data is well formatted in the sense that you don't have to transform any values in the name column.

Problem 2: max_level

Write a function max_level that finds the Pokemon with the max level and returns a tuple of length 2, where the first element is the name of the Pokemon and the second is its level. If there is a tie, the Pokemon that appears earlier in the file should be returned.

Problem 3: filter_range

Write a function called filter_range that takes as arguments a smallest (inclusive) and largest (exclusive) level value and returns a list of Pokemon names having a level within that range. The list should return the species names in the same order that they appear in the provided DataFrame.

Note that you will want to return a Python list for this function so you will have to convert from a pandas object to a list.

Problem 4: mean_attack_for_type

Write a function called mean_attack_for_type that takes a Pokemon type (string) as an argument and that returns the average attack stat for all the Pokemon in the dataset with that type.

If there are no Pokemon of the given type, this function should return None.

Problem 5: count_types

Write a function called count_types that returns a dictionary with keys that are Pokemon types and values that are the number of times that type appears in the dataset.

The order of the keys in the dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.

Problem 6: highest_stage_per_type

Write a function called highest_stage_per_type that calculates the largest stage reached for each type of Pokemon in the dataset. This function should return a dictionary that has keys that are the Pokemon types and values that are the highest value of stage column for that type of Pokemon.

The order of the keys in the returned dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.

Problem 7: mean_attack_per_type

Write a function called mean_attack_per_type that calculates the average attack for every type of Pokemon in the dataset. This function should return a dictionary that has keys that are the Pokemon types and values that are the average attack for that Pokemon type.

The order of the keys in the dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.

Part 2

In this part of the assignment, you will write tests for your solutions in Part 0 and Part 1.

Like in Homework 1, we have provided a function called assert_equals that takes an expected value and the value returned by your function, and compares them: if they don't match, the function will crash the program and tell you what was wrong. See Homework 1 - Part 1 for more instructions and examples of how to write and call tests.

Recall that on HW1, you had to use an absolute path (e.g. /home/poem.txt) on Ed in your testing program. This has to do with how Ed runs your program when marking it (it only copies your Python files into its testing directory). You will need to continue to do this for future assignments.

What this means is that in your hw2_test.py, any place you specify a file name (e.g. poem.txt), you should use the absolute path on Ed instead (e.g. /home/poem.txt).

Part 2 Expectations

For full credit, your hw2_test.py must satisfy all of the following conditions:

  • Tests that use the file pokemon_box.csv do not count as your own test cases. The file is too large to be able to meaningfully come up with the correct answer on your own (e.g., it's not valid to run your code and then paste the output as the "correct output"). You should submit your own testing CSV files.
  • Uses the main method pattern shown in class.
  • Your tests should be efficient in the sense that you don't re-read the datasets from the CSV files for every test function. If you want to use the same dataset in multiple functions, read it in once in main and pass it as a parameter to your test functions.
  • Has a function to test each of the functions in the hw2_manual.py and hw2_pandas.py that you were asked to write. It's okay to merge the test functions for the same problem in the different parts of the homework since their outputs are the same. You should not organize your tests such that there are only two test functions (e.g., test_hw2_manual and test_hw2_pandas) as we want to separate the test functions by each of the problems in the spec. There might be some redundancy in your tests which is expected since it's hard to factor out the different function calls and input types.
  • For each function you need to test, you should include a test that tests the example from the spec and one additional test that is different from what is shown in the spec. A single test is considered a call on assert_equals.
  • Each of these test functions should have a descriptive name that indicates which function is being tested (e.g. test_funky_sum)
  • Each of the test functions must be called from main.
  • Turn in any test CSV files you generate.

Evaluation

Your submission will be evaluated on the following dimensions:

  • Your solution correctly implements the described behaviors. You will have access to some tests when you turn in your assignment, but we will withhold other tests to test your solution when grading. All behavior we test is completely described by the problem specification or shown in an example.
  • When we run your code, it should produce no errors or warnings.
  • You should remove any debug print statements. Some students report Ed crashing when they try to print the whole dataset; remove those extra print statements to prevent excessive output.
  • Your code meets our style requirements:
    • All files submitted pass flake8
    • Your program should be written with good programming style. This means you should use the proper naming convention for methods (snake_case), your code should not be overly redundant and should avoid unnecessary computations.
    • Every function written is commented using a doc-string format that describes its behavior, parameters, returns, and highlights any special cases.
    • There is a comment at the top of each file you write with your name, section, and a brief description of what that program does.
    • All expectations on this page or the sub-pages for the assignment are met, as are all requirements for each of the problems.

A lot of students have been asking questions like "Can I use this method or can I use this language feature in this class?". The general answer to this question is it depends on what you want to use, what the problem is asking you to do and if there are any restrictions that problem places on your solution.

There is no automatic deduction for using some advanced feature or using material that we have not covered in class yet, but if it violates the restrictions of the assignment, it is possible you will lose points. It's not possible for us to list out every possible thing you can't use on the assignment, but we can say for sure that you are safe to use anything we have covered in class so far as long as it meets what the specification asks and you are appropriately using it as we showed in class.

For example, some things that are probably okay to use even though we didn't cover them:

  • Using the update method on the set class even though I didn't show it in lecture. It was clear we talked about sets and that you are allowed to use them on future assignments and if you found a method on them that does what you need, it's probably fine as long as it isn't violating some explicit restriction on that assignment.
  • Using something like a ternary operator in Python. This doesn't make a problem any easier, it's just syntax.

For example, some things that are probably not okay to use:

  • Importing some random library that can solve the problem we ask you to solve in one line.
  • If the problem says "don't use a loop" to solve it, it would not be appropriate to use some advanced programming concept like recursion to "get around" that restriction.

These are not allowed because they might make the problem trivially easy or violate what the learning objective of the problem is.

You should think about what the spec is asking you to do and as long as you are meeting those requirements, we will award credit. If you are concerned that an advanced feature you want to use falls in that second category above and might cost you points, then you should just not use it! These problems are designed to be solvable with the material we have learned so far so it's entirely not necessary to go look up a bunch of advanced material to solve them.

tl;dr: We will not be answering every question of "Can I use X" or "Will I lose points if I use Y" because the general answer is "You are not forbidden from using anything as long as it meets the spec requirements. If you're unsure if it violates a spec restriction, don't use it and just stick to what we learned before the assignment was released."

Submission

This assignment is due by Thursday, April 23 at 23:59 (PDT). Please refer to the late policy for details on how late you may turn in an assignment.

You should submit your finished hw2_manual.py, hw2_pandas.py, and hw2_test.py on Ed.

You may submit your assignment as many times as you want before the late cutoff (remember submitting after the due date will cost late days). Recall on Ed, you submit by pressing the "Mark" button. You are welcome to develop the assignment on Ed or develop locally and then upload to Ed before marking.