Homework 2: Processing CSV Data

In this assignment, you will apply the foundational Python skills you've been developing to analyze a small dataset. Many datasets you'll work with are structured as CSV files or other tabular representations. This assignment is an introduction to reading, processing, and grouping rows and columns to calculate some interesting statistics. A strong foundation in these skills will be very useful when we work with much larger (and less complete) real-world datasets!

This assignment is broken into two main parts, where each part performs mostly the same computations in different ways. This gives you the opportunity to compare and contrast different approaches to solving problems.

Learning Objectives

After this homework, students will be able to:

Follow a Python development workflow for this course, including:
- Writing a Python script from scratch and turning in the assignment.
- Using the course infrastructure (flake8, test suites, course resources).
Use Python to review CS1 programming concepts and implement programs that follow a specification, including:
- Use/manipulation of various data types including numbers and strings.
- Control structures (conditional statements, loops, parameters, returns) to implement basic functions provided by a specification.
- Basic text file processing.
- Documenting code.
Write unit tests for each function you write, including edge cases.
Work with data structures (lists, sets, dictionaries) in Python
Process structured data in Python, using CSV files as input, both with and without a library (Pandas)
- Handle edge cases appropriately, including addressing missing values/data
- Practice user-friendly error-handling
Apply programming to identify and investigate a question on a dataset using basic statistical concepts (e.g. mean, max)

Expectations

Here are some baseline expectations you should meet:

Follow the course collaboration policies
Do not round any outputs or return values for this assignment!

If you are developing on Ed, all the files are there. If you are developing locally, you should download the starter code hw2.zip and open it as the project in Visual Studio Code. The files included are:

hw2_manual.py: The file for you to put solutions to Part 0.
hw2_pandas.py: The file for you to put solutions to Part 1.
hw2_test.py: The file for you to put your tests for Part 0 and Part 1.
cse163_utils.py: A file where we will store utility functions for helping you write tests.
run_hw2.py: A client program provided to call your functions. This is just for your convenience.
pokemon_box.csv: A CSV file that stores information about Pokemon. The columns of this file are explained below.
pokemon_test.csv: A very small CSV file that stores information about Pokemon. The columns of this file are explained below.

For this assignment, you will be working with a dataset of Pokemon that you have caught on your Pokemon journey so far. The file pokemon_box.csv stores all the data about the captured Pokemon and has a format that looks like:

id	name	level	personality	type	weakness	atk	def	hp	stage
1	Bulbasaur	12	Jolly	Grass	Fire	45	50	112	1
...	...	...	...	...	...	...	...	...	...

Note that because this is a CSV file, the file contents have these cells separated by commas.

Column Descriptions

id: Unique identification number corresponding to the species of a Pokemon. Note that if there are multiple Pokemon of the same species in the dataset, they all share the same id.
name: Name of the species of Pokemon. For example Pikachu.
level: The level of this Pokemon (an integer)
personality: A one-word string describing the personality of this Pokemon
type: A one-word string describing the type of the Pokemon (e.g. "Grass" for Bulbasaur)
weakness: What type this Pokemon is weak to. For example, Bulbasaur is considered weak to the fire type.
atk, def, hp: Pokemon stats that indicate how strong a Pokemon's attacks are (atk), how much hits affect it (def), and how many hits it can take (hp)
stage: Indicates how many times this Pokemon has evolved into a new species. For example, a Charmander (stage 1) evolves into a Charmeleon (stage 2), which evolves into a Charizard (stage 3).

In this part of the homework, you will write code to perform various analytical operations on data parsed from a file into a list-of-dictionaries representation.

For this step of the assignment, you will be implementing various functions to answer questions about the dataset.

Each function should take the list returned by the cse163_utils.parse function (provided for you) as the first argument, along with any other arguments specified in each problem. For example, for the third function, we would call filter_range(data, 1, 10), where data is the list returned by cse163_utils.parse.

This data structure should not be modified by any function you write. Every problem that deals with strings should be case-sensitive (this means "chArIzard" is a different species than "Charizard"). You may make the following assumptions about the inputs:

You may assume the given list is non-empty for all functions you implement.
For each problem, you may assume we pass parameters of the expected types described for that problem and that those parameters are not None.
You should make no other assumptions about the parameters or the data.
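
To illustrate the "do not modify the data" and case-sensitivity requirements, here is a minimal sketch. The helper names_by_level is hypothetical and is not one of the assigned problems, and the two hand-written rows only mimic the list-of-dictionaries format with a subset of the columns.

```python
import copy

# Two hand-written rows in the list-of-dictionaries format
# (only a subset of the columns, for brevity).
data = [
    {'name': 'Starmie', 'level': 67},
    {'name': 'Arcanine', 'level': 35},
]
snapshot = copy.deepcopy(data)  # kept only to check for mutation afterwards


def names_by_level(rows):
    """
    Hypothetical helper (not part of the assignment): returns the names
    in rows ordered by increasing level, without changing rows.
    """
    # sorted() builds a new list; rows.sort() would instead reorder
    # the caller's list in place, which this assignment does not allow.
    ordered = sorted(rows, key=lambda row: row['level'])
    return [row['name'] for row in ordered]


result = names_by_level(data)
print(result)
print(data == snapshot)            # True when the input was left untouched
print('chArIzard' == 'Charizard')  # string comparison is case-sensitive
```

Comparing the input against a copy taken before the call, as done here, is one way a test could check that a function leaves the given list unchanged.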

For each of the problems, we will use the file pokemon_test.csv to show what should be returned.

id,name,level,personality,type,weakness,atk,def,hp,stage
59,Arcanine,35,impish,fire,water,50,55,90,2
59,Arcanine,35,gentle,fire,water,45,60,80,2
121,Starmie,67,sassy,water,electric,174,56,113,2
131,Lapras,72,lax,water,electric,107,113,29,1
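
To make the list-of-dictionaries representation concrete, here is a minimal sketch of how the rows above could look once parsed. The function parse_sketch is a hypothetical stand-in written with the standard csv module. It is not the provided cse163_utils.parse, whose implementation may differ, and the conversion of numeric columns to int is an assumption. It is for illustration only and is not meant to be copied into hw2_manual.py.

```python
import csv
import io

# Contents of pokemon_test.csv, inlined so the sketch runs on its own.
CSV_TEXT = """id,name,level,personality,type,weakness,atk,def,hp,stage
59,Arcanine,35,impish,fire,water,50,55,90,2
59,Arcanine,35,gentle,fire,water,45,60,80,2
121,Starmie,67,sassy,water,electric,174,56,113,2
131,Lapras,72,lax,water,electric,107,113,29,1
"""

# Assumption: these columns are stored as integers after parsing.
INT_COLUMNS = ['id', 'level', 'atk', 'def', 'hp', 'stage']


def parse_sketch(text):
    """
    Hypothetical stand-in for cse163_utils.parse: turns CSV text into
    a list of dictionaries, one dictionary per row, keyed by column name.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in INT_COLUMNS:
            row[col] = int(row[col])
        rows.append(row)
    return rows


data = parse_sketch(CSV_TEXT)
print(len(data))        # one dictionary per data row
print(data[0]['name'])  # each value is looked up by its column name
print(data[3]['hp'])
```

Each element of data is one row of the file, so a function that receives this list can loop over it and read values by column name.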

All functions for this part of the assignment should be written in hw2_manual.py.
For this part of the assignment, you may import the math module, but you may not use any other imports to solve these problems.

Problem 1: `species_count`

Write a function species_count that returns the number of unique Pokemon species (determined by the name attribute) found in the dataset. You may assume that the data is well formatted in the sense that you don't have to transform any values in the name column.