CSE 163, Winter 2020: Homework 2: Part 1

Overview

In this part of the homework, you will write code to perform various analytical operations on CSV data. Sounds a bit like déjà vu! The functions you write for this part are exactly the same as in Part 0, but this time you will use the pandas library to implement them.

Expectations

  • All functions for this part of the assignment should be written in hw2_pandas.py
  • For this part of the assignment, you may import the math and pandas modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions. You should be able to solve all of the problems by only calling functions on pandas objects. The goal of this part of the assignment is to use pandas as a tool to help answer questions about your dataset.

Analysis

For this step of the assignment, you will be implementing various functions to answer questions about the dataset. All of the problems are the same as Part 0 save for the fact that the input is now a DataFrame; you can check back to that page to see any examples you might need for this part.

Each function should take the DataFrame returned by the pd.read_csv function as the first argument, along with any other arguments specified in each problem. For example, for the third function, we would call filter_range(data, 1, 10) where data was the DataFrame returned by pd.read_csv.

This data structure should not be modified by any function you write. Every problem that deals with strings should be case-sensitive (this means "chArIzard" is a different species than "Charizard"). You may make the following assumptions about the inputs:

  • You may assume the DataFrame is non-empty for all functions you implement.
  • For each problem, you may assume we pass parameters of the expected types described for that problem and that those parameters are not None.
  • You should make no other assumptions about the parameters or the data.
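
To make the calling convention and the "do not modify the data" rule concrete, here is a minimal sketch. It uses a small made-up pets table rather than the homework dataset, and names_of_species is a hypothetical helper, not one of the assigned problems. It assumes that building a new DataFrame with a boolean mask (instead of assigning into data) is an acceptable way to leave the input untouched.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file on disk (not the homework dataset).
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""


def names_of_species(data, species):
    """Hypothetical helper: names of all pets whose species matches exactly."""
    # Boolean-mask filtering builds a new DataFrame; `data` is left untouched.
    matches = data[data['species'] == species]
    return list(matches['name'])


# The DataFrame returned by pd.read_csv is always the first argument.
data = pd.read_csv(io.StringIO(CSV_TEXT))
snapshot = data.copy()  # kept only to check that nothing was modified

dogs = names_of_species(data, 'dog')
shouting_dogs = names_of_species(data, 'DOG')  # case-sensitive: no match
```

Comparing data against snapshot with data.equals(snapshot) after the calls is one way to convince yourself that a function did not modify its input.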

Hint: pandas wordbank

Because this is the first assignment where you are using pandas, we will provide a "wordbank" of pandas functions/features that you might want to use on these problems. A brief version of this list is shown below, and you can find more background information about what these items mean by reading this Jupyter Notebook. Not every entry in the wordbank will necessarily be used, and you will probably use some of the listed functions/features multiple times in this assignment.

  • Get a column of a DataFrame
  • Get a row of a DataFrame (loc)
  • Filtering
  • Loop over Series
  • groupby
  • min
  • max
  • idxmin
  • idxmax
  • count
  • mean
  • unique
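
As a rough illustration of what several wordbank entries look like in practice, the sketch below applies them to a small made-up pets table. The table and all variable names are assumptions chosen for the example; none of this is the homework dataset or a solution to a specific problem.

```python
import io

import pandas as pd

# Made-up example data (not the homework dataset).
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""
data = pd.read_csv(io.StringIO(CSV_TEXT))

ages = data['age']        # get a column -> Series
second_row = data.loc[1]  # get a row by its index label -> Series

# Filtering: combine boolean masks with & (each condition in parentheses).
young = data[(data['age'] >= 1) & (data['age'] < 4)]

species_seen = data['species'].unique()  # distinct values, first-seen order

# groupby + count: number of rows in each species group.
pets_per_species = data.groupby('species')['name'].count()

youngest_age = ages.min()
oldest_age = ages.max()
mean_age = ages.mean()
oldest_label = ages.idxmax()  # index label of the row holding the max
```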

Development Strategy

When using data science libraries like pandas, it's extremely helpful to actually interact with the tools you're using so you can get a better idea of the shape of your data. The preferred practice in industry is to use a Jupyter Notebook, like we have been in lecture, to play around with the dataset and figure out how to answer the questions you want to answer. This is incredibly helpful when you're first learning a tool, as you can experiment and get real-time feedback on whether the code you wrote does what you want.

We recommend that you try figuring out how to solve these problems in a Jupyter Notebook so you can actually interact with the data. We have made a playground Jupyter Notebook for you to use that already has the data loaded. Remember that playground notebooks on Colaboratory are temporary unless you save them to your Google Drive! If you want to save your work on the notebook, make sure you explicitly press the save button and follow the instructions to copy!

About Returns

Some of the functions below ask you to return a Python list or dict to keep it symmetric with Part 0. This will be difficult to do if you are working with pandas objects and are asked to not use any loops! You will want to use the following fact: you can use Python's casting to convert a Series into either a list or a dict.

For example, suppose I had the following CSV represented in a pandas DataFrame named data:

name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl

Then I could convert a Series derived from this DataFrame to a list or dict with the following syntax:

names = data['name']  # Series
list(names)  # ['Fido', 'Meowrty', 'Chester', 'Phil']
dict(names)  # {0: 'Fido', 1: 'Meowrty', 2: 'Chester', 3: 'Phil'}
row = data.loc[1]  # Series
list(row)  # ['Meowrty', 6, 'cat']
dict(row)  # {'name': 'Meowrty', 'age': 6, 'species': 'cat'}

This is not any sort of magic! For list, it just uses the values in the Series. For dict, it uses the index as keys and the values as values.
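
The same conversion works on a Series produced by other pandas operations. As a small sketch on the pets table above (the variable names are made up for the example), the result of a groupby is a Series indexed by the group labels, so dict turns it into a mapping from group label to value:

```python
import io

import pandas as pd

# The pets table from the example above, rebuilt so this runs on its own.
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Series whose index is the species and whose values are the max ages.
max_age = data.groupby('species')['age'].max()

# dict uses the index (species) as keys and the max ages as values.
max_age_by_species = dict(max_age)

# list works the same way on the array returned by unique().
species_list = list(data['species'].unique())
```

Here max_age_by_species compares equal to {'axolotl': 1, 'cat': 6, 'dog': 4}.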

Problem 1: species_count

Write a function species_count that returns the number of unique Pokemon species (determined by the name attribute) found in the dataset. You may assume that the data is well formatted in the sense that you don't have to transform any values in the name column.

Problem 2: max_level

Write a function max_level that finds the Pokemon with the max level and returns a tuple of length 2, where the first element is the name of the Pokemon and the second is its level. If there is a tie, the Pokemon that appears earlier in the file should be returned.
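
On the tie-breaking rule: idxmin and idxmax return the index label of the first occurrence when several rows share the extreme value. The sketch below illustrates this on a made-up pets table (not the homework dataset) where two pets share the minimum age; it is a hint about the behavior, not a full solution.

```python
import io

import pandas as pd

# Made-up example data: Chester and Phil are tied for the youngest age.
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""
data = pd.read_csv(io.StringIO(CSV_TEXT))

# idxmin gives the label of the FIRST row holding the minimum value.
youngest_label = data['age'].idxmin()

# loc looks that row up; pick out the two fields for a length-2 tuple.
youngest = data.loc[youngest_label]
result = (youngest['name'], youngest['age'])
```

In this example youngest_label is 2, so the tuple describes Chester, who appears before Phil in the file.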

Problem 3: filter_range

Write a function called filter_range that takes as arguments a smallest (inclusive) and largest (exclusive) level value and returns a list of the names of all Pokemon whose level is within that range. The names in the returned list should be in the same order that they appear in the DataFrame.

Note that you will want to return a Python list for this function so you will have to convert from a pandas object to a list.

Problem 4: mean_attack_for_type

Write a function called mean_attack_for_type that takes a Pokemon type (string) as an argument and that returns the average attack stat for all the Pokemon in the dataset with that type.

If there are no Pokemon of the given type, this function should return None.
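
One thing to watch for here: calling mean on an empty selection yields NaN rather than None, so the "no matches" case needs its own check. The sketch below shows this on a made-up pets table; mean_age_for_species is a hypothetical helper for illustration, and checking len is only one possible way to detect an empty result.

```python
import io
import math

import pandas as pd

# Made-up example data (not the homework dataset); it contains no birds.
CSV_TEXT = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Filtering on a value that never occurs gives an empty DataFrame...
birds = data[data['species'] == 'bird']
# ...and the mean of an empty column is NaN, not None.
empty_mean = birds['age'].mean()


def mean_age_for_species(data, species):
    """Hypothetical helper: mean age for a species, or None if absent."""
    matches = data[data['species'] == species]
    if len(matches) == 0:  # no rows matched, so there is nothing to average
        return None
    return matches['age'].mean()


no_birds = mean_age_for_species(data, 'bird')
dog_mean = mean_age_for_species(data, 'dog')
```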

Problem 5: count_types

Write a function called count_types that returns a dictionary with keys that are Pokemon types and values that are the number of times that type appears in the dataset.

The order of the keys in the dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.

Problem 6: highest_stage_per_type

Write a function called highest_stage_per_type that calculates the largest stage reached for each type of Pokemon in the dataset. This function should return a dictionary whose keys are the Pokemon types and whose values are the highest value of the stage column for that type of Pokemon.

The order of the keys in the returned dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.

Problem 7: mean_attack_per_type

Write a function called mean_attack_per_type that calculates the average attack for every type of Pokemon in the dataset. This function should return a dictionary that has keys that are the Pokemon types and values that are the average attack for that Pokemon type.

The order of the keys in the dictionary does not matter.

Note that you will want to return a Python dictionary for this function so you will have to convert from a pandas object to a dictionary.