CSE 163, Spring 2019: Homework 3: Part 0

Statistical Functions with Pandas

In this part of the homework, you will write code to perform various analytical operations on data parsed from a file.

Expectations

  • All functions for this part of the assignment should be written in hw3.py
  • For this part of the assignment, you may import and use the math and pandas modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions. You should be able to solve all of the problems by only calling functions on pandas objects. The one exception is when a function asks you to return a Python list/dictionary and you need to convert from a pandas object to a Python list/dictionary; if this is the case you are allowed to have one loop at the end of the function to build up the list/dictionary, but this loop should not have any real logic in it besides moving values from the pandas structure to the list/dictionary. The goal of this part of the assignment is to use pandas as a tool to help answer questions about your dataset.
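As a rough illustration of the style described above, the sketch below computes a per-group mean with pandas calls only, then uses a single logic-free loop to copy the result into a Python dictionary. The DataFrame, its column names, and its values are made up for this example and are not taken from the homework dataset.

```python
import pandas as pd

# Hypothetical toy data; not the homework dataset.
df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B'],
    'temp': [10.0, 14.0, 20.0, 22.0],
})

# The computation itself uses only pandas calls, with no explicit loop.
mean_temp = df.groupby('city')['temp'].mean()

# The one permitted loop: it only moves values from the Series into a dict.
result = {}
for city, temp in mean_temp.items():
    result[city] = temp

print(result)
```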

Problems

Problem 0) Parse data

In your main method, parse the data from the CSV file using pandas. Note that the file uses '---' to represent missing data. The pandas function for reading a CSV file takes a parameter called na_values: a list of strings that mark missing values in the file. Every entry that matches one of those strings is replaced with NaN. You should specify this parameter to make sure the data parses correctly.
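A minimal sketch of how na_values behaves, assuming a tiny in-memory CSV in place of the real file. The contents of csv_text are invented for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the homework file.
csv_text = "Year,Sex,Total\n2007,F,33.0\n2007,M,---\n"

# na_values lists the strings that should be parsed as NaN.
df = pd.read_csv(io.StringIO(csv_text), na_values=['---'])

# With '---' treated as missing, the Total column parses as floats
# and the second row is flagged as missing.
print(df['Total'].isna())
```

Without na_values, the '---' entry would force the whole Total column to be read as strings, and numeric operations on it would not behave as expected.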

Problem 1) completions_between_years

What are the percentages of different degrees completed for a given year range and sex? Write a function completions_between_years that takes as arguments a pandas DataFrame object, two year arguments, and a value for sex ('A', 'F', or 'M'). The function should return all rows of the data that match the given sex and have a year within the given range (inclusive for the start, exclusive for the end). If no data is found for the parameters, return None.

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data:

completions_between_years(data, 2007, 2008, 'F')
     Year  Sex   Min degree  Total  White  Black  Hispanic  Asian  Pacific Islander  American Indian/Alaska Native  Two or more races
152  2007    F  high school   89.1   94.2   87.9      70.7   98.5              86.0                           90.2               87.9
168  2007    F  associate's   43.2   50.8   28.0      23.5   69.6              42.5                           14.5               40.2
186  2007    F   bachelor's   33.0   39.2   20.0      15.4   62.5              32.1                            NaN               29.6
202  2007    F     master's    7.6    9.4    3.7       2.6   17.7               NaN                            NaN                NaN
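The half-open year range (start included, end excluded) can be expressed as a boolean mask. The sketch below shows the general technique on a made-up DataFrame. The helper name rows_in_range and the toy values are hypothetical, and this is not presented as the reference solution.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the example above.
toy = pd.DataFrame({
    'Year': [2006, 2007, 2007, 2008],
    'Sex': ['F', 'F', 'M', 'F'],
    'Total': [32.0, 33.0, 29.0, 34.5],
})


def rows_in_range(df, start, end, sex):
    """Return rows with start <= Year < end and the given Sex, or None."""
    # Combine element-wise conditions with &; each needs its own parentheses.
    mask = (df['Year'] >= start) & (df['Year'] < end) & (df['Sex'] == sex)
    matched = df[mask]
    if matched.empty:
        return None
    return matched


print(rows_in_range(toy, 2007, 2008, 'F'))
```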

Problem 2) compare_bachelors_1980

What were the percentages of women vs. men who had earned a bachelor's degree in 1980? Call this method compare_bachelors_1980 and return the percentages as a tuple: (% for men, % for women).

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, compare_bachelors_1980(data) will return (24.0, 21.0).

Problem 3) top_2_2000s

What were the two most commonly awarded levels of educational attainment between 2000 and 2010 (inclusive)? Use the mean percent over the years to compare the education levels. Call this method top_2_2000s and return a list of tuples as follows: [(#1 level, mean % of #1 level), (#2 level, mean % of #2 level)].

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then top_2_2000s(data) will return [('high school', 87.55714285714285), ("associate's", 38.75714285714286)]. Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.
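One possible way to rank groups by their mean is groupby followed by nlargest. The sketch below applies that pattern to a made-up DataFrame whose column names and values are assumptions for illustration. It does not apply the year or sex filtering that the real problem would need.

```python
import pandas as pd

# Hypothetical toy data; levels 'x', 'y', 'z' are placeholders.
toy = pd.DataFrame({
    'Year': [2000, 2001, 2000, 2001, 2000, 2001],
    'Level': ['x', 'x', 'y', 'y', 'z', 'z'],
    'Total': [80.0, 90.0, 30.0, 40.0, 10.0, 20.0],
})

# Mean of Total for each level, then keep the two largest means.
means = toy.groupby('Level')['Total'].mean()
top = means.nlargest(2)

# Single logic-free loop to move the values into a list of tuples.
pairs = []
for level, value in top.items():
    pairs.append((level, value))

print(pairs)
```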

Optional: Why 0.001?

Whenever you work with floating point numbers, you are very likely to run into the imprecision of floating point arithmetic. You have probably seen this on an everyday calculator! If you take 1, divide by 3, and then multiply by 3 again, you could get something like 0.99999999 instead of the 1 you would expect.

This happens because floats are stored in a finite number of bits, so at some point we lose precision. Below, we show some example Python expressions that give imprecise results.

Because of this, you can never safely check whether one float is == to another. Instead, we only check that the numbers match within some small delta that is acceptable for the application. We chose 0.001 somewhat arbitrarily; if you need really high accuracy, you would allow only smaller deviations, but exact equality is never guaranteed.

print(0.1 + 0.2)
# 0.30000000000000004

print((1.1 + 1.2) - 1.3)
# 0.9999999999999998

print(1.1 + (1.2 - 1.3))
# 1.0
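A tolerance-based comparison can be sketched with math.isclose from the standard library. This only illustrates the idea of comparing within a delta; it is not the course's assert_equals, whose implementation is not shown here.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Exact equality fails because a is 0.30000000000000004.
print(a == b)
# False

# Comparing within an absolute tolerance of 0.001 succeeds.
print(math.isclose(a, b, abs_tol=0.001))
# True

# The same check written by hand.
print(abs(a - b) < 0.001)
# True
```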

Problem 4) percent_change_bachelors_2000s

What is the difference between the total percent of bachelor's degrees received in 2000 and the total percent received in 2010? Take a sex parameter so the client can specify 'M', 'F', or 'A' for evaluating. If a call does not specify the sex to evaluate, you should evaluate the percent change for all students (sex = 'A'). Call this method percent_change_bachelors_2000s and return the difference as a float.

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then the call percent_change_bachelors_2000s(data) will return 2.599999999999998. Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.
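The optional sex argument can be handled with a Python default parameter value. The sketch below shows the pattern on a made-up DataFrame. The function name change_between, the toy values, and the choice to subtract the 2000 value from the 2010 value are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; not the homework dataset.
toy = pd.DataFrame({
    'Year': [2000, 2010, 2000, 2010],
    'Sex': ['A', 'A', 'F', 'F'],
    'Total': [10.0, 12.5, 11.0, 15.0],
})


def change_between(df, sex='A'):
    """Return Total in 2010 minus Total in 2000 for the given sex."""
    rows = df[df['Sex'] == sex]
    # .loc with a boolean mask selects the matching row; .iloc[0] takes its value.
    first = rows.loc[rows['Year'] == 2000, 'Total'].iloc[0]
    last = rows.loc[rows['Year'] == 2010, 'Total'].iloc[0]
    return float(last - first)


print(change_between(toy))       # sex defaults to 'A'
print(change_between(toy, 'F'))  # sex given explicitly
```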