CSE 163, Winter 2020: Homework 3: Part 0

Statistical Functions with Pandas

In this part of the homework, you will write code to perform various analytical operations on data parsed from a file.

Expectations

  • All functions for this part of the assignment should be written in hw3.py
  • For this part of the assignment, you may import and use the math and pandas modules, but you may not use any other imports to solve these problems.
  • For all of the problems below, you should not use ANY loops or list/dictionary comprehensions. The goal of this part of the assignment is to use pandas as a tool to help answer questions about your dataset.

Problems

Problem 0) Parse data

In your main method, parse the data from the CSV file using pandas. Note that the file uses '---' as the entry to represent missing data.

The pandas function for reading a CSV file takes a parameter called na_values that accepts a str specifying which entries in the file represent missing data. Every entry equal to that string is replaced with NaN. You should specify this parameter to make sure the data parses correctly.
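As a minimal sketch of how na_values behaves, the snippet below parses a tiny made-up CSV from a string instead of the real file. The sample rows and the use of io.StringIO are assumptions for illustration only; in hw3.py you would pass the file name to the read function directly.

```python
import io
import pandas as pd

# Hypothetical three-line sample; the real file has more rows and columns.
csv_text = "Year,Sex,Total\n2007,F,33.0\n2007,M,---\n"

# Entries equal to '---' are parsed as NaN instead of the literal string.
df = pd.read_csv(io.StringIO(csv_text), na_values='---')

print(df['Total'].isna().sum())  # 1
```

Without na_values, the '---' entry would force the whole Total column to be read as strings rather than floats.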

Problem 1) completions_between_years

What are the percentages of different degrees completed for a given year range and sex?

Write a function completions_between_years that takes as arguments a Pandas DataFrame object, two year arguments, and a value for sex ('A', 'F', or 'M'). The function should return all rows of the data which match the given sex, and have data between the given years (inclusive for the start, exclusive for the end). If no data is found for the parameters, return None.

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data:

completions_between_years(data, 2007, 2008, 'F')
     Year  Sex   Min degree  Total  White  Black  Hispanic  Asian  Pacific Islander  American Indian/Alaska Native  Two or more races
152  2007    F  high school   89.1   94.2   87.9      70.7   98.5              86.0                           90.2               87.9
168  2007    F  associate's   43.2   50.8   28.0      23.5   69.6              42.5                           14.5               40.2
186  2007    F   bachelor's   33.0   39.2   20.0      15.4   62.5              32.1                            NaN               29.6
202  2007    F     master's    7.6    9.4    3.7       2.6   17.7               NaN                            NaN                NaN

The index of the DataFrame is shown as the left-most column above.
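The row selection described in this problem can be done without loops by combining boolean masks. The sketch below shows the general technique on a small made-up DataFrame; the helper name rows_in_range and the toy values are hypothetical and are not part of the assignment.

```python
import pandas as pd

# Toy data with the same column names as the example output above.
toy = pd.DataFrame({
    'Year': [2006, 2007, 2007, 2008],
    'Sex': ['F', 'F', 'M', 'F'],
    'Total': [32.0, 33.0, 28.0, 34.0],
})


def rows_in_range(df, start, end, sex):
    # Each comparison produces a boolean Series; & combines them row by row.
    # The start year is inclusive and the end year is exclusive.
    mask = (df['Year'] >= start) & (df['Year'] < end) & (df['Sex'] == sex)
    result = df[mask]
    # An empty selection is reported as None rather than an empty DataFrame.
    return None if result.empty else result


print(rows_in_range(toy, 2007, 2008, 'F'))
```

Note that the parentheses around each comparison are required, because & binds more tightly than >= and == in Python.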

Problem 2) compare_bachelors_1980

What percentages of women and men had earned a bachelor's degree in 1980? Write a function compare_bachelors_1980 that returns the result as a DataFrame with a row for men and a row for women with the columns "Sex" and "Total".

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, compare_bachelors_1980(data) will return the following (order of rows does not matter):

     Sex  Total
112    M   24.0
180    F   21.0

The index of the DataFrame is shown as the left-most column above.

Problem 3) top_2_2000s

What were the two most commonly awarded levels of educational attainment between 2000 and 2010 (inclusive)? Use the mean percent over the years to compare the education levels. Write a function top_2_2000s that returns a Series with the top two values (the index should be the degree names and the values should be the percent).

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then top_2_2000s(data) will return:

Min degree
high school    87.557143
associate's    38.757143
Name: Total, dtype: float64 <class 'pandas.core.series.Series'>

Hint: The Series class also has a method nlargest that behaves similarly to the one for the DataFrame, but does not take a column parameter (as Series objects don't have columns).
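As a rough illustration of the hint, the sketch below computes a per-group mean and then calls nlargest on the resulting Series. The toy values are made up and the filtering by year is left out; only the groupby and nlargest steps are shown.

```python
import pandas as pd

# Toy data: two rows per degree level, values invented for illustration.
toy = pd.DataFrame({
    'Min degree': ['high school', 'high school',
                   "bachelor's", "bachelor's",
                   "master's", "master's"],
    'Total': [80.0, 90.0, 30.0, 32.0, 7.0, 9.0],
})

# groupby(...)['Total'].mean() returns a Series indexed by 'Min degree'.
means = toy.groupby('Min degree')['Total'].mean()

# Series.nlargest takes only the count, since there is no column to choose.
top = means.nlargest(2)
print(top)
```

Here the group means are 85.0 for high school, 31.0 for bachelor's, and 8.0 for master's, so top keeps the first two in descending order.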

Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.

Optional: Why 0.001?

Whenever you work with floating point numbers, you are very likely to run into the imprecision of floating point arithmetic. You have probably run into this with your everyday calculator! If you take 1, divide by 3, and then multiply by 3 again, you could get something like 0.99999999 instead of the 1 you would expect.

This is due to the fact that there is only a finite number of bits to represent floats so we will at some point lose some precision. Below, we show some example Python expressions that give imprecise results.

Because of this, you can never safely check if one float is == to another. Instead, we only check that the numbers match within some small delta that is permissible by the application. We chose 0.001 somewhat arbitrarily; if you need really high accuracy you would allow only smaller deviations, but exact equality is never guaranteed.

print(0.1 + 0.2)
# 0.30000000000000004

print((1.1 + 1.2) - 1.3)
# 0.9999999999999998

print(1.1 + (1.2 - 1.3))
# 1.0
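A tolerance check like the one described above can be sketched in a couple of ways. This is not the course's assert_equals implementation, only an illustration of the idea using a plain absolute difference and the standard math.isclose function with the same 0.001 delta.

```python
import math

a = 0.1 + 0.2

# Exact equality fails because of the rounding error shown above.
print(a == 0.3)                             # False

# Comparing the absolute difference against a small delta succeeds.
print(abs(a - 0.3) < 0.001)                 # True

# math.isclose performs the same kind of check with an absolute tolerance.
print(math.isclose(a, 0.3, abs_tol=0.001))  # True
```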

Problem 4) percent_change_bachelors_2000s

What is the difference between the total percent of bachelor's degrees received in 2000 and in 2010? Take a sex parameter so the client can specify 'M', 'F', or 'A' for evaluating. If a call does not specify the sex to evaluate, you should evaluate the percent change for all students (sex = 'A'). Write a function percent_change_bachelors_2000s that returns the difference as a float.

For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then the call percent_change_bachelors_2000s(data) will return 2.599999999999998. Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.

Hint: For this problem you will need to call the squeeze() method on a Series of length 1 to get its single value.
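To illustrate the hint, the sketch below selects a one-row Series from a made-up DataFrame and uses squeeze() to turn it into a plain number. The toy values are invented, and the filtering on sex and degree level is omitted.

```python
import pandas as pd

# Toy data: one row per year, values invented for illustration.
toy = pd.DataFrame({'Year': [2000, 2010], 'Total': [29.0, 31.5]})

# Each selection is a Series of length 1; squeeze() extracts its value.
total_2000 = toy.loc[toy['Year'] == 2000, 'Total'].squeeze()
total_2010 = toy.loc[toy['Year'] == 2010, 'Total'].squeeze()

diff = total_2010 - total_2000
print(diff)  # 2.5
```

Without squeeze(), subtracting the two Series would align them by index label, and since the labels differ (0 and 1) the result would contain only NaN.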