In this part of the homework, you will write code in hw3.py to perform various analytical operations on data parsed from a file. You may use the math and pandas modules, but you may not use any other imports to solve these problems. Use pandas as a tool to help answer questions about your dataset.
In your main method, parse the data from the CSV file using pandas. Note that the file uses '---' as the entry to represent missing data.
The pandas function that reads a CSV file takes a parameter called na_values, which accepts a str specifying which entries in the file should be treated as NaN. Every occurrence of that string is replaced with NaN. You should specify this parameter to make sure the data parses correctly.
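As a minimal sketch of the na_values behavior described above, the snippet below parses a small in-memory CSV instead of the real file. The sample rows and the variable name sample_csv are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Year,Sex,Min degree,Total\n"
    "2007,F,bachelor's,33.0\n"
    "2007,F,master's,---\n"
)

# na_values tells read_csv which strings to treat as missing data.
data = pd.read_csv(sample_csv, na_values="---")

# The '---' entry is parsed as NaN, so the column stays numeric.
print(data["Total"].isna().tolist())
```

Without na_values, the '---' entry would force the whole column to be read as strings rather than floats.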
completions_between_years
What is the percentage of each degree completed for a given year range and sex?
Write a function completions_between_years that takes as arguments a pandas DataFrame object, two year arguments, and a value for sex ('A', 'F', or 'M'). The function should return all rows of the data that match the given sex and fall between the given years (inclusive for the start, exclusive for the end). If no data is found for the parameters, return None.
For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data:
completions_between_years(data, 2007, 2008, 'F')
| | Year | Sex | Min degree | Total | White | Black | Hispanic | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | 2007 | F | high school | 89.1 | 94.2 | 87.9 | 70.7 | 98.5 | 86.0 | 90.2 | 87.9 |
| 168 | 2007 | F | associate's | 43.2 | 50.8 | 28.0 | 23.5 | 69.6 | 42.5 | 14.5 | 40.2 |
| 186 | 2007 | F | bachelor's | 33.0 | 39.2 | 20.0 | 15.4 | 62.5 | 32.1 | NaN | 29.6 |
| 202 | 2007 | F | master's | 7.6 | 9.4 | 3.7 | 2.6 | 17.7 | NaN | NaN | NaN |
The index of the DataFrame is shown as the left-most column above.
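The half-open year range and the None return value can be illustrated with boolean masks. This is only a sketch on a made-up toy table: the function name rows_between and the toy values are hypothetical, not part of the assignment.

```python
import pandas as pd

# Toy data (made up); the real file has many more rows and columns.
toy = pd.DataFrame({
    "Year": [2006, 2007, 2007, 2008],
    "Sex": ["F", "F", "M", "F"],
    "Total": [32.0, 33.0, 29.0, 34.0],
})


def rows_between(df, start, end, sex):
    """Return rows for `sex` with start <= Year < end, or None if empty."""
    mask = (df["Sex"] == sex) & (df["Year"] >= start) & (df["Year"] < end)
    result = df[mask]
    return None if result.empty else result


# Only the 2007 row for 'F' qualifies; 2008 is excluded by the open end.
subset = rows_between(toy, 2007, 2008, "F")
print(subset)
```

Note that filtering keeps the original index labels, which is why the expected output above shows indices such as 152 rather than 0.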
compare_bachelors_1980
What percentage of women and what percentage of men had earned a bachelor's degree in 1980? Call this function compare_bachelors_1980 and return the result as a DataFrame with one row for men and one row for women, with the columns "Sex" and "Total".
For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, compare_bachelors_1980(data) will return the following (order of rows does not matter):
| | Sex | Total |
|---|---|---|
| 112 | M | 24.0 |
| 180 | F | 21.0 |
The index of the DataFrame is shown as the left-most column above.
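One way to keep only certain rows and certain columns in a single step is DataFrame.loc with a boolean mask and a list of column names. The sketch below uses a made-up toy table, so the values are illustrative only.

```python
import pandas as pd

# Toy data (made up) with the same column names as the real dataset.
toy = pd.DataFrame({
    "Year": [1980, 1980, 1980, 1990],
    "Sex": ["M", "F", "A", "M"],
    "Min degree": ["bachelor's"] * 4,
    "Total": [10.0, 8.0, 9.0, 12.0],
})

# Rows: 1980, bachelor's, and only 'M' or 'F'. Columns: Sex and Total.
mask = (
    (toy["Year"] == 1980)
    & (toy["Min degree"] == "bachelor's")
    & toy["Sex"].isin(["M", "F"])
)
result = toy.loc[mask, ["Sex", "Total"]]
print(result)
```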
top_2_2000s
What were the two most common levels of educational attainment between 2000 and 2010 (inclusive)? Use the mean percent over the years to compare the education levels. Call this function top_2_2000s and return a Series with the top two values (the index should be the degree names and the values should be the percent).
For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then top_2_2000s(data) will return:
Min degree
high school 87.557143
associate's 38.757143
Name: Total, dtype: float64 <class 'pandas.core.series.Series'>
Hint: The Series class also has a method nlargest that behaves similarly to the one for the DataFrame, but does not take a column parameter (as Series objects don't have columns).
Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.
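The idea of a tolerance-based comparison can be sketched with math.isclose from the standard library. This is not the course's assert_equals, whose implementation is not shown here; it is only an analogous check with an absolute tolerance of 0.001.

```python
import math

expected = 2.6
actual = 2.599999999999998  # typical floating point rounding noise

# Exact equality fails, but the values agree within the tolerance.
print(expected == actual)
print(math.isclose(expected, actual, abs_tol=0.001))

# A difference of 0.01 is larger than the tolerance and is rejected.
print(math.isclose(expected, 2.61, abs_tol=0.001))
```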
percent_change_bachelors_2000s
What is the difference between the total percent of bachelor's degrees received in 2000 and in 2010? Take a sex parameter so the client can specify 'M', 'F', or 'A'. If a call does not specify the sex to evaluate, you should evaluate the percent change for all students (sex = 'A'). Call this function percent_change_bachelors_2000s and return the difference as a float.
For example, assuming we have parsed hw3-nces-ed-attainment.csv and stored it in a variable called data, then the call percent_change_bachelors_2000s(data) will return 2.599999999999998. Our assert_equals only checks that floating point numbers are within 0.001 of each other, so your floats do not have to match exactly.
Hint: For this problem you will need to use the squeeze() function on a Series to get a single value from a Series of length 1.
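The sketch below shows squeeze() turning a one-element Series into a scalar, together with a default argument for sex. The helper name total_for_year and the toy values are hypothetical.

```python
import pandas as pd

# Toy data (made up): one row per year for all students.
toy = pd.DataFrame({
    "Year": [2000, 2010],
    "Sex": ["A", "A"],
    "Total": [10.0, 12.5],
})


def total_for_year(df, year, sex="A"):
    """Return the Total for one year and sex as a plain float."""
    row = df[(df["Year"] == year) & (df["Sex"] == sex)]
    # row["Total"] is a Series of length 1; squeeze() extracts its value.
    return float(row["Total"].squeeze())


# sex is omitted, so the default 'A' is used for both calls.
change = total_for_year(toy, 2010) - total_for_year(toy, 2000)
print(change)
```

Without squeeze(), subtracting two one-element Series with different index labels would align on the index and produce NaN values rather than a single number.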