This assignment and its reflection are due by Thursday, April 25 at 11:59 pm.
You should submit your finished hw3.py, hw3-written.txt, line_plot_bachelors.png, bar_chart_high_school.png, and plot_hispanic_min_degree.png on Gradescope and the reflection on Google Forms.
In this assignment, you will apply what you've learned so far to a more extensive "real-world" dataset using more powerful features of the Pandas library. As in HW2, this dataset is provided in CSV format. We have cleaned up the data somewhat, but you will need to handle more edge cases common to real-world datasets, including null cells that represent unknown information.
Note that there is no graded testing portion of this assignment. We still recommend writing tests to verify the correctness of the methods that you write in Part 0, but it will be difficult to write tests for Part 1 and 2. We've provided tips in those sections to help you gain confidence about the correctness of your solutions without writing formal test functions!
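As a rough illustration of the kind of informal test recommended above, here is a minimal sketch. The function `mean_total_for_sex` is a hypothetical stand-in and not one of the functions the assignment asks for, and the numbers in the tiny DataFrame are invented so the expected answer can be worked out by hand. It uses plain `assert` statements rather than any course-provided helper.

```python
import math

import pandas as pd


def mean_total_for_sex(data, sex):
    """Hypothetical helper: mean of the Total column for one Sex value.

    Rows whose Total is missing (NaN) are skipped by pandas' mean().
    """
    rows = data[data['Sex'] == sex]
    return rows['Total'].mean()


def test_mean_total_for_sex():
    # Tiny hand-made frame (values invented for the test, not taken from
    # the real dataset) so the expected answer can be computed by hand.
    data = pd.DataFrame({
        'Year': [2000, 2001, 2002, 2000],
        'Sex': ['A', 'A', 'A', 'F'],
        'Total': [10.0, float('nan'), 20.0, 50.0],
    })
    # (10.0 + 20.0) / 2 == 15.0; the NaN row is ignored.
    assert math.isclose(mean_total_for_sex(data, 'A'), 15.0)
    assert math.isclose(mean_total_for_sex(data, 'F'), 50.0)


def main():
    test_mean_total_for_sex()


if __name__ == '__main__':
    main()
```

Keeping the test data small, and including at least one missing value, makes it easier to check by hand that null cells are handled the way you intend.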
This assignment introduces you to several parts of the data science process: answering questions about your data, visualizing your data, and using your data to make predictions for new data. To help prepare for your final project, this assignment has been designed to be wide in scope so you can get practice with many different aspects of data analysis. While this assignment might look large because there are many parts, each individual part is relatively small.
After this homework, students will be able to:
Here are some baseline expectations for you to meet:
Follow the course collaboration policies
You should download the starter code hw3.zip and open it as the project in Visual Studio Code. The files included are:
hw3-nces-ed-attainment.csv: A CSV file that contains data from the National Center for Education Statistics. This is described in more detail below.
hw3.py: The file for you to put solutions to Part 0, Part 1, and Part 2. You should also add a main method from which you can call these methods.
hw3-written.txt: The file for you to put your answers to the questions in Part 3.
cse163_utils.py: Provides utility functions for this assignment. You probably don't need to use anything inside this file except importing it if you have a Mac (see the comment in hw3.py).
The dataset you will be processing comes from the National Center for Education Statistics. You can find the original dataset here. We have cleaned it a bit to make it easier to process in the context of this assignment. You must use our provided CSV file in this assignment.
The original dataset is titled: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018. The cleaned version you will be working with has columns for Year, Sex, Educational Attainment, and race/ethnicity categories considered in the dataset. Note that not all columns will have data starting at 1920.
Our provided hw3-nces-ed-attainment.csv looks like the following (... represents omitted rows):
Year | Sex | Min degree | Total | White | Black | Hispanic | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races |
---|---|---|---|---|---|---|---|---|---|---|
1920 | A | high school | --- | 22.0 | 6.3 | --- | --- | --- | --- | --- |
1940 | A | high school | 38.1 | 41.2 | 12.3 | --- | --- | --- | --- | --- |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018 | F | master's | 10.7 | 12.6 | 6.2 | 3.8 | 29.9 | --- | --- | --- |
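The `---` cells in the table stand for unknown values. One way to handle them when loading the data is sketched below, under the assumption that missing cells appear in the CSV as the literal text `---`. The sketch reads three rows copied from the table, with only a subset of the columns, from an in-memory string so that it runs on its own. The variable names are illustrative.

```python
import io

import pandas as pd

# Three rows copied from the table above, written the way they might
# appear in the CSV (assumption: missing cells are the literal text '---').
# Only a subset of the columns is included to keep the sample short.
SAMPLE_CSV = """Year,Sex,Min degree,Total,White,Black,Hispanic
1920,A,high school,---,22.0,6.3,---
1940,A,high school,38.1,41.2,12.3,---
2018,F,master's,10.7,12.6,6.2,3.8
"""

# na_values tells pandas to parse '---' as NaN, so the numeric columns
# are read as floats instead of strings.
data = pd.read_csv(io.StringIO(SAMPLE_CSV), na_values=['---'])

# Count the missing cells in each column.
missing_per_column = data.isna().sum()

# Earliest year in this sample with a recorded Hispanic value.
first_hispanic_year = data.loc[data['Hispanic'].notna(), 'Year'].min()
```

With the real file, the path to hw3-nces-ed-attainment.csv would be passed to `pd.read_csv` in place of the `io.StringIO` object. Checking `data.isna().sum()` right after loading is a quick way to confirm that the unknown cells were recognized as missing rather than read as text.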
Part 4a: Submit Assignment and Part 4b: Complete Reflection. On Gradescope, you should submit:
hw3.py
hw3-written.txt
line_plot_bachelors.png
bar_chart_high_school.png
plot_hispanic_min_degree.png
Your submission will be evaluated on the following dimensions:
hw3.py uses the main method structure we've shown on previous assignments.
flake8