CSE 163, Spring 2019: Homework 3: Data Analysis

Submission

This assignment and its reflection are due by Thursday, April 25 at 11:59 pm.

You should submit your finished hw3.py, hw3-written.txt, line_plot_bachelors.png, bar_chart_high_school.png, and plot_hispanic_min_degree.png on Gradescope, and the reflection on Google Forms.

Overview

In this assignment, you will apply what you've learned so far to a more extensive "real-world" dataset using more powerful features of the Pandas library. As in HW2, this dataset is provided in CSV format. We have cleaned up the data somewhat, but you will need to handle more edge cases common to real-world datasets, including null cells that represent unknown information.

Note that there is no graded testing portion of this assignment. We still recommend writing tests to verify the correctness of the methods that you write in Part 0, but it will be difficult to write tests for Parts 1 and 2. We've provided tips in those sections to help you gain confidence about the correctness of your solutions without writing formal test functions!
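As a rough sketch of what a Part 0-style test could look like: build a tiny DataFrame by hand, small enough that you can work out the expected answer yourself, and compare it against what your function returns. The function mean_total below is a hypothetical stand-in, not one of the assigned functions, and the numbers in the tiny table are made up for the test.

```python
import math

import pandas as pd


def mean_total(data, sex):
    """
    Hypothetical helper (not part of the assignment): returns the mean
    of the Total column over rows with the given Sex value, or None if
    no rows match.
    """
    rows = data[data['Sex'] == sex]
    if len(rows) == 0:
        return None
    return rows['Total'].mean()  # mean() skips missing values by default


def test_mean_total():
    """
    Checks mean_total against answers worked out by hand.
    """
    # Made-up values, chosen so the expected results are easy to verify.
    tiny = pd.DataFrame({
        'Sex': ['A', 'A', 'F'],
        'Total': [10.0, 20.0, 40.0],
    })
    # Compare floats with isclose rather than == to tolerate rounding.
    assert math.isclose(mean_total(tiny, 'A'), 15.0)
    assert math.isclose(mean_total(tiny, 'F'), 40.0)
    assert mean_total(tiny, 'M') is None


test_mean_total()
```

Including a case with no matching rows (the 'M' check here) is a cheap way to exercise an edge case alongside the ordinary ones.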

This assignment introduces you to several parts of the data science process: answering questions about your data, visualizing your data, and using your data to make predictions for new data. To help prepare for your final project, this assignment has been designed to be wide in scope so you can get practice with many different aspects of data analysis. While this assignment might look large because there are many parts, each individual part is relatively small.

Learning Objectives

After this homework, students will be able to:

  • Work with basic Python data structures.
  • Use Pandas as the primary tool to process structured data in Python with CSV files, exploring more powerful library features to analyze time series data.
    • Handle edge cases appropriately, including addressing missing values/data.
    • Practice user-friendly error-handling.
  • Use Seaborn to make simple plots to investigate a specific phenomenon.
    • Read plotting library documentation and use example plotting code to figure out how to create more complex Seaborn plots.
  • Train a machine learning model and use it to make a prediction about the future using the scikit-learn library.

Expectations

Here are some baseline expectations for your submission:

Files

You should download the starter code hw3.zip and open it as the project in Visual Studio Code. The files included are:

  • hw3-nces-ed-attainment.csv: A CSV file that contains data from the National Center for Education Statistics. This is described in more detail below.
  • hw3.py: The file for you to put solutions to Part 0, Part 1, and Part 2. You should also add a main method from which you can call these methods.
  • hw3-written.txt: The file for you to put your answers to the questions in Part 3.
  • cse163_utils.py: Provides utility functions for this assignment. You probably won't need anything in this file, other than importing it if you have a Mac (see the comment in hw3.py).

Data

The dataset you will be processing comes from the National Center for Education Statistics. You can find the original dataset here. We have cleaned it a bit to make it easier to process in the context of this assignment. You must use our provided CSV file in this assignment.

The original dataset is titled: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018. The cleaned version you will be working with has columns for Year, Sex, Educational Attainment, and race/ethnicity categories considered in the dataset. Note that not all columns will have data starting at 1920.

Our provided hw3-nces-ed-attainment.csv looks like this (... represents omitted rows):

Year | Sex | Min degree  | Total | White | Black | Hispanic | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races
1920 | A   | high school | ---   | 22.0  | 6.3   | ---      | ---   | ---              | ---                           | ---
1940 | A   | high school | 38.1  | 41.2  | 12.3  | ---      | ---   | ---              | ---                           | ---
...  | ... | ...         | ...   | ...   | ...   | ...      | ...   | ...              | ...                           | ...
2018 | F   | master's    | 10.7  | 12.6  | 6.2   | 3.8      | 29.9  | ---              | ---                           | ---
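The --- cells in the excerpt are the kind of null cell mentioned in the overview. One way to deal with them at load time is sketched below. It assumes the file is comma-separated and stores the literal text --- for unknown values; check the provided file and the comments in hw3.py before relying on either assumption. The helper name load_attainment is made up for this example, and the sample text only contains the three rows shown in the excerpt.

```python
import io

import pandas as pd

# Three rows copied from the excerpt; the real file has many more.
# Assumption: comma-separated, with '---' marking unknown values.
SAMPLE_CSV = (
    "Year,Sex,Min degree,Total,White,Black,Hispanic,Asian,"
    "Pacific Islander,American Indian/Alaska Native,Two or more races\n"
    "1920,A,high school,---,22.0,6.3,---,---,---,---,---\n"
    "1940,A,high school,38.1,41.2,12.3,---,---,---,---,---\n"
    "2018,F,master's,10.7,12.6,6.2,3.8,29.9,---,---,---\n"
)


def load_attainment(source):
    """
    Hypothetical helper: reads the attainment data from a file path or
    file-like object, treating '---' cells as missing values (NaN).
    """
    return pd.read_csv(source, na_values=['---'])


# io.StringIO lets read_csv treat the string like an open file.
data = load_attainment(io.StringIO(SAMPLE_CSV))
print(data[['Year', 'Sex', 'Min degree', 'Total', 'White']])
```

Without na_values, pandas would read a column containing --- as text rather than numbers, so arithmetic on that column would fail or behave unexpectedly.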

Column Descriptions

  • Year: The year this row represents. Note there may be more than one row for the same year to show the percent breakdowns by sex.
  • Sex: The sex of the students this row pertains to, one of "F" for female, "M" for male, or "A" for all students.
  • Min degree: The degree this row pertains to. One of "high school", "associate's", "bachelor's", or "master's".
  • Total: The total percent of students of the specified sex to reach at least the minimum level of educational attainment in this year.
  • White / Black / Hispanic / Asian / Pacific Islander / American Indian or Alaska Native / Two or more races: The percent of students of this race and the specified sex to reach at least the minimum level of educational attainment in this year.
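To illustrate how these columns fit together, here is a minimal sketch that selects rows by Sex and Min degree and then sets aside rows whose Total is unknown. The DataFrame is built by hand from the three rows in the excerpt above (with NaN standing in for ---) and leaves out most of the race/ethnicity columns, so it is only a stand-in for the real file.

```python
import numpy as np
import pandas as pd

# Rows taken from the excerpt of the provided file; NaN marks unknown
# cells. Only a few of the columns are included to keep this short.
data = pd.DataFrame({
    'Year': [1920, 1940, 2018],
    'Sex': ['A', 'A', 'F'],
    'Min degree': ['high school', 'high school', "master's"],
    'Total': [np.nan, 38.1, 10.7],
    'White': [22.0, 41.2, 12.6],
    'Black': [6.3, 12.3, 6.2],
})

# Build one boolean mask per condition, then combine them with &.
is_all_students = data['Sex'] == 'A'
is_high_school = data['Min degree'] == 'high school'
high_school_all = data[is_all_students & is_high_school]

# Drop rows whose Total is unknown before using that column.
known_total = high_school_all.dropna(subset=['Total'])
print(known_total[['Year', 'Total']])
```

Whether dropping rows is the right way to treat missing values depends on the question being asked; other problems may call for keeping those rows and ignoring only the empty cells.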

Interactive Development

When using data science libraries like pandas, seaborn, or scikit-learn, it's extremely helpful to interact with the tools you're using so you get a better idea of the shape of your data. The preferred practice in industry is to use a Jupyter Notebook, as we have been doing in lecture, to play around with the dataset and figure out how to answer the questions you want to answer. This is incredibly helpful when you're first learning a tool, since you can experiment and get real-time feedback on whether the code you wrote does what you want.

We recommend that you try solving these problems in a Jupyter Notebook so you can interact with the data. We have made a playground Jupyter Notebook for you to use that already has the data loaded. Remember that playground notebooks on Colaboratory are temporary unless you save them to your Google Drive! If you want to save your work in the notebook, make sure you explicitly press the save button and follow the instructions to copy it!
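For a sense of what "interacting with the data" can mean in practice, the calls below are a few you might run one at a time in a notebook cell. This is only a sketch: the DataFrame is a hand-built stand-in with three rows and a subset of the columns from the excerpt on this page, whereas the playground notebook already has the full dataset loaded.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded dataset: three rows from the excerpt of the
# provided file, with NaN where the file has no data.
data = pd.DataFrame({
    'Year': [1920, 1940, 2018],
    'Sex': ['A', 'A', 'F'],
    'Min degree': ['high school', 'high school', "master's"],
    'Total': [np.nan, 38.1, 10.7],
    'Hispanic': [np.nan, np.nan, 3.8],
    'Asian': [np.nan, np.nan, 29.9],
})

print(data.shape)                   # (number of rows, number of columns)
print(data.dtypes)                  # type pandas inferred for each column
print(data['Min degree'].unique())  # distinct values in one column

missing_per_column = data.isna().sum()
print(missing_per_column)           # count of missing cells per column
```

Checking the missing-value counts early is a quick way to see which columns need special handling before you start writing your solution.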

Evaluation

Your submission will be evaluated on the following dimensions:

  • Your solution correctly implements the described behaviors. You will have access to some tests when you turn in your assignment, but we will withhold other tests to test your solution when grading. All behavior we test is completely described by the problem specification or shown in an example.
  • Your solution file hw3.py uses the main method structure we've shown on previous assignments.
  • Your code meets our style requirements:
    • All code files submitted pass flake8.
    • Every function written is commented using a doc-string format that describes its behavior, parameters, returns, and highlights any special cases.
    • There is a comment at the top of each code file you write with your name, section, and a brief description of what that program does.
    • All expectations on this page and the sub-pages for the assignment are met, along with all requirements for each of the problems.
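As a rough illustration of how the structural and style points above can fit together, here is one possible skeleton with a header comment, doc-string comments, and a main method. It is a sketch under assumptions, not the starter code: the helper load_data, the choice to treat --- as the missing-value marker, and the file-not-found message are all made up for this example, so follow the comments in the provided hw3.py where they differ.

```python
"""
Your Name
Your Section
Hypothetical skeleton for hw3.py: loads the education attainment
dataset and calls the functions written for each part from main.
"""
import pandas as pd

DATA_FILE = 'hw3-nces-ed-attainment.csv'


def load_data(path):
    """
    Returns a DataFrame read from the CSV file at the given path, with
    '---' cells treated as missing values (an assumption about the
    file). If the file does not exist, prints a short message and
    returns None instead of crashing.
    """
    try:
        return pd.read_csv(path, na_values=['---'])
    except FileNotFoundError:
        print('Could not find', path)
        return None


def main():
    """
    Loads the dataset and runs each part of the assignment.
    """
    data = load_data(DATA_FILE)
    if data is None:
        return
    # Calls to the Part 0, Part 1, and Part 2 functions would go here.
    print(data.shape)


if __name__ == '__main__':
    main()
```

Keeping the file loading inside main, rather than at the top level of the file, means that importing hw3.py (for example from a test file) does not run the whole analysis.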