CSE 163, Winter 2020: Homework 3: Data Analysis

Submission

This assignment and its reflection are due by Thursday, January 30 at 11:59 pm.

You should submit your finished hw3.py and hw3-written.txt on Ed, and the reflection on Google Forms.

You may submit your assignment as many times as you want before the late cutoff (remember that submitting after the due date will cost late days). Recall that on Ed, you submit by pressing the "Mark" button. You are welcome to develop the assignment on Ed, or to develop locally and then upload to Ed before marking.

Overview

In this assignment, you will apply what you've learned so far to a more extensive "real-world" dataset using more powerful features of the Pandas library. As in HW2, this dataset is provided in CSV format. We have cleaned up the data somewhat, but you will still need to handle edge cases common to real-world datasets, including null cells that represent unknown information.

Note that there is no graded testing portion of this assignment. We still recommend writing tests to verify the correctness of the methods that you write in Part 0, but it will be difficult to write tests for Part 1 and 2. We've provided tips in those sections to help you gain confidence about the correctness of your solutions without writing formal test functions!

This assignment introduces you to several parts of the data science process: answering questions about your data, visualizing your data, and using your data to make predictions for new data. To help prepare for your final project, this assignment has been designed to be wide in scope so you can get practice with many different aspects of data analysis. While this assignment might look large because there are many parts, each individual part is relatively small.

Learning Objectives

After this homework, students will be able to:

  • Work with basic Python data structures.
  • Use Pandas as the primary tool to process structured data in Python with CSV files, exploring more powerful library features to analyze time series data.
    • Handle edge cases appropriately, including addressing missing values/data.
    • Practice user-friendly error-handling.
  • Use Seaborn to make simple plots to investigate a specific phenomenon.
    • Read plotting library documentation and use example plotting code to figure out how to create more complex Seaborn plots.
  • Train a machine learning model and use it to make a prediction about the future using the scikit-learn library.

Expectations

Here are some baseline expectations for this assignment:

Files

If you are developing on Ed, all the files are there. If you are developing locally, you should download the starter code hw3.zip and open it as the project in Visual Studio Code. The files included are:

  • hw3-nces-ed-attainment.csv: A CSV file that contains data from the National Center for Education Statistics. This is described in more detail below.
  • hw3.py: The file for you to put solutions to Part 0, Part 1, and Part 2. You are required to add a main method that parses the provided dataset and calls all of the functions you are to write for this homework.
  • hw3-written.txt: The file for you to put your answers to the questions in Part 3.
  • cse163_utils.py: Provides utility functions for this assignment. You probably don't need anything in this file, other than importing it if you have a Mac (see the comment in hw3.py).

Data

The dataset you will be processing comes from the National Center for Education Statistics. You can find the original dataset here. We have cleaned it a bit to make it easier to process in the context of this assignment. You must use our provided CSV file in this assignment.

The original dataset is titled: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018. The cleaned version you will be working with has columns for Year, Sex, Educational Attainment, and race/ethnicity categories considered in the dataset. Note that not all columns will have data starting at 1920.

Our provided hw3-nces-ed-attainment.csv looks like the following (⋮ represents omitted rows):

Year | Sex | Min degree  | Total | White | Black | Hispanic | Asian | Pacific Islander | American Indian/Alaska Native | Two or more races
1920 | A   | high school | ---   | 22.0  | 6.3   | ---      | ---   | ---              | ---                           | ---
1940 | A   | high school | 38.1  | 41.2  | 12.3  | ---      | ---   | ---              | ---                           | ---
2018 | F   | master's    | 10.7  | 12.6  | 6.2   | 3.8      | 29.9  | ---              | ---                           | ---

Column Descriptions

  • Year: The year this row represents. Note there may be more than one row for the same year to show the percent breakdowns by sex.
  • Sex: The sex of the students this row pertains to, one of "F" for female, "M" for male, or "A" for all students.
  • Min degree: The degree this row pertains to. One of "high school", "associate's", "bachelor's", or "master's".
  • Total: The total percent of students of the specified sex to reach at least the minimum level of educational attainment in this year.
  • White / Black / Hispanic / Asian / Pacific Islander / American Indian/Alaska Native / Two or more races: The percent of students of this race and the specified sex to reach at least the minimum level of educational attainment in this year.
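
The sample rows above use --- for unknown values. The following is a minimal sketch of one way those markers might be read as nulls with Pandas. It assumes the real file is comma-separated with the column names shown above (an assumption, not something this page confirms), and it reads an inline copy of the three sample rows rather than the provided file.

```python
import io

import pandas as pd

# Inline copy of the three sample rows from the Data section. The
# comma-separated layout is an assumption; in hw3.py you would pass the
# path to the provided CSV file instead of this StringIO object.
SAMPLE_CSV = '\n'.join([
    'Year,Sex,Min degree,Total,White,Black,Hispanic,Asian,'
    'Pacific Islander,American Indian/Alaska Native,Two or more races',
    '1920,A,high school,---,22.0,6.3,---,---,---,---,---',
    '1940,A,high school,38.1,41.2,12.3,---,---,---,---,---',
    "2018,F,master's,10.7,12.6,6.2,3.8,29.9,---,---,---",
])

# na_values tells pandas to treat '---' as a missing value (NaN), so the
# percentage columns are parsed as floats instead of strings.
data = pd.read_csv(io.StringIO(SAMPLE_CSV), na_values=['---'])

# Count the missing cells in each column.
missing_per_column = data.isna().sum()

# Aggregations such as mean() skip NaN by default, so this averages only
# the two known Total values.
mean_total = data['Total'].mean()
```

Reading the markers as NaN at load time means later filters and aggregations do not need to special-case the string '---'.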

Interactive Development

When using data science libraries like pandas, seaborn, or scikit-learn, it's extremely helpful to actually interact with the tools you're using so you can get a better idea of the shape of your data. The preferred practice in industry is to use a Jupyter Notebook, as we have been doing in lecture, to play around with the dataset and figure out how to answer the questions you want to answer. This is incredibly helpful when you're first learning a tool, as you can experiment and get real-time feedback on whether the code you wrote does what you want.

We recommend that you try figuring out how to solve these problems in a Jupyter Notebook so you can actually interact with the data. We have made a playground Jupyter Notebook for you to use that already has the data loaded. Remember that playground notebooks on Colaboratory are temporary unless you save them to your Google Drive! If you want to save your work on the notebook, make sure you explicitly press the save button and follow the instructions to copy!
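
As a rough illustration of the kind of exploration this enables, here is a small sketch you might run cell by cell in a notebook. The DataFrame is a hand-built stand-in that uses a few values from the sample rows in the Data section, not the real dataset, and the variable names are only for illustration.

```python
import pandas as pd

# A tiny hand-built stand-in for the real dataset. NaN marks an unknown
# cell, as a null would in the loaded data.
data = pd.DataFrame({
    'Year': [1920, 1940, 2018],
    'Sex': ['A', 'A', 'F'],
    'Min degree': ['high school', 'high school', "master's"],
    'Total': [float('nan'), 38.1, 10.7],
    'White': [22.0, 41.2, 12.6],
})

# How many rows and columns are there?
print(data.shape)

# Which values does a categorical column take?
print(data['Min degree'].unique())

# Try a filter and look at the result before building on it.
is_high_school = data['Min degree'] == 'high school'
high_school_all = data[is_high_school & (data['Sex'] == 'A')]
print(high_school_all[['Year', 'Total', 'White']])
```

Checking intermediate results like this on a few rows you can verify by eye is one way to gain confidence in a solution before moving it into hw3.py.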

Evaluation

Your submission will be evaluated on the following dimensions:

  • Your solution correctly implements the described behaviors. You will have access to some tests when you turn in your assignment, but we will withhold other tests to test your solution when grading. All behavior we test is completely described by the problem specification or shown in an example.
  • No method should modify its input parameters.
  • Your solution file hw3.py uses the main method structure we've shown on previous assignments.
    • Your main method in hw3.py must call every one of the methods you implemented in this assignment. There are no requirements on the format of the output, besides that it should save the files for Part 1 with the proper names specified in Part 1.
  • We can run your hw3.py without it crashing or causing any errors.
  • Your code meets our style requirements:
    • All code files submitted pass flake8.
    • Your program should be written with good programming style. This means you should use the proper naming convention for methods and variables (snake_case), and your code should avoid redundancy and unnecessary computations.
    • Every function written is commented using a doc-string format that describes its behavior, parameters, returns, and highlights any special cases.
    • There is a comment at the top of each code file you write with your name, section, and a brief description of what that program does.
    • All expectations on this page and the sub-pages for the assignment are met, along with all requirements for each of the problems.
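
To make two of the bullets above concrete (not modifying input parameters, and doc-string comments), here is a minimal sketch. The function add_decade and its 'Decade' column are hypothetical and are not part of this assignment; they only illustrate the pattern.

```python
import pandas as pd


def add_decade(data):
    """
    Takes a DataFrame with an integer 'Year' column and returns a new
    DataFrame with an added 'Decade' column (for example, 1940 for the
    year 1947). The DataFrame passed in is left unchanged.
    """
    # Work on a copy so the caller's DataFrame is not modified.
    result = data.copy()
    result['Decade'] = result['Year'] // 10 * 10
    return result


# The original keeps its single column; only the returned copy gains the
# new one.
original = pd.DataFrame({'Year': [1920, 1947, 2018]})
with_decade = add_decade(original)
```

Assigning a new column directly on the parameter would change the caller's DataFrame, which is the behavior the bullet above rules out.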

Make sure you carefully read the bullets above as they may or may not change from assignment to assignment!