A3 - Education - CSE 163

Useful CSE 163 Resources¶

Learning objective: Apply pandas, seaborn, and scikit-learn to process, visualize, and predict outcomes about data.

nces-ed-attainment.csv is a CSV file that contains the education dataset for this assessment.
hw3.py is the file for you to put your implementations. The Run button executes this program and cse163_imgd.py.
hw3-writeup.md is the file for your writeup. Instead of testing, this assessment emphasizes reflection on our data analysis.
cse163_imgd.py is a helper file that checks your plot outputs against expected output, and creates an image showing any pixel differences. The Run button executes this program and your hw3.py.
expected is a folder containing the expected output for line_plot_bachelors and bar_chart_high_school. Don’t modify the contents of this folder.
ScikitLearnWordbank.ipynb is a Jupyter Notebook that reviews all of the scikit-learn features needed for this assessment. Feel free to edit this Jupyter Notebook to prototype ideas and explore the data.

Info

The Run button works differently in this assessment than previous ones since you do not need to write your own tests. Once you’ve implemented plotting functions in hw3.py with calls to plt.savefig(), you’ll see that Run generates some images showing the pixel differences between your plot and the expected plot highlighted in red. If the image is blank, then all the pixels match. A summary of the percentage of pixels that match will appear in the console.

Context¶

The National Center for Education Statistics is a U.S. federal government agency for collecting and analyzing data related to education. We have downloaded and cleaned one of their datasets: Percentage of persons 25 to 29 years old with selected levels of educational attainment, by race/ethnicity and sex: Selected years, 1920 through 2018. The nces-ed-attainment.csv file has columns for Year, Sex, Min degree, and race/ethnicity categories. Note the missing data: not all columns have data starting from 1920!

Year	Sex	Min degree	Total	White	Black	Hispanic	Asian	Pacific Islander	American Indian/Alaska Native	Two or more races
1920	A	high school	---	22.0	6.3	---	---	---	---	---
1940	A	high school	38.1	41.2	12.3	---	---	---	---	---
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
2018	F	master's	10.7	12.6	6.2	3.8	29.9	---	---	---

Year is the year for the row. There may be more than one row for the same year to show the breakdowns by sex.
Sex is the sex of the people for the row: F for female, M for male, or A for all students.
Min degree is the degree of educational attainment for the row: high school, associate's, bachelor's, or master's.
Total is the overall percentage of people of the Sex in the Year with at least the Min degree of educational attainment.
White, Black, Hispanic, Asian, Pacific Islander, American Indian/Alaska Native, Two or more races are the percentages of students of the specified race (and of the Sex in the Year) with at least the Min degree of educational attainment.

Our main method provides a line of code to read nces-ed-attainment.csv and replaces all occurrences of the str --- with pandas NaN to help with later data processing steps.
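As a rough illustration of what that provided line does, the sketch below reads a small in-memory excerpt rather than the real file. The `SAMPLE_CSV` text and the variable names are made up for this example; only the `na_values='---'` idea comes from the description above.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt laid out like nces-ed-attainment.csv
SAMPLE_CSV = """Year,Sex,Min degree,Total,White,Black
1920,A,high school,---,22.0,6.3
1940,A,high school,38.1,41.2,12.3
"""

# na_values tells read_csv to treat the string '---' as missing (NaN)
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), na_values='---')

# True where Total was '---' in the text, False where it held a number
missing_total = sample['Total'].isna().tolist()
```

Because `---` becomes NaN at load time, the Total column is numeric from the start, so later steps such as `mean()` or `dropna()` work without extra conversion.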

Info

Remember to use the ScikitLearnWordbank.ipynb Jupyter Notebook to explore the data and work in a more interactive environment. You might even prefer writing all of your code in the notebook and periodically transferring functions over to hw3.py to run the automated checks.

Warning

Do not use any loops, list comprehensions, or dictionary comprehensions. The goal of this assessment is to apply data science libraries to answer questions.

Warning

Be sure to call all of the functions you write inside your main method!

Pandas: `compare_bachelors_1980`¶

What were the percentages for women vs. men having earned a Bachelor’s Degree in 1980?

Task: Write a function compare_bachelors_1980 that takes the data and computes the percentages of men and women who achieved a minimum degree of a Bachelor’s degree in 1980. It should return a 2-by-2 DataFrame with rows corresponding to men and women and columns corresponding to Sex and Total. The order of the rows doesn’t matter. For example, your result should look something like this (where … is a placeholder for the correct number):

Sex	Total
M	...
F	...
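One loop-free way to get a result of this shape is boolean-mask filtering followed by column selection. The sketch below uses a made-up `toy` DataFrame with invented numbers, so it shows only the general technique, not the expected answer.

```python
import pandas as pd

# Made-up values for illustration; not taken from the real dataset
toy = pd.DataFrame({
    'Year': [1980, 1980, 1980, 1990],
    'Sex': ['A', 'M', 'F', 'M'],
    'Total': [10.0, 11.0, 9.0, 12.0],
})

# Combine conditions with & (each wrapped in parentheses) to build one mask
mask = (toy['Year'] == 1980) & (toy['Sex'] != 'A')

# .loc applies the row mask and selects the two columns in a single step
result = toy.loc[mask, ['Sex', 'Total']]
```

The real data also has a Min degree column, so an actual solution would need one more condition in the mask.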

Pandas: `top_2_2000s`¶

What were the 2 most commonly earned levels of educational attainment between 2000–2010 (inclusive) for a given sex?

Task: Write a function top_2_2000s(data, sex) that takes two arguments, the data and a sex parameter, and computes the two most commonly earned degrees for that given sex between the years 2000 and 2010 (inclusive). sex should have a default value of 'A' if it is not specified. The function should return a 2-element Series. Compare educational attainment levels by using the mean. The index of the returned Series should be the Min degree and the values should be its mean.

For example, top_2_2000s(data, 'A') will return the following Series with index on the left and value on the right. Your values don’t have to exactly match so long as they’re within a 0.001 tolerance due to how Python represents float numbers. Because sex is set to default to 'A', this means a call to top_2_2000s(data) should also return the same Series.

high school 87.557143
associate's 38.757143

Hint

Series.nlargest works like DataFrame.nlargest but does not take a column parameter (Series objects don’t have columns).
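To make the hint concrete, here is a small sketch of a grouped mean followed by `Series.nlargest`. The `toy` data and its numbers are invented for illustration.

```python
import pandas as pd

# Invented values; two rows per degree so the mean is easy to check by hand
toy = pd.DataFrame({
    'Min degree': ['high school', 'high school',
                   "associate's", "associate's",
                   "bachelor's", "bachelor's"],
    'Total': [80.0, 90.0, 30.0, 40.0, 20.0, 30.0],
})

# groupby + mean yields a Series indexed by Min degree
means = toy.groupby('Min degree')['Total'].mean()

# nlargest on a Series takes only n; there is no column argument
top_2 = means.nlargest(2)
```

The result keeps the Min degree labels as its index, sorted from largest mean to smallest.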

Seaborn: `line_plot_bachelors`¶

Warning

For line_plot_bachelors and bar_chart_high_school, be sure to use the respective generic seaborn functions. Instead of barplot(), you should use catplot(), and instead of lineplot(), you should use relplot().

Task: Write a function line_plot_bachelors that takes the data and plots a line chart of the total percentages of all people Sex A with bachelor's Min degree over time. Label the x-axis Year, the y-axis Percentage, and title the plot Percentage Earning Bachelor’s over Time.

Percentage Earning Bachelor's over Time

Save the plot as line_plot_bachelors.png with parameter bbox_inches='tight'.

plt.savefig('line_plot_bachelors.png', bbox_inches='tight')

Tip

For this assignment, you do not need to save the files with the /home/ prefix. We give you the line of code you should use to save the output above.

Warning

Be careful copying axis and title labels from the spec. The apostrophe character in Bachelor’s is not the same when copied versus when it is typed.

Seaborn: `bar_chart_high_school`¶

Task: Write a function bar_chart_high_school that takes the data and plots a bar chart comparing the total percentages of Sex F, M, and A with high school Min degree in the Year 2009. Label the x-axis Sex, the y-axis Percentage, and title the plot Percentage Completed High School by Sex.

Percentage Completed High School by Sex

Tip

Is this visualization effective? You will be asked to consider this in the last section of the assessment.

Save the plot as bar_chart_high_school.png with parameter bbox_inches='tight'.

plt.savefig('bar_chart_high_school.png', bbox_inches='tight')

Seaborn: `plot_hispanic_min_degree`¶

Task: Write a function plot_hispanic_min_degree that takes the data and plots how the percentage of all Hispanic students with degrees has changed between 1990–2010 (inclusive) for high school and bachelor's Min degree. Choose a plot type for this problem and prepare to explain your decision-making process in the writeup. Label the axes and title the plot appropriately.

Save the plot as plot_hispanic_min_degree.png with parameter bbox_inches='tight'.

plt.savefig('plot_hispanic_min_degree.png', bbox_inches='tight')

Hint

Remember that your plot should be readable. You might find the function plt.xticks() helpful!

Scikit-learn: `fit_and_predict_degrees`¶

Task: Train a DecisionTreeRegressor to predict the percentage of degrees attained for a given Sex, Min degree, and Year. Write a function fit_and_predict_degrees that takes the data and returns the test mean squared error as a float. Follow these specific data preprocessing and model training steps.

Preprocessing: Filter the DataFrame to only include the columns for Year, Min degree, Sex, and Total. Drop rows that have missing data in these columns—do not drop any additional rows. One-hot encode str values. Split the columns as needed into input features and labels.

Model training: Once the data has been preprocessed, randomly split the remaining data into 80% training and 20% testing. Fit the model to the training set. Finally, calculate the mean squared error of the model’s test set predictions.
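The preprocessing and training steps above can be sketched on a tiny made-up table. Everything here (the `toy` data, the `random_state` values, and the variable names) is an assumption for illustration; the real function works on the full dataset, which also includes a Min degree column.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Invented values; White is entirely missing to show why column order matters
toy = pd.DataFrame({
    'Year': [2000, 2000, 2005, 2005, 2010, 2010, 2015, 2015, 2018, 2018],
    'Sex': ['F', 'M'] * 5,
    'Total': [30.0, 28.0, 33.0, float('nan'), 36.0,
              31.0, 38.0, 33.0, 40.0, 34.0],
    'White': [float('nan')] * 10,
})

# Select the relevant columns first, then drop rows with missing values,
# so NaN in unrelated columns (like White) does not remove extra rows
subset = toy[['Year', 'Sex', 'Total']].dropna()

# One-hot encode the str column; the numeric Year column passes through
features = pd.get_dummies(subset.drop(columns='Total'))
labels = subset['Total']

# Random 80/20 split (random_state is fixed only to make the sketch repeatable)
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, Y_train)

# Mean squared error on the held-out test rows, as a plain float
test_mse = float(mean_squared_error(Y_test, model.predict(X_test)))
```

Calling `dropna()` before selecting columns would have removed every row here, because White is missing throughout; that is the situation the "do not drop any additional rows" instruction guards against.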

The automated tests only check that the function runs without causing an error. Try comparing the ground truth values and the model’s predictions manually to check for similarity. (It would be bad if the ground truth values were [2, 755, ...] but the predicted values were [1022, 5, ...].) Also, you can try calculating the mean squared error on the training dataset as well as the testing dataset. The error should be lower on the training dataset than on the testing dataset, but make sure that your function returns only the test error and does not do any unnecessary printing.

Info

This education dataset is a time series dataset. It is a bit outside the scope of our class, but we wanted to point out that a random train/test split like the one we ask you to do here is technically inappropriate for time series data. Instead, it is common to use the last $k$ rows as the test set rather than random sampling, since the goal is to design a model that predicts the future. By randomly sampling to generate the test set, some of the test examples will be earlier than some training examples, which makes our test set different from how we would deploy the model in real life! We might then not trust our test error as an accurate estimate of future performance. Even though random sampling is not necessarily appropriate here, we ask you to do it anyway because it’s the most common sampling method for other datasets.
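For the curious, the last-$k$-rows idea could look roughly like the sketch below. This is not part of the assessment; the `toy` data and the choice of `k = 2` are made up.

```python
import pandas as pd

# Invented values, deliberately out of chronological order
toy = pd.DataFrame({
    'Year': [2010, 2000, 2018, 2005, 2015],
    'Total': [36.0, 30.0, 40.0, 33.0, 38.0],
})

k = 2  # hypothetical number of most recent rows to hold out

# Sort chronologically, then slice: everything but the last k rows trains,
# the last k rows test, so every test year comes after every training year
ordered = toy.sort_values('Year')
train = ordered.iloc[:-k]
test = ordered.iloc[-k:]
```

With this split, the model is always evaluated on years later than anything it was trained on, which mirrors how it would be used to predict the future.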

Info

As we learned in class, rather than follow the standard snake_case convention for naming dataset variables, you may name your variables according to the machine learning convention for X_train, X_test, Y_train, and Y_test. Make sure all other variable names are still following the correct [naming conventions](https://courses.cs.washington.edu/courses/cse163/23su/code_quality/#variable-names)!

`main`¶

Write a main method in hw3.py that loads in the dataset provided and calls all of the functions you wrote. For all of the method calls, you should rely on any default parameters we specified.

Writeup¶

Task: In hw3-writeup.md, apply critical thinking to address the following questions about data visualization, data ethics and justice, and our data analysis methods. You could spend an entire course talking about any of these topics, but we’re just looking for 2 to 4 sentences on each question.

Info

md is the file extension for Markdown, the text formatting language used in Jupyter Notebooks. Markdown offers a natural-looking way to define headings, lists, and links using special characters like #. But you don’t actually need to learn anything to start writing Markdown—you can just write plaintext under each heading in hw3-writeup.md.

Do you think the bar chart for bar_chart_high_school is an effective data visualization?
How and why did you choose the plot for plot_hispanic_min_degree?
Datasets can be biased. Bias in data means it might be skewed away from or portray a wrong picture of reality. For example, the data might contain inaccuracies or the methods used to collect the data may have been flawed. Describe a possible bias present in this dataset and why it might have occurred.
Intentions don’t equate to outcomes: many people intending to do good have caused very real harm in the world. We’ll discuss specific examples of well-intentioned algorithms perpetuating more harm later in the quarter. In computing, that harm is magnified, automated, and reproduced by computers.

Describe an application, analysis, or decision motivated by this dataset with the intended goal of improving educational equity but that ultimately exacerbates social injustice. How can this data analysis lead to further injustice even when designing with equity in mind? In other words, we are trying to think of a way that we could use this data with good intentions, but actually would end up causing more harm.

Info

In addition to writing your response in hw3-writeup.md, please feel free to continue the conversation in the Education Discussion thread. We wanted to open a space so the class can discuss their thoughts about the biases present here.

For more on data justice, see Anna Lauren Hoffman’s Data Ethics course introduction and Data Feminism.

Quality¶

Assessment submissions should pass these checks: flake8 and code quality guidelines. The code quality guidelines are very thorough. For this assessment, the most relevant rules can be found in these sections (new sections bolded):

Naming
Documentation
Global Variables
- Naming Constants
Efficiency and Redundancy
- Boolean Zen
- Loop Zen
- Factoring
- Unnecessary Cases
- Avoid Looping with Pandas
Type Annotations

Submission¶

Submit your work by pressing the Mark button. Submit as often as you want until the deadline for the initial submission. Note that we will only grade your most recent submission. You can view your past submissions using the “Submissions” button.

Please make sure you are familiar with the resources and policies outlined in the syllabus and the take-home assessments page.

THA 3 - Education

Initial Submission by Thursday 07/20 at 11:59 pm.
