CSE 163, Winter 2020: Homework 3: Part 2

Machine Learning using scikit-learn

Now you will be making a simple machine learning model for the provided education data using scikit-learn. Complete this in a function called fit_and_predict_degrees that takes the data as a parameter and returns the test mean squared error as a float. This may sound like a lot, so we've broken it down into four steps for you:

  1. Filter the DataFrame to only include the columns for year, degree type, sex, and total. Drop any rows that have missing data for these columns and then convert string values to their dummy encoding. Split the columns as needed into input values and target values.
  2. Split the dataset into 80% for training and 20% for testing.
  3. Train a decision tree regressor model to take in year, degree type, and sex to predict the percent of individuals of the specified sex to achieve that degree type in the specified year.
  4. Use your model to predict on the test set. Measure how well your predictions match the true values using the mean squared error on the test dataset.
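
The four steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the reference solution: the function name fit_and_predict_sketch, the lowercase column names, and the toy DataFrame are all hypothetical and only stand in for the provided education data, whose column names may differ.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_predict_sketch(data, feature_cols, target_col):
    """Hypothetical helper: returns the test mean squared error as a float."""
    # Step 1: keep only the needed columns and drop rows with missing values.
    subset = data[feature_cols + [target_col]].dropna()
    # Dummy-encode the string columns; numeric columns pass through unchanged.
    features = pd.get_dummies(subset[feature_cols])
    labels = subset[target_col]

    # Step 2: hold out 20% of the rows for testing.
    features_train, features_test, labels_train, labels_test = \
        train_test_split(features, labels, test_size=0.2)

    # Step 3: train a decision tree regressor with default settings.
    model = DecisionTreeRegressor()
    model.fit(features_train, labels_train)

    # Step 4: predict on the held-out rows and compute the error.
    predictions = model.predict(features_test)
    return float(mean_squared_error(labels_test, predictions))


# Made-up toy data (NOT the homework dataset), including one missing value.
toy = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004],
    'degree': ['high school', "bachelor's"] * 5,
    'sex': ['F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M'],
    'total': [88.0, 29.0, 87.5, 30.1, None, 28.4, 86.9, 31.0, 89.2, 27.8],
})
toy_mse = fit_and_predict_sketch(toy, ['year', 'degree', 'sex'], 'total')
print(toy_mse)
```

Because train_test_split shuffles the rows randomly, the printed error will likely change from run to run.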

You do not need to do anything fancy like finding the optimal settings for parameters to maximize performance. We just want you to start simple and train a model from scratch! The reference below has all the methods you will need for this section!

scikit-learn Reference

You can find our reference sheet for machine learning with scikit-learn here. This reference sheet has information about general scikit-learn calls that are helpful, as well as how to train the tree models we talked about in class.

Development Strategy

Like in Part 1, it can be difficult to write tests for this section. Machine learning is all about uncertainty, and it's often hard to write a test that tells you whether a result is right. This requires diligence and being very careful with the method calls you make. To help you with this, we've provided some alternative ways to gain confidence in your result:

  • Print your test y values and your predictions to compare them manually. They won't be exactly the same, but you should notice that they have some correlation. For example, I might be concerned if my test y values were [2, 755, …] and my predicted values were [1022, 5...] because they do not seem to correlate at all.
  • Calculate your mean squared error on your training data as well as your test data. The error should be lower on your training data than on your testing data.
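
Both checks can be sketched as below. This is only an illustration on synthetic numbers, not the homework data: the arrays, the fixed seeds, and all variable names are assumptions chosen so the example is self-contained and repeatable.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear trend.
rng = np.random.default_rng(0)
features = np.arange(50).reshape(-1, 1)
labels = 2 * features.ravel() + rng.normal(0, 5, size=50)

# random_state is fixed here only to make the sketch repeatable.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2, random_state=0)

model = DecisionTreeRegressor()
model.fit(features_train, labels_train)

# Check 1: print true test values next to predictions and eyeball them.
test_predictions = model.predict(features_test)
for actual, predicted in zip(labels_test[:5], test_predictions[:5]):
    print(round(actual, 1), round(predicted, 1))

# Check 2: compare the error on the training data with the test error.
train_mse = mean_squared_error(labels_train, model.predict(features_train))
test_mse = mean_squared_error(labels_test, test_predictions)
print('train MSE:', train_mse)
print('test MSE:', test_mse)
```

An unconstrained decision tree can memorize training rows whose inputs are all distinct, so a training error at or near zero is expected here and is not a sign of a bug by itself.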