The content for this lesson is adapted from material by Hunter Schafer.
Objectives¶
Earlier, we introduced decision trees, one type of machine learning algorithm that learns a series of if conditions on the features to identify the label for each example. Decision trees are great for visualization since we can see exactly how each tree produces its outcomes. We evaluated decision trees on the basis of accuracy (the fraction of correct classifications) and mean squared error (the error metric for regression).
In this lesson, we’ll build on this foundation with an emphasis on evaluating model quality. By the end of this lesson, students will be able to:
- Apply get_dummies to produce a one-hot encoding for categorical features.
- Apply train_test_split to randomly split a dataset into a training set and a test set.
- Evaluate machine learning models in terms of overfitting and underfitting.
Setting up¶
To follow along with the code examples in this lesson, please download the files in the zip folder here:
Make sure to unzip the files after downloading! The following are the main files we will work with:
- lesson16.ipynb
- BankChurners.csv
- MLforMushrooms.ipynb
- mushrooms.csv
Machine Learning Pipeline, Revisited¶
So we know how to write code to load in data, separate it into features and labels, train a model on that data, and assess its accuracy. These steps are part of a common machine learning pipeline that describes each data transformation from input to output in a machine learning model.
Last time, we saw that our model had 100% accuracy on the data. In this lesson, we’ll learn why that’s not actually ideal.
Let’s first review the machine learning pipeline by considering a dataset concerning customers of a particular credit card agency and whether those customers decide to close their credit card. Let’s predict the label for the Attrition_Flag column where:
- 'Existing Customer' kept their account.
- 'Attrited Customer' closed their account.
import pandas as pd
data = pd.read_csv('BankChurners.csv')
data
Now this already looks slightly different than our previous dataset!
- Some of the columns, like 'Gender' and 'Educational Data', don’t store numbers but rather categorical data like 'F' or 'M'.
- Some of the rows contain missing values.
Note: We’ll find that categorical data is quite nice to work with in some respects since we know each row will be a value from some known categories. However, categorical data is inherently limited to whatever categories we choose to represent or collect. For example, the 'Gender' column in this dataset only allows for "M" and "F". This excludes and mislabels anyone who does not adhere to the binary concept of gender. So it’s important to note that the data you have is only as representative and as correct as the process used to collect that data.
Our first step is to remove the missing values because sklearn does not handle missing values. Let’s drop a row if it contains NaN values in any column and save the result!
data = data.dropna()
In this case, we want to use all the columns as features, so data.dropna() is appropriate. In other situations where you only need a few features, make sure to slice only the features you want to keep before calling dropna, as in the sketch below.
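For example, a minimal sketch of that pattern might look like the following (the two columns selected here are just for illustration; in this lesson we keep every column):
# Hypothetical alternative: select only the columns you plan to use first,
# so rows are dropped only when *those* columns are missing values.
columns_of_interest = ['Gender', 'Attrition_Flag']  # illustrative choice of columns
subset = data[columns_of_interest].dropna()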
Our next step is to separate the features and labels.
features = data.loc[:, data.columns != 'Attrition_Flag']
labels = data['Attrition_Flag']
We then create a decision tree model and train it on the data.
# Import the model and accuracy_score
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
# Create an untrained model
model = DecisionTreeClassifier()
# Train it
model.fit(features, labels)
# Make predictions
label_predictions = model.predict(features)
# Print accuracy
print('Accuracy:', accuracy_score(labels, label_predictions))
Food for thought: Why do we use a Classifier here?
The dataset has categorical values for some columns and quantitative values for other columns. This ends up causing problems for sklearn since it assumes all feature values are numeric! How do we handle categorical features?
Categorical Features¶
For this section, let’s suppose we had a smaller dataset that only stored a column for age and another column for gender.
|  | age | gender |
|---|---|---|
| 0 | 22.0 | male |
| 1 | 38.0 | female |
| 2 | 26.0 | non-binary |
| 3 | 35.0 | female |
| 4 | 35.0 | male |
Let’s consider some strategies for how to address the categorical gender column. Maybe we could come up with some mapping between the categories to a number. If the gender values can be male, female, or non-binary, then we might transform the data such that
- 1 represents female
- 2 represents male
- 3 represents non-binary
While doing this will allow us to train the model, this will cause some unintended consequences for how the model interprets those features. By treating each gender as a number, we are now establishing relationships between the genders that should behave like numbers do. So now we have some non-sensical concepts like:
- 'male' (2) is considered “greater than” 'female' (1), just like how an age of 10 is less than an age of 32.
- 'non-binary' (3) is somehow the same as 3x 'female' (1), just like how an age of 40 is 4x an age of 10.
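To make this concrete, here is a minimal sketch of that numeric mapping, using a hand-built DataFrame that mirrors the small sample table above (the data and mapping are purely for illustration):
import pandas as pd

# Hand-built copy of the small sample shown above, purely for illustration
sample = pd.DataFrame({'age': [22.0, 38.0, 26.0, 35.0, 35.0],
                       'gender': ['male', 'female', 'non-binary', 'female', 'male']})

# The naive strategy: map each category to an arbitrary number.
# This will train, but it implies an ordering and scale that do not exist.
sample['gender'] = sample['gender'].map({'female': 1, 'male': 2, 'non-binary': 3})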
Although a DecisionTreeClassifier might be able to treat each value as a unique category, other approaches such as linear regression might depend on multiplying the numeric value to determine a result. Complicated interactions can unfold at the intersection between the features and model.
To ensure that gender is encoded as separate categories, use a one-hot encoding. Instead of encoding the values in a single column that collapses the data into one numeric dimension, transform that one column to three: one for gender_male, one for gender_female and one for gender_non-binary. Each column will store 0s and 1s to indicate which category that row belonged to.
After doing a one-hot encoding of the sample data shown above, the gender column will be expanded into three columns as shown in the table below. Notice that each row has a single 1 and the rest 0s across the newly-created gender_<value> columns; this is why it’s called a “one-hot encoding”: exactly one of the columns is “hot” (set to 1).
|  | age | gender_female | gender_male | gender_non-binary |
|---|---|---|---|---|
| 0 | 22.0 | 0 | 1 | 0 |
| 1 | 38.0 | 1 | 0 | 0 |
| 2 | 26.0 | 0 | 0 | 1 |
| 3 | 35.0 | 1 | 0 | 0 |
| 4 | 35.0 | 0 | 1 | 0 |
pandas provides a function called get_dummies that performs a one-hot encoding or “dummy” encoding.
features = pd.get_dummies(features)
features.columns
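To see what get_dummies is doing on a small scale, here is a minimal sketch that applies it to a hand-built DataFrame mirroring the earlier age/gender sample (constructed here purely for illustration):
import pandas as pd

# Hand-built copy of the small age/gender sample from earlier
sample = pd.DataFrame({'age': [22.0, 38.0, 26.0, 35.0, 35.0],
                       'gender': ['male', 'female', 'non-binary', 'female', 'male']})

# Numeric columns are left alone; each categorical column is expanded into
# one indicator column per category (shown as 0/1 above; depending on your
# pandas version the values may display as True/False).
print(pd.get_dummies(sample))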
Note that we’re only one-hot encoding features, not labels! sklearn is fine with categorical labels. We only need to transform the features.
Assessing Accuracy¶
So now that we have a strategy for handling categorical data, let’s return to the question of assessing model effectiveness! In the cell below, we train the model and evaluate its accuracy on the data.
# Create an untrained model
model = DecisionTreeClassifier()
# Train it
model.fit(features, labels)
# Make predictions
label_predictions = model.predict(features)
# Print accuracy
print('Accuracy:', accuracy_score(labels, label_predictions))
100% accuracy! Unfortunately, the story gets more complicated than this and we should not expect the model to predict 100% accurately on future data. The goal of machine learning is not to overfit the training set by memorizing the labels for the data that we already have, but to learn a model that generalizes to future, unseen data.
The analogy for why this happens is much like a common experience students have when studying for an exam. Suppose you have a multiple choice quiz tomorrow, but you can use a practice exam (with answers) to study tonight. You might spend a few hours studying until you feel really confident about the answers on the practice exam. If you were able to get 100% of the practice questions correct after studying them for a few hours, would you also expect to get 100% of the real test questions correct?
We might be pessimistic and expect that you would not get 100% on the quiz. An “easy” way to get 100% on the practice exam is to just memorize the answers for those specific questions. If the real quiz only consisted of questions from the practice test, then you would get 100% accuracy! But if the real exam had some new questions, and you didn’t know how to solve any novel problems since you just memorized answers to that one practice exam, you probably wouldn’t do so well.
This is essentially what is happening to our model! We gave it the practice exam (the training set) to train against and then used the exact same practice exam (the training set) to evaluate the model. This will be an overestimate of how it will do in the future since it was able to shape its knowledge around that exact training set.
Instead, if we want to estimate how we think the model will do in the future, we will need to hold out data that it will never see during training and then test it on that data set. We call this held-out data the test set. This is one of the most important rules in all of machine learning:
You need to set aside some data that you never look at while training to assess how your model will do in the future.
This is so important that sklearn provides a function called train_test_split to break your dataset up (randomly) into a training set and test set. It’s important to use a random split, since some data can be naturally sorted by a column which would cause a difference in distribution between training and test. For example, in this credit card dataset, if we took the last 20% of the rows to be the test set, that would be mostly the Attrited Customers examples since they are sorted by the label in the original data.
train_test_split returns a 4-element tuple representing the train/test features and the train/test labels.
from sklearn.model_selection import train_test_split
# Breaks the data into 80% train and 20% test
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.2)
# Print the number of training examples and the number of testing examples
print(len(features_train), len(features_test))
Now we repeat our ML pipeline from before with two important differences:
- We now train on features_train and labels_train instead of the entire dataset.
- We print out both the accuracy on the training set and the accuracy on the test set.
# Create an untrained model
model = DecisionTreeClassifier()
# Train it on the **training set**
model.fit(features_train, labels_train)
# Compute training accuracy
train_predictions = model.predict(features_train)
print('Train Accuracy:', accuracy_score(labels_train, train_predictions))
# Compute test accuracy
test_predictions = model.predict(features_test)
print('Test Accuracy:', accuracy_score(labels_test, test_predictions))
Wow! While we got 100% of the training examples correct, we only got ~68-72% accuracy on the test set! Note there is some randomness in your result since the train/test split is random and the learning algorithm used by the decision tree classifier also has some randomness built into it.
This test accuracy is a much better evaluation of how our model will do in the future if we released it to the world since it is being tested on data it hasn’t seen before.
Why is there this discrepancy between train and test anyway? It results from the fact that our model is overfitting. Overfitting happens when a model does too good of a job fitting to the specific training set given to it rather than learning generalizable patterns. In the test-taking analogy, a student overfits to a practice exam if they focus on memorizing the answers on the practice test rather than learning the generalized ideas.
Why do Trees Overfit?¶
Why does the tree overfit in this context and how can we prevent that? Overfitting happens for decision trees by default since we allow the tree to grow really tall and memorize the training set!
If you think back to how the decision tree is learned, a tree successively decides how to split the data based on some feature and then builds up the tree by separating the data as it goes down. By the time the data reaches the bottom levels of the tree, the bottom-level splits are operating on very few data points. These bottom-level splits encode specific details of the training set that don’t generalize to the test set.
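To see how tall an unconstrained tree actually grows, a fitted DecisionTreeClassifier provides get_depth and get_n_leaves; a minimal sketch using the model trained above might look like:
# The unconstrained tree typically grows very deep, with many leaves that
# each cover only a handful of training examples.
print('Tree depth:', model.get_depth())
print('Number of leaves:', model.get_n_leaves())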
To control for overfitting, we can specify the max_depth hyperparameter to control the height of the tree. A hyperparameter is a value that the machine learning algorithm designer sets to determine the complexity of the model. By passing in 2, we are only allowing the model to be tall enough to reach an answer in at most 2 questions.
# Create an untrained model
short_model = DecisionTreeClassifier(max_depth=2)
# Train it on the **training set**
short_model.fit(features_train, labels_train)
# Compute training accuracy
train_predictions = short_model.predict(features_train)
print('Train Accuracy:', accuracy_score(labels_train, train_predictions))
# Compute test accuracy
test_predictions = short_model.predict(features_test)
print('Test Accuracy:', accuracy_score(labels_test, test_predictions))
from IPython.display import Image, display
import graphviz
from sklearn.tree import export_graphviz
def plot_tree(model, features, labels):
    """Visualize a trained decision tree using graphviz."""
    dot_data = export_graphviz(model, out_file=None,
                               feature_names=features.columns,
                               class_names=labels.unique(),
                               impurity=False,
                               filled=True, rounded=True,
                               special_characters=True)
    # render writes the image to 'tree.gv.png' in the current directory
    graphviz.Source(dot_data).render('tree.gv', format='png')
    display(Image(filename='tree.gv.png'))
# showing the tree
plot_tree(short_model, features_train, labels_train)
Interestingly, the train accuracy and test accuracy have now gotten closer together. It still makes sense that the train accuracy is slightly higher than the test since we would expect to do better on the “practice exam”, but by limiting the depth we prevented the model from overfitting and achieved a better test accuracy!
By setting the max_depth hyperparameter, we’re controlling the model complexity. Short trees are not complex, which helps to avoid overfitting. But if we set the max_depth too low (such as 0 or 1), the model might not be able to capture the underlying patterns and complexity in the data. In other words, short trees can underfit the data. On the other hand, trees that are too tall will overfit to the training set.
The right choice is somewhere in the middle, but there is no principled answer for where. It really depends on your data and problem. This tradeoff is commonly called the “bias-variance tradeoff” where “bias” refers to models that are too simple (will underfit) and “variance” refers to models that are too complex (will overfit).
To summarize:
- Models that are too simple generally have low train accuracy and low test accuracy because they are not able to learn a complex-enough set of rules.
- Models that are too complex generally have high train accuracy and low test accuracy because they overfit to the training set.
The following code cell tries to demonstrate this phenomenon empirically by training models with various max_depths and plotting their train and test accuracies as a function of max_depth. You do not need to understand how this code works, but you should understand the visualization.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
%matplotlib inline
# We re-split the data and put a "random state" to make the results
# not actually random for demonstration purposes.
# You should not use random_state in your assessments or project.
features_train, features_test, labels_train, labels_test = \
train_test_split(features, labels, test_size=0.2, random_state=2)
accuracies = []
for i in range(1, 30):
    model = DecisionTreeClassifier(max_depth=i, random_state=1)
    model.fit(features_train, labels_train)
    pred_train = model.predict(features_train)
    train_acc = accuracy_score(labels_train, pred_train)
    pred_test = model.predict(features_test)
    test_acc = accuracy_score(labels_test, pred_test)
    accuracies.append({'max depth': i, 'train accuracy': train_acc,
                       'test accuracy': test_acc})
accuracies = pd.DataFrame(accuracies)
# Define a function to plot the accuracies
def plot_accuracies(accuracies, column, name):
    """
    Parameters:
        * accuracies: A DataFrame showing the train/test accuracy for various max_depths
        * column: Which column to plot (e.g., 'train accuracy')
        * name: The display name for this column (e.g., 'Train')
    """
    sns.relplot(kind='line', x='max depth', y=column, data=accuracies)
    plt.title(f'{name} Accuracy as Max Depth Changes')
    plt.xlabel('Max Depth')
    plt.ylabel(f'{name} Accuracy')
    plt.ylim(0.6, 1)
    plt.show()  # Display the graph
# Plot the graphs
plot_accuracies(accuracies, 'train accuracy', 'Train')
plot_accuracies(accuracies, 'test accuracy', 'Test')
From these graphs, we can see this bias-variance tradeoff empirically.
- When the model is too simple (max_depth is near 1), the training accuracy and test accuracy are closer together, but both are quite low compared to our initial 100% claim on the training accuracy. This is commonly called underfitting.
- As the complexity increases (max_depth increases), the train and test accuracies do slightly different things:
  - The training accuracy is always going up. This is because we are giving the model more room to grow, resulting in a more complex model. This extra complexity allows it to perfectly fit the training set in the extreme of max_depth > 20.
  - The test accuracy is very interesting. It starts by increasing to its highest point around max_depth=5 and then starts decreasing. This is because before max_depth=5 the model was too simple for the task (i.e., it underfits), but after max_depth=5 the model starts to overfit (i.e., it becomes too complex for the task).
Note that decision trees are just one example of a machine learning algorithm. There are many more advanced machine learning algorithms, and they often have more than one hyperparameter to tune. To help find the best combination of hyperparameter values, sklearn includes functions that search over candidate hyperparameter settings.
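For example, one such helper is GridSearchCV from sklearn.model_selection, which tries every combination of hyperparameter values you give it and evaluates each one with cross-validation on the training set. A minimal sketch (the particular grid of max_depth values here is just an example) might look like:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate hyperparameter values to try (an illustrative grid)
param_grid = {'max_depth': [2, 3, 5, 10, 20]}

# Try each setting with 5-fold cross-validation on the training set
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(features_train, labels_train)

# The setting that scored best during cross-validation
print(search.best_params_)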
When should we use ML?¶
ML has both an awesome potential to solve problems that we didn’t think would be possible before, but it also has a terrifying risk of creating new problems on its own in how it is applied. Unfortunately, there is no “easy answer” to what tasks you should/shouldn’t apply ML to. It requires critical thought and careful consideration of the benefits and harms the model might produce and thinking of how the ML model will be used in the greater system as a whole.
We won’t be able to provide an easy answer in this lesson or this course, but it’s important to pause and bring up some questions and things to consider before applying ML to a task at hand. In a later class, we will talk about some concrete case studies that can help serve as “anchors” for comparison, but here we outline some key things to think about:
- Use: How will the model be used?
- Impact of Errors: What is the impact the model will have on individuals or groups of people? What is the cost of an error?
- Biases: What biases might be present in the data?
- Feedback Loops: How will the model’s use reinforce these biases?
Importantly, these are not the only questions that you should ask, but they do serve as a good starting point for your consideration. They also aren’t questions that you can necessarily ask in a linear order, since your thoughts on one might affect your thoughts on another.
We also want to highlight that we don’t necessarily have answers to all of these questions, but they serve as a way to start discussions. These are all things you should at least start thinking about before trying to apply ML to a particular task. The fact that these questions are tough to answer should hopefully make it clear that you should carefully consider which problems you apply ML to.
Use¶
How will the model be used?
In trying to answer “Should we use a model to predict credit card churn?” or “Is it ethical to use a model that predicts credit card churn?”, an incredibly important thing to ask is: “How will the model ultimately be used?”
For this credit card churn example, here are two wildly different ways this ML system could be used by a credit card agency.
- Try to predict if an existing customer will churn and if it seems likely they will churn, try to provide a special offer or deal to make them more likely to stay.
- Try to predict if a new customer is likely to churn in a short amount of time and if so, don’t let them open an account in the first place.
Notice that the answer to “Should I use a model for this task?” might vastly depend on how that model is used. The benefits/drawbacks of using the model depend on the context in which it is used. Is there something different between those cases that might make us think it’s more okay to use the model in one case than the other? Is it okay to use the model in both cases, or neither? Why?
Impact of errors¶
What is the impact the model will have on individuals or groups of people? What is the cost of an error?
One fact you must know about ML is that any ML system will make errors. A “perfect predictor” is nearly impossible to achieve; no one actually expects to reach perfection, only to approach it. What level of accuracy would you require before feeling comfortable using a model in our credit card example? Does that accuracy requirement depend on the use case: giving special offers before someone leaves vs. deciding whether or not they can get a credit card in the first place?
Importantly, consider the cost of making an error. This is not asking “what if an error happens?” Errors are guaranteed to happen, so you must have a good idea of what the negative impact is when they do. In our credit card example, the error in the first use case (not giving someone a special offer and they churn) seems much less harmful than the error in the second use case (not giving someone access to a line of credit at all).
When considering bias in the data in the next section, is there a risk that the biases present will disproportionately place the cost of an error on a particular group or demographic?
Biases¶
What biases might be present in the data?
A common phrase in ML is “Garbage in → Garbage out” when discussing the quality of data you use to train a model. Most people take “garbage” to mean how poorly the features describe the data, but there is usually a much more insidious notion of garbage behind the scenes.
Data comes from the real world, and the real world is not equally fair to all people. There are systematic and individual biases that have caused a disproportionate impact on historically and presently marginalized communities. These biases can find their way into our datasets because the data is drawn from the real world where these biases are present and this encoded bias in the data could then be perpetuated by what the model learns from the data. When you hear in the news of ML models exhibiting biased behavior, it normally comes from the fact that they are mirroring the biases found in the data fed to them.
So to think a little more concretely about our credit card example, we might ask why people churn their credit cards, and whether we have reason to believe that some implicit or systematic bias can affect the trends seen in the data. Will those biases have an impact on the overall outcomes and potentially hurt particular groups or individuals? Is churn the result of affluent individuals trying to “game” the credit card system, or is churn a sign of economic hardship and losing the ability to afford a line of credit? Is it possible for something like someone’s race or gender (something they have no control over) to impact either of these causes of churn (short answer: yes)? This definitely relates back to how the model might be used, since the impact of the model depends on the context in which it is used.
Another potential source of bias is that the structure of the data itself might exclude certain populations. Whether in how the data was collected (the forms customers fill out) or in what the dataset publisher decides to publish, data is always a reduction of reality to something we can try to use in computation. For example, the data presents a Gender column that allows a binary choice between M and F, when in reality, gender is a much more complicated notion. There is a tension between simple data we can analyze (something categorical like M/F) and something more open-ended that can account for the diversity of human experience. You should think carefully about who is represented and who is excluded in your data.
Feedback Loops¶
How will the model’s use reinforce these biases?
A very pernicious way these biases can sneak into an ML system and its use comes in the form of what is known as a “feedback loop”. This is a bit harder to see in our credit card example, so we will switch the example for this section to something known as “predictive policing”. The idea is to better allocate police resources by having more police patrol in areas with higher reported incidences of crime, in an effort to prevent more violent crimes from happening.
While it sounds great in theory to try not to waste police resources (since that costs money that could go toward resources other than policing), this can create very large unintended side effects. Cathy O’Neil describes this phenomenon well in her book Weapons of Math Destruction, where she highlights that a well-intended system to reduce violent crimes (e.g., burglaries, assault, murder) can reinforce systems of oppression when the models the police use include all crime data, including low-level or “nuisance crimes”. O’Neil writes:
These nuisance crimes are endemic to many impoverished neighborhoods. In some places police call them antisocial behavior, or ASB. Unfortunately, including them in the model threatens to skew the analysis. Once the nuisance data flows into a predictive model, more police are drawn into those neighborhoods, where they’re more likely to arrest more people. After all, even if their objective is to stop [list of gruesome crimes], they’re bound to have slow periods. It’s the nature of patrolling. And if a patrolling cop sees a couple of kids who look no older than sixteen guzzling from a bottle in a brown bag, he stops them. These types of low level crimes populate their models with more and more dots, and the models send the cops back to the same neighborhood.
This creates a pernicious feedback loop. The policing itself spawns new data, which justifies more policing. And our prisons fill up with hundreds of thousands of people found guilty of victimless crimes. Most of them come from impoverished neighborhoods, and most are black or Hispanic. So even if a model is color blind, the result of it is anything but. In our largely segregated cities, geography is a highly effective proxy for race.
How the model is used can create a feedback loop, which can further exacerbate biases found in the data when the results of the model affect the data collected in the future. If biases are present in the original data, feedback loops have the potential to quickly amplify them. Even if the bias is small to start, a feedback loop can grow it into a completely broken system.
⏸️ Pause and 🧠 Think¶
Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.
Here’s what we covered in this lesson:
- Using categorical features
- One-hot encoding
- Training/testing split
- Updated ML Pipeline
- ML applications and ethics
Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:
- In your own words, write a few sentences summarizing what you learned in this lesson.
- What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
- What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
- Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.
In-Class¶
When you come to class, we will work together on answering some discussion questions and completing MLforMushrooms.ipynb! Make sure you have a way of opening and running this file.
Canvas Quiz¶
All done with the lesson? Complete the Canvas Quiz linked here!