Learning Algorithms¶

Inspired by Sam Lau, who co-authored the Learning Data Science book.

In this lesson, we'll introduce machine learning from the ground up. By the end of this lesson, students will be able to:

  • Describe the difference between traditional algorithms and machine learning algorithms.
  • Identify the components of a machine learning model, as well as the features and labels of a dataset.
  • Apply sklearn to train a decision tree for classification and regression tasks.

A while back, we discussed data visualization using the Puget Sound Clean Air Agency's EPA-grade air quality sensors (AQS). However, these sensors are typically expensive, costing anywhere between \$15,000 and \$40,000 each, which makes it hard to deploy them in large numbers. Furthermore, EPA-grade AQS measurements also undergo calibration and accuracy checks that delay them by one or two hours, so the data is very accurate but not necessarily timely.

In contrast, "PurpleAir makes sensors that empower Community Scientists who collect hyper-local air quality data and share it with the public." In this lesson, we'll learn how we can use more accurate but less timely AQS measurements to calibrate the less accurate but more timely PurpleAir sensor (PAS) measurements so that we can provide the best information to the general public. The concepts in this lesson are actually used in the real world when you visit the EPA AirNow Fire and Smoke Map: the PAS data in this map are calibrated using the approach we will learn today.

In [2]:
import pandas as pd
import seaborn as sns

sns.set_theme()

Our dataset includes over 12,000 matched observations where we've paired each AQS measurement with a nearby PAS measurement, along with 3 other variables that experts have identified as potentially impacting PAS measurement quality. The dataset includes 5 columns:

  • The very accurate EPA-grade air quality sensor (AQS) measurement of the PM2.5 value.
  • The temperature in degrees Celsius.
  • The relative humidity as a percentage between 0% and 100%.
  • The dew point, where a higher dew point means more moisture in the air.
  • The less accurate but more timely PurpleAir sensor (PAS) measurement of the PM2.5 value.
In [2]:
sensor_data = pd.read_csv("sensor_data.csv")
sensor_data
Out[2]:
AQS temp humidity dew PAS
0 6.7 18.027263 38.564815 3.629662 8.616954
1 3.8 16.115280 49.404315 5.442318 3.493916
2 4.0 19.897634 29.972222 1.734051 3.799601
3 4.7 21.378334 32.474513 4.165624 4.369691
4 3.2 18.443822 43.898226 5.867611 3.191071
... ... ... ... ... ...
12092 5.5 -12.101337 54.188889 -19.555834 2.386120
12093 16.8 4.159967 56.256030 -3.870659 32.444987
12094 15.6 1.707895 65.779221 -4.083768 25.297018
12095 14.0 -14.380144 48.206481 -23.015378 8.213208
12096 5.8 5.081813 52.200000 -4.016401 9.436011

12097 rows × 5 columns

How can we use the PAS measurement to predict the AQS measurement? Or, should we frame our research question the other way around?

In [ ]:
# Do we want to predict AQS from the PAS value?
feature = "PAS"
target  = "AQS"
In [3]:
# Or predict PAS from the AQS value?
feature = "AQS"
target  = "PAS"

Let's visualize the relationship between these two variables by creating a scatter plot.

In [4]:
sns.relplot(sensor_data, x=feature, y=target)
Out[4]:
<seaborn.axisgrid.FacetGrid at 0x7d31515dce10>
[Figure: scatter plot of the target (PAS) against the feature (AQS)]

Guessing game¶

One way to predict the target variable from the feature variable is by choosing a line that best describes the trend, called a regression line.

In [8]:
def plot_line(slope, intercept=0):
    grid = sns.relplot(sensor_data, x=feature, y=target)
    max_x = sensor_data[feature].max()
    grid.facet_axis(0, 0).plot([0, max_x], [intercept, slope * max_x + intercept])
    grid.set(title=f"Slope = {slope:.2f}, intercept = {intercept:.2f}")
    return grid


plot_line(2.5)
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x7d315138b4d0>
[Figure: scatter plot with a guessed line of slope 2.50 and intercept 0.00]

We can visualize all our guesses so far by plotting them against their mean squared errors on a loss surface.

In [ ]:
# We need a metric to measure how good of a line we have. For a guessed slope,
# compute the squared error at each point and average them (the mean squared error).
slope = 2.5
squared_errors = (sensor_data[feature] * slope - sensor_data[target]) ** 2
mean_squared_error = sum(squared_errors) / len(squared_errors)
mean_squared_error
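Mean absolute error is another common loss metric. As a brief sketch (assuming the same sensor_data, feature, and target from above), we can compute both metrics for the same guessed slope using sklearn.metrics:

In [ ]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Predictions from a guessed line with slope 2.5 and intercept 0
predictions = sensor_data[feature] * 2.5
print("Mean absolute error:", mean_absolute_error(sensor_data[target], predictions))
print("Mean squared error: ", mean_squared_error(sensor_data[target], predictions))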
In [28]:
def plot_loss(slopes):
    from sklearn.metrics import mean_squared_error, mean_absolute_error
    losses = [mean_squared_error(sensor_data[feature] * s, sensor_data[target]) for s in slopes]
    grid = sns.relplot(x=slopes, y=losses)
    grid.set(title="Loss surface", xlabel="Slope", ylabel="MSE", xlim=[1, 3], ylim=[0, None])
    return grid


plot_loss([2, 2.1, 2.2, 2.5, 1.9, 1.8, 2.15, 2.08, 2.4, 1.5, 1.25, 1.3, 1.6, 1.65])
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x7d3147e2c2d0>
[Figure: loss surface plotting MSE against each guessed slope]

Gradient descent¶

What differentiates machine learning from just random guessing is the use of an algorithm to find the best slope. How do we write a machine learning algorithm that selects the best possible line according to mean squared error? One way to do this is gradient descent: using concepts from calculus and linear algebra, we select a random initial theta (slope) value and then "roll down the hill" toward the minimum value at the bottom of the bowl by following the gradient of the mean squared error. We can express this using numpy, a numeric computation module for Python that is a building block for pandas and sklearn (as we'll see later).

$$ \nabla_{\!\theta}\; \text{MSE}(\boldsymbol{\theta}, \mathbf{X}, \mathbf{y}) = -\frac{2}{n}(\mathbf{X}^\top \mathbf{y} - \mathbf{X}^\top \mathbf{X} \boldsymbol{\theta}) $$

In [13]:
import numpy as np


def grad_mse(theta, X, y):
    # Gradient of the mean squared error with respect to theta (the formula above)
    return np.array(-2 / len(X) * (X.T @ y - X.T @ X * theta))


thetas = [np.random.random()]
print("Random initial theta value:", thetas[-1])
Random initial theta value: 0.03631116004542245
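As a quick sanity check (a sketch, not part of the original lesson), we can compare grad_mse against a central finite-difference approximation of the MSE gradient at this initial theta:

In [ ]:
# Numerically approximate d(MSE)/d(theta) and compare it to the analytic gradient.
def mse(theta, X, y):
    return ((X * theta - y) ** 2).mean()

theta0 = thetas[-1]
eps = 1e-6
numeric = (mse(theta0 + eps, sensor_data[feature], sensor_data[target])
           - mse(theta0 - eps, sensor_data[feature], sensor_data[target])) / (2 * eps)
analytic = grad_mse(theta0, sensor_data[feature], sensor_data[target])
print("Numerical approximation:", numeric)
print("Analytic gradient:      ", float(analytic))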
In [14]:
plot_line(thetas[-1])
Out[14]:
<seaborn.axisgrid.FacetGrid at 0x7d31482efc90>
[Figure: scatter plot with the line for the random initial theta]

We can then take a small step in the opposite direction of the gradient to roll down the hill until we converge on a good guess. To make this a machine learning algorithm, we simply put the update step in a loop until the theta values no longer noticeably change.

In [26]:
plot_line(thetas[-1])
plot_loss(thetas)

# Take a small step in the opposite direction of the gradient to roll downhill
thetas.append(thetas[-1] - 0.002 * grad_mse(thetas[-1], sensor_data[feature], sensor_data[target]))
[Figure: scatter plot with the current line, followed by the loss surface showing each theta visited so far]
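The cell above takes a single step each time it runs; the paragraph before it describes turning this into a loop. Here is a minimal sketch of that loop, assuming the grad_mse, sensor_data, feature, and target defined earlier and a fixed learning rate of 0.002:

In [ ]:
thetas = [np.random.random()]
for _ in range(10_000):
    # Take a small step in the opposite direction of the gradient.
    step = 0.002 * grad_mse(thetas[-1], sensor_data[feature], sensor_data[target])
    thetas.append(thetas[-1] - step)
    # Stop once the theta values no longer noticeably change.
    if abs(thetas[-1] - thetas[-2]) < 1e-8:
        break

print("Converged slope:", float(thetas[-1]))
plot_line(float(thetas[-1]))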

Linear regression models¶

What we've just described is the gradient descent algorithm for fitting a linear regression model. A linear regression model is a machine learning model that predicts a numeric value (like AQS measurements) using a linear combination of coefficients and features (columns from the training dataset). scikit-learn provides an easy way to define a linear regression model, fit it to our training dataset X and target values y, and examine the coefficients to look inside the model.

In [29]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X = sensor_data[[feature]]
y = sensor_data[target]
# Linear regression algorithm
# When trained on data (i.e. you've picked a slope), you have a linear regression model
reg = LinearRegression().fit(X, y)

print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
plot_line(reg.coef_[0], reg.intercept_)
Model: -3.14 + 1.93(AQS)
Error: 30.259833701206063
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x7d3147cb4b10>
[Figure: scatter plot with the fitted least-squares regression line]

This procedure is more or less how lmplot works!

In [30]:
sns.lmplot(sensor_data, x=feature, y=target)
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x7d3147dcee90>
[Figure: lmplot scatter plot with its fitted regression line]

While lmplot is nice for drawing the regression line when all we care about is predicting the target from a single feature, the advantage of defining our own LinearRegression model with sklearn is that we can include other variables to reduce the loss. The final model that the EPA uses has two features: the sensor measurement and the relative humidity.

In [34]:
# Include the additional features: temp, humidity, dew
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X = sensor_data[[feature, "temp", "humidity", "dew"]]
y = sensor_data[target]
reg = LinearRegression().fit(X, y)

print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
Model: -11.57 + 1.91(AQS) + 0.02(temp) + 0.17(humidity) + -0.04(dew)
Error: 23.01902262685766
In [36]:
# Drop dew and keep temp and humidity as additional features
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Question: Which features are actually useful for predicting the target value?
# Next time, we'll figure out how we might evaluate which variables to include in our model
X = sensor_data[[feature, "temp", "humidity"]]
y = sensor_data[target]
reg = LinearRegression().fit(X, y)

print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
# The error is in squared units of the target value: (ug/m^3 of PM2.5)^2
Model: -10.35 + 1.91(AQS) + -0.02(temp) + 0.16(humidity)
Error: 23.02688160738098
In [37]:
mean_squared_error(y, reg.predict(X)) ** (0.5)
# Here, we've converted this back to ug/m^3 of PM2.5 by taking the square root
Out[37]:
4.798633306200942
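The EPA's final model mentioned above uses just two features: the sensor measurement and the relative humidity. As a sketch under the same feature/target framing (these coefficients come from fitting our dataset, not from the EPA's published calibration), we can fit that two-feature model with the same workflow:

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = sensor_data[[feature, "humidity"]]
y = sensor_data[target]
reg = LinearRegression().fit(X, y)

print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))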

Classification versus regression¶

Everything we've seen so far falls under the category of regression, where we aim to predict a numeric target value (one column) from one or more features (one or more other columns). The other main category of problems is classification, which is just like regression except that we aim to predict a categorical target value. For example, we might want to answer the question: how can we predict whether a house belongs in NY or SF from its beds, baths, price, year of construction, square footage, price per square foot, and elevation?

In [3]:
homes = pd.read_csv("homes.csv")
homes
# There are also linear classifiers (like logistic regression models)
Out[3]:
beds bath price year_built sqft price_per_sqft elevation city
0 2.0 1.0 999000 1960 1000 999 10 NY
1 2.0 2.0 2750000 2006 1418 1939 0 NY
2 2.0 2.0 1350000 1900 2150 628 9 NY
3 1.0 1.0 629000 1903 500 1258 9 NY
4 0.0 1.0 439000 1930 500 878 10 NY
... ... ... ... ... ... ... ... ...
487 5.0 2.5 1800000 1890 3073 586 76 SF
488 2.0 1.0 695000 1923 1045 665 106 SF
489 3.0 2.0 1650000 1922 1483 1113 106 SF
490 1.0 1.0 649000 1983 850 764 163 SF
491 3.0 2.0 995000 1956 1305 762 216 SF

492 rows × 8 columns

Decision trees are machine learning models that can be used for classification (and, as it turns out, regression too). Decision trees learn a nested if-else logical hierarchy to fit a training dataset. In the following visualization, the color and opacity of each box represents whether that subset of homes is more likely to be in NY or SF. Each line of text within each node indicates:

  1. The if-else condition: if true, go left; if false, go right.
  2. The percentage of samples represented by that node.
  3. Within this sample, the proportion that belong in NY versus SF.
  4. The majority class for that node, corresponding to the larger proportion on line 3.
In [9]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
X = homes[["beds", "bath", "price", "year_built", "sqft", "price_per_sqft", "elevation"]]
y = homes["city"]
# Arguments passed when creating a DecisionTreeClassifier (like max_depth) are "hyperparameters"
clf = DecisionTreeClassifier(max_depth=1).fit(X, y)

import matplotlib.pyplot as plt
plt.figure(dpi=300)
plot_tree(
    clf,
    feature_names=X.columns,
    class_names=["NY", "SF"],
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False
);
[Figure: decision tree diagram with a single split (max_depth=1)]
In [6]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
X = homes[["beds", "bath", "price", "year_built", "sqft", "price_per_sqft", "elevation"]]
y = homes["city"]
# With no max_depth hyperparameter, nodes are expanded until the leaves are (nearly) pure
clf = DecisionTreeClassifier().fit(X, y)

import matplotlib.pyplot as plt
plt.figure(dpi=300)
plot_tree(
    clf,
    feature_names=X.columns,
    class_names=["NY", "SF"],
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False
);
[Figure: decision tree diagram with many nested splits]
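Once trained, the classifier can make predictions for new or existing homes. A minimal sketch using the clf and X fit above:

In [ ]:
# Predict the city for the first five homes and compare to the true labels.
print("Predicted:", list(clf.predict(X.head())))
print("Actual:   ", list(y.head()))

# Accuracy on the training data (an optimistic estimate of real-world performance).
print("Training accuracy:", clf.score(X, y))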

We can use this dataset for regression too. Write a one-line expression to train a DecisionTreeRegressor model that predicts the price of a home in this dataset from all the other numeric variables.

In [5]:
from sklearn.tree import DecisionTreeRegressor

# Drop the target ("price") and the categorical "city" column: sklearn's decision
# trees require numeric features, so fitting with "city" included raises
# ValueError: could not convert string to float: 'NY'
reg = DecisionTreeRegressor().fit(homes.drop(["price", "city"], axis=1), homes["price"])
reg
Out[5]:
DecisionTreeRegressor()
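As a quick follow-up sketch (assuming the reg fit above), we can compare the regressor's predicted prices against the actual prices for the first few homes in the training data:

In [ ]:
# Predict prices for the first five homes and compare to the actual prices.
X_homes = homes.drop(["price", "city"], axis=1)
print("Predicted:", list(reg.predict(X_homes.head())))
print("Actual:   ", list(homes["price"].head()))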

Consider each of the following tasks and answer whether they would be best modeled as classification or regression.

Predict whether an email is spam or not spam.

Classification, since the target value is the category "spam" or "not spam".

Predict the number of views a video will receive based on subscriber count.

Regression, since the target value is a number.

Predict the next word to appear in a sentence.

Classification, since the target value is chosen from the dictionary of all possible next words.