Learning Algorithms¶
Inspired by Sam Lau, who co-authored the Learning Data Science book.
In this lesson, we'll introduce machine learning from the ground up. By the end of this lesson, students will be able to:
- Describe the difference between traditional algorithms and machine learning algorithms.
- Identify the components of a machine learning model and dataset features and labels.
- Apply sklearn to train a decision tree for classification and regression tasks.
A while back, we discussed data visualization using the Puget Sound Clean Air Agency's EPA-grade air quality sensors (AQS). However, these sensors are typically expensive, costing anywhere between \$15,000 and \$40,000 each, making it hard to deploy a large number of them. Furthermore, EPA-grade AQS measurements undergo calibration and accuracy checks that lead to delays of one or two hours, so the data is very accurate but not necessarily timely.
In contrast, "PurpleAir makes sensors that empower Community Scientists who collect hyper-local air quality data and share it with the public." In this lesson, we'll learn how we can use more accurate but less timely AQS measurements to calibrate the less accurate but more timely PurpleAir sensor (PAS) measurements so that we can provide the best information to the general public. The concepts in this lesson are actually used in the real world when you visit the EPA AirNow Fire and Smoke Map: the PAS data in this map are calibrated using the approach we will learn today.
import pandas as pd
import seaborn as sns
sns.set_theme()
Our dataset includes over 12,000 matched observations where we've paired each AQS measurement with a nearby PAS measurement, along with 3 other variables that experts have identified as potentially impacting PAS measurement quality. The dataset includes 5 columns:
- The very accurate EPA-grade air quality sensor (AQS) measurement of the PM2.5 value.
- The temperature in degrees Celsius.
- The relative humidity as a percentage between 0% and 100%.
- The dew point, where a higher dew point means more moisture in the air.
- The less accurate but more timely PurpleAir sensor (PAS) measurement of the PM2.5 value.
How can we use the PAS measurement together with the temperature, humidity, and dew point to predict the AQS measurement?
sensor_data = pd.read_csv("sensor_data.csv")
sensor_data
|  | AQS | temp | humidity | dew | PAS |
|---|---|---|---|---|---|
| 0 | 6.7 | 18.027263 | 38.564815 | 3.629662 | 8.616954 |
| 1 | 3.8 | 16.115280 | 49.404315 | 5.442318 | 3.493916 |
| 2 | 4.0 | 19.897634 | 29.972222 | 1.734051 | 3.799601 |
| 3 | 4.7 | 21.378334 | 32.474513 | 4.165624 | 4.369691 |
| 4 | 3.2 | 18.443822 | 43.898226 | 5.867611 | 3.191071 |
| ... | ... | ... | ... | ... | ... |
| 12092 | 5.5 | -12.101337 | 54.188889 | -19.555834 | 2.386120 |
| 12093 | 16.8 | 4.159967 | 56.256030 | -3.870659 | 32.444987 |
| 12094 | 15.6 | 1.707895 | 65.779221 | -4.083768 | 25.297018 |
| 12095 | 14.0 | -14.380144 | 48.206481 | -23.015378 | 8.213208 |
| 12096 | 5.8 | 5.081813 | 52.200000 | -4.016401 | 9.436011 |
12097 rows × 5 columns
Let's use data visualization to understand this dataset by scatter-plotting PAS against AQS. (Why do we place PAS on the x-axis and AQS on the y-axis?)
sns.relplot(sensor_data, x="PAS", y="AQS")
Guessing game¶
Let's start with a simpler task: How can we use the PAS measurement alone to predict the AQS measurement? To do this, let's choose a line that best describes the trend, or the regression line that is produced by calling lmplot.
sns.lmplot(sensor_data, x="PAS", y="AQS")
def plot_line(slope, intercept=0):
grid = sns.relplot(sensor_data, x="PAS", y="AQS")
grid.facet_axis(0, 0).plot([0, 260], [intercept, slope * 260 + intercept], c="orange")
grid.set(title=f"Slope = {slope:.2f}, intercept = {intercept:.2f}")
return grid
plot_line(0.4)
sensor_data["AQS"]
0         6.7
1         3.8
2         4.0
3         4.7
4         3.2
         ...
12092     5.5
12093    16.8
12094    15.6
12095    14.0
12096     5.8
Name: AQS, Length: 12097, dtype: float64
sensor_data["PAS"] * 0.4 # f(x) = 0.4x
0         3.446781
1         1.397567
2         1.519840
3         1.747876
4         1.276428
           ...
12092     0.954448
12093    12.977995
12094    10.118807
12095     3.285283
12096     3.774405
Name: PAS, Length: 12097, dtype: float64
absolute_error = sensor_data["AQS"] - sensor_data["PAS"] * 0.4
absolute_error
0         3.253219
1         2.402433
2         2.480160
3         2.952124
4         1.923572
           ...
12092     4.545552
12093     3.822005
12094     5.481193
12095    10.714717
12096     2.025595
Length: 12097, dtype: float64
absolute_error[absolute_error < 0]
1081    -1.658926
1088    -1.476082
1325    -0.325128
1328    -0.503032
1581    -0.004646
          ...
11785   -1.914624
11786   -2.722031
11821   -0.029849
12005   -2.131291
12006   -3.051198
Length: 662, dtype: float64
absolute_error.abs().mean()
3.377812356842188
(absolute_error ** 2).mean()
16.815882470137378
What differentiates machine learning from just human guessing is the use of an algorithm to find the best line, which requires a metric for the quality of a trend.
We can visualize all of our guesses so far by plotting them against their mean squared errors on what's called a loss surface.
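Concretely, for a candidate slope $\theta$ (with no intercept, as in our guesses so far), the mean squared error over the $n$ observations is the average squared difference between each true AQS value $y_i$ and the prediction $\theta x_i$ computed from the PAS value $x_i$:
$$ \text{MSE}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \theta x_i \right)^2 $$
This is exactly what (absolute_error ** 2).mean() computed above, and what plot_loss evaluates for each candidate slope below.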
def plot_loss(slopes):
    losses = []
    for slope in slopes:
        error = sensor_data["AQS"] - sensor_data["PAS"] * slope
        losses.append((error ** 2).mean())
    grid = sns.relplot(x=slopes, y=losses)
    grid.set(title="Loss surface", xlabel="Slope", ylabel="MSE", xlim=[0, 1], ylim=[0, None])
    return grid
plot_loss([0.1 * i for i in range(11)])
Gradient descent¶
So how do we write a machine learning algorithm that can optimize this metric and find the minimum mean squared error on the loss surface so that it selects the best possible line? Machine learning scientists apply concepts from calculus and linear algebra to solve this problem: select a random initial theta (slope) value and then roll down the hill toward the minimum value at the bottom of the bowl. We can express this using numpy, a numeric computation module for Python that is a building block for pandas and sklearn (as we'll see later).
$$ \nabla_{\!\theta}\; \text{MSE}(\boldsymbol{\theta}, \mathbf{X}, \mathbf{y}) = -\frac{2}{n}(\mathbf{X}^\top \mathbf{y} - \mathbf{X}^\top \mathbf{X} \boldsymbol{\theta}) $$
import numpy as np
def grad_mse(theta, X, y):
return np.array(- 2 / len(X) * (X.T @ y - X.T @ X * theta))
thetas = [np.random.random()]
print("Random initial theta value:", thetas[-1])
Random initial theta value: 0.7511928515894967
We can then take a small step in the opposite direction of the gradient to roll down the hill until we converge on a good guess. To make this a machine learning algorithm, we simply put the update step in a loop until the theta values no longer noticeably change.
plot_line(thetas[-1])
plot_loss(thetas)
# Take a small step in the opposite direction of the gradient to roll downhill
thetas.append(thetas[-1] - 0.002 * grad_mse(thetas[-1], sensor_data["PAS"], sensor_data["AQS"]))
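Here's a minimal sketch of that loop, assuming the 0.002 step size from the cell above and an illustrative convergence tolerance of 1e-6 (neither value is the only reasonable choice), with an iteration cap as a safety net:
for _ in range(10_000):  # cap on the number of update steps
    # Take a small step in the opposite direction of the gradient
    thetas.append(thetas[-1] - 0.002 * grad_mse(thetas[-1], sensor_data["PAS"], sensor_data["AQS"]))
    # Stop once the theta values no longer noticeably change
    if abs(thetas[-1] - thetas[-2]) < 1e-6:
        break
print("Slope found by gradient descent:", float(thetas[-1]))
plot_loss(thetas)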
Linear regression models¶
What we've just described is the gradient descent algorithm for fitting a linear regression model. A linear regression model is a machine learning model that is used to predict a numeric value (like AQS measurements) using a linear combination of coefficients and features (columns from the training dataset). scikit-learn provides an easy way to define a linear regression model, fit it to our training dataset X and target values y, and examine the coefficients to look inside the model.
f(x) = Ax + b
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X = sensor_data[["PAS"]]
y = sensor_data["AQS"]
reg = LinearRegression().fit(X, y)
print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
plot_line(reg.coef_[0], reg.intercept_)
Model: 3.18 + 0.40(PAS)
Error: 6.343125879204806
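Once fitted, we can also use reg to calibrate new readings directly. As a small sketch (the PAS value of 10 below is hypothetical, not a row from our dataset):
# Predict the calibrated AQS value for a hypothetical PAS reading of 10
reg.predict(pd.DataFrame({"PAS": [10]}))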
This fitting procedure is more or less how lmplot works!
sns.lmplot(sensor_data, x="PAS", y="AQS")
But the advantage of designing our own model is that we can combine other variables to reduce the mean squared error loss. The final model that the EPA uses only takes into account the sensor measurement and the relative humidity, but not any other variables. Later, we'll learn why they made this decision.
X = sensor_data[["PAS", "humidity"]]
y = sensor_data["AQS"]
reg = LinearRegression().fit(X, y)
print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
plot_line(reg.coef_[0], reg.intercept_)
Model: 6.25 + 0.43(PAS) + -0.07(humidity)
Error: 5.144725853340175
Classification versus regression¶
Everything we've seen so far falls under the category of regression, where we aim to predict a numeric target value (one column) from one or more features (one or more other columns). The other main category of problems is classification, which is just like regression except we aim to predict a categorical target value. For example, we might want to answer the question: How can we predict whether a house belongs in NY or SF from its beds, baths, price, year of construction, square footage, price per square foot, and elevation?
homes = pd.read_csv("homes.csv")
homes
|  | beds | bath | price | year_built | sqft | price_per_sqft | elevation | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 999000 | 1960 | 1000 | 999 | 10 | NY |
| 1 | 2.0 | 2.0 | 2750000 | 2006 | 1418 | 1939 | 0 | NY |
| 2 | 2.0 | 2.0 | 1350000 | 1900 | 2150 | 628 | 9 | NY |
| 3 | 1.0 | 1.0 | 629000 | 1903 | 500 | 1258 | 9 | NY |
| 4 | 0.0 | 1.0 | 439000 | 1930 | 500 | 878 | 10 | NY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 5.0 | 2.5 | 1800000 | 1890 | 3073 | 586 | 76 | SF |
| 488 | 2.0 | 1.0 | 695000 | 1923 | 1045 | 665 | 106 | SF |
| 489 | 3.0 | 2.0 | 1650000 | 1922 | 1483 | 1113 | 106 | SF |
| 490 | 1.0 | 1.0 | 649000 | 1983 | 850 | 764 | 163 | SF |
| 491 | 3.0 | 2.0 | 995000 | 1956 | 1305 | 762 | 216 | SF |
492 rows × 8 columns
Let's learn about decision trees, a machine learning algorithm that can be used for classification (and, as it turns out, regression too). Decision trees learn a nested if-then-else logical hierarchy to fit a training dataset. In the following visualization, the color and opacity of each box represents whether that subset of homes is more likely to be in NY or SF. The text inside each node indicates the following information:
- The first line shows the condition. If the condition is true, go left; if not, go right.
- The second line shows the percentage of samples represented by that node.
- The third line shows the proportion of homes within that sample that belong in ["NY", "SF"].
- The fourth line shows the majority class in that category, corresponding to the bigger number on the third line.
from sklearn.tree import DecisionTreeClassifier, plot_tree
X = homes.drop("city", axis=1)
y = homes["city"]
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
import matplotlib.pyplot as plt
plt.figure(dpi=300)
plot_tree(
clf,
feature_names=X.columns,
class_names=["NY", "SF"],
label="root",
filled=True,
impurity=False,
proportion=True,
rounded=False
);
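As a quick sanity check that isn't part of the original lesson, we can ask how often this depth-2 tree assigns the correct city to the homes it was trained on, using scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
# Fraction of training homes the depth-2 tree classifies correctly
print("Training accuracy:", accuracy_score(y, clf.predict(X)))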
We can also use this dataset for regression. Write a one-line expression to train a DecisionTreeRegressor model to predict the price of a home in this dataset from all other variables.
homes.drop("price", axis=1)
|  | beds | bath | year_built | sqft | price_per_sqft | elevation | city |
|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 1960 | 1000 | 999 | 10 | NY |
| 1 | 2.0 | 2.0 | 2006 | 1418 | 1939 | 0 | NY |
| 2 | 2.0 | 2.0 | 1900 | 2150 | 628 | 9 | NY |
| 3 | 1.0 | 1.0 | 1903 | 500 | 1258 | 9 | NY |
| 4 | 0.0 | 1.0 | 1930 | 500 | 878 | 10 | NY |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 5.0 | 2.5 | 1890 | 3073 | 586 | 76 | SF |
| 488 | 2.0 | 1.0 | 1923 | 1045 | 665 | 106 | SF |
| 489 | 3.0 | 2.0 | 1922 | 1483 | 1113 | 106 | SF |
| 490 | 1.0 | 1.0 | 1983 | 850 | 764 | 163 | SF |
| 491 | 3.0 | 2.0 | 1956 | 1305 | 762 | 216 | SF |
492 rows × 7 columns
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor().fit(homes.drop("price", axis=1), homes["price"])
reg
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_167/297773281.py in ?()
      1 from sklearn.tree import DecisionTreeRegressor
      2
----> 3 reg = DecisionTreeRegressor().fit(homes.drop("price", axis=1), homes["price"])
      4 reg
...
ValueError: could not convert string to float: 'NY'
# how to use "city" column?
homes_with_city_encoded = pd.get_dummies(homes, columns=["city"], dummy_na=True)
homes_with_city_encoded
|  | beds | bath | price | year_built | sqft | price_per_sqft | elevation | city_NY | city_SF | city_nan |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 999000 | 1960 | 1000 | 999 | 10 | True | False | False |
| 1 | 2.0 | 2.0 | 2750000 | 2006 | 1418 | 1939 | 0 | True | False | False |
| 2 | 2.0 | 2.0 | 1350000 | 1900 | 2150 | 628 | 9 | True | False | False |
| 3 | 1.0 | 1.0 | 629000 | 1903 | 500 | 1258 | 9 | True | False | False |
| 4 | 0.0 | 1.0 | 439000 | 1930 | 500 | 878 | 10 | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 5.0 | 2.5 | 1800000 | 1890 | 3073 | 586 | 76 | False | True | False |
| 488 | 2.0 | 1.0 | 695000 | 1923 | 1045 | 665 | 106 | False | True | False |
| 489 | 3.0 | 2.0 | 1650000 | 1922 | 1483 | 1113 | 106 | False | True | False |
| 490 | 1.0 | 1.0 | 649000 | 1983 | 850 | 764 | 163 | False | True | False |
| 491 | 3.0 | 2.0 | 995000 | 1956 | 1305 | 762 | 216 | False | True | False |
492 rows × 10 columns
reg = DecisionTreeRegressor(max_depth=3).fit(homes_with_city_encoded.drop(columns=["price"]), homes_with_city_encoded["price"])
reg
DecisionTreeRegressor(max_depth=3)
plt.figure(dpi=300)
plot_tree(
reg,
label="root",
filled=True,
impurity=False,
proportion=True,
rounded=False
);
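As a rough check that isn't in the original notebook, we can compare the regression tree's predicted prices against the actual prices on the training data, reusing the mean_squared_error function imported earlier (X_encoded is just an illustrative helper name):
# Evaluate the depth-3 regression tree on its own training data
X_encoded = homes_with_city_encoded.drop(columns=["price"])
print("Training MSE:", mean_squared_error(homes_with_city_encoded["price"], reg.predict(X_encoded)))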
Consider each of the following tasks and answer whether they would be best modeled as classification or regression.
Predict whether an email is spam or not spam.
Classification, since the target value is the category "spam" or "not spam".
Predict the number of views a video will receive based on subscriber count.
Regression, since the target value is a number.
Predict the next word to appear in a sentence.
Classification, since the target value is to choose from the dictionary of all possible next words.
If you're curious about how ChatGPT works, Jay Mody has a good introduction to GPT in 60 Lines of NumPy.