Machine Learning Practice

In this section, we will predict the magnitude of earthquakes using LinearRegression and DecisionTreeRegressor.

Dataset: Earthquakes and Countries

For today's activity, we will be utilizing the earthquake data from the lectures on pandas!

  • Earthquake: id (str), year (int), month (int), day (int), latitude (float), longitude (float), name (str), magnitude (float)
  • Countries: POP_EST (float), GDP_MD (int), CONTINENT (str), SUBREGION (str), geometry (geometry)
In [1]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error
In [3]:
# Load in Earthquake data
earthquakes = pd.read_csv("earthquakes.csv").set_index("id")
earthquakes = gpd.GeoDataFrame(
    earthquakes,
    # crs="EPSG:4326" specifies WGS84 or GPS coordinate system, see https://epsg.io/4326
    geometry=gpd.points_from_xy(earthquakes["longitude"], earthquakes["latitude"], crs="EPSG:4326")
)
earthquakes["month"] = earthquakes["month"].astype('category')

# Load in Country data
columns = ["POP_EST", "GDP_MD", "CONTINENT", "SUBREGION", "geometry"]
countries = gpd.read_file("ne_110m_admin_0_countries.shp").set_index("NAME")[columns]
earthquakes
Out[3]:
id          year  month  day  latitude   longitude    name        magnitude  geometry
nc72666881  2016  7      27   37.672333  -121.619000  California  1.43       POINT (-121.619 37.67233)
us20006i0y  2016  7      27   21.514600  94.572100    Burma       4.90       POINT (94.5721 21.5146)
nc72666891  2016  7      27   37.576500  -118.859167  California  0.06       POINT (-118.85917 37.5765)
nc72666896  2016  7      27   37.595833  -118.994833  California  0.40       POINT (-118.99483 37.59583)
nn00553447  2016  7      27   39.377500  -119.845000  Nevada      0.30       POINT (-119.845 39.3775)
...         ...   ...    ...  ...        ...          ...         ...        ...
nc72685246  2016  8      25   36.515499  -121.099831  California  2.42       POINT (-121.09983 36.5155)
ak13879193  2016  8      25   61.498400  -149.862700  Alaska      1.40       POINT (-149.8627 61.4984)
nc72685251  2016  8      25   38.805000  -122.821503  California  1.06       POINT (-122.8215 38.805)
ci37672328  2016  8      25   34.308000  -118.635333  California  1.55       POINT (-118.63533 34.308)
ci37672360  2016  8      25   34.119167  -116.933667  California  0.89       POINT (-116.93367 34.11917)

8394 rows × 8 columns

Before we create our models, let's take a look at where our earthquakes are located. This will be important later when we interpret the results of our models. Run the cell below to see our plot!

In [3]:
fig, ax = plt.subplots(figsize=(13, 5))
countries.plot(ax=ax, color="#EEE")
earthquakes.plot(ax=ax, column="magnitude", markersize=0.1, legend=True)
ax.set(title="Earthquakes between July 27, 2016 and August 25, 2016")
ax.set_axis_off()
[Figure: world map of earthquakes between July 27, 2016 and August 25, 2016, with points colored by magnitude]

Predicting Earthquake Magnitude

Now that we have a decent sense of the data we are working with, let's start our main task: predicting earthquake magnitudes using a LinearRegression model!

Iteration 1: Using longitude and latitude

For our first model, let's try using longitude and latitude. It seems the most intuitive, so let's start from there!

Linear Regression Using Longitude and Latitude

Below, fill in the code cells to perform each task.

First, we need to split our earthquake data into training and testing sets. Here, our

  • Features contain [longitude, latitude],
  • Label is magnitude,
  • test_size=0.2
In [5]:
# TODO: Split the Earthquake data into the training, testing sets
X = earthquakes[["longitude", "latitude"]]
y = earthquakes["magnitude"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
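
Because train_test_split shuffles the rows at random, the error values reported below will change slightly on every run. A minimal sketch of how to make the split repeatable follows. It assumes a ten-row stand-in table whose values are rounded from the preview above; the `_toy` names and the seed 42 are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the earthquakes table (values rounded from the preview)
toy = pd.DataFrame({
    "longitude": [-121.6, 94.6, -118.9, -119.0, -119.8, -121.1, -149.9, -122.8, -118.6, -116.9],
    "latitude": [37.7, 21.5, 37.6, 37.6, 39.4, 36.5, 61.5, 38.8, 34.3, 34.1],
    "magnitude": [1.43, 4.90, 0.06, 0.40, 0.30, 2.42, 1.40, 1.06, 1.55, 0.89],
})
X_toy = toy[["longitude", "latitude"]]
y_toy = toy["magnitude"]

# test_size=0.2 holds out 20% of the rows; a fixed random_state repeats the same shuffle
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(len(X_train_toy), len(X_test_toy))           # 8 training rows, 2 testing rows
print(X_test_toy.index.equals(split_b[1].index))   # True: both calls pick the same test rows
```

Without random_state, two calls would usually select different test rows, which is why the mean squared error can differ from the value shown in this notebook.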

Since we have now done our train-test split, all we need to do is fit and analyze the model!

In [7]:
# TODO: Fit the model
regr_coords = LinearRegression().fit(X_train, y_train)
y_predict = regr_coords.predict(X_test)
print("Mean squared error " + str(mean_squared_error(y_test, y_predict)))



# Do not alter
grid = sns.relplot(x=y_test, y=y_predict)
grid.set(title="Predicted Magnitude v. Observed Magnitude",
       xlabel="Observed Magnitude (test data)",
       ylabel="Predicted Magnitude (predictions)",
       yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')
Mean squared error 0.7892207373455713
Out[7]:
<matplotlib.lines.AxLine at 0x7e599a3687d0>
[Figure: scatter plot of predicted vs. observed magnitude for the linear regression model, with a dashed y = x reference line]
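
Mean squared error is measured in squared magnitude units, so it is often easier to read after taking a square root, and it says more when set next to a naive baseline. The sketch below is illustrative only: the helper `mse` and the two short lists (magnitudes copied from the table preview above) are assumptions, and it uses plain Python instead of sklearn so each step is visible.

```python
import math

def mse(observed, predicted):
    """Mean of squared differences, matching mean_squared_error with default arguments."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Root mean squared error for the linear model's MSE printed above
rmse_linear = math.sqrt(0.7892207373455713)
print(round(rmse_linear, 3))

# Naive baseline: always predict the mean magnitude of the training rows.
# These ten magnitudes are a toy stand-in, not the notebook's real split.
train_mags = [1.43, 4.90, 0.06, 0.40, 0.30]
test_mags = [2.42, 1.40, 1.06, 1.55, 0.89]
baseline = sum(train_mags) / len(train_mags)
baseline_mse = mse(test_mags, [baseline] * len(test_mags))
print(baseline, baseline_mse)
```

An RMSE of roughly 0.89 means the linear model's predictions are typically off by a bit under one magnitude unit. The toy baseline number is not comparable to the notebook's result; to make a fair comparison, compute the baseline from the real y_train and y_test.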

Iteration 2: Using longitude, latitude, name, and month

For our second model, let's add some of our categorical variables to longitude and latitude. Once name and month are included, a single straight-line relationship between the inputs (longitude, latitude, name, and month) and the output (magnitude) is unlikely to fit well. Instead, let's use a non-linear regression model: a DecisionTreeRegressor! Note that because name and month are not numeric, we'll need to create dummy variables so sklearn can handle them properly.

Decision Tree Regression Using Longitude, Latitude, Name, and Month

Below, fill in the code cells to perform each task.

First, as before, we need to select our features and label, and then do a train-test split. Here, our

  • Features contain [longitude, latitude, name, month],
  • Label is magnitude,
  • test_size=0.2

However, unlike before, we also need to create "dummy" variables (one-hot encode) for our features since they contain categorical data.
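
To see what one-hot encoding produces, here is a minimal sketch on a three-row toy frame (the rows are illustrative, loosely based on the preview above, and the variable names are assumptions). By default, pd.get_dummies leaves numeric columns alone and expands string and category columns into one indicator column per distinct value, which is presumably why month was cast to 'category' when the data was loaded.

```python
import pandas as pd

# Hypothetical three-row stand-in for the feature table
toy = pd.DataFrame({
    "longitude": [-121.619, -119.845, -149.8627],
    "name": ["California", "Nevada", "Alaska"],
    "month": [7, 7, 8],
})
toy["month"] = toy["month"].astype("category")

# name and month each become one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# If month stays an integer column, get_dummies passes it through unchanged
plain = pd.get_dummies(toy.assign(month=[7, 7, 8]))
print(list(plain.columns))
```

In the first result, name is replaced by name_Alaska, name_California, and name_Nevada, and month by month_7 and month_8, while longitude is untouched. In the second, month remains a single numeric column, so the tree would treat it as an ordered number.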

In [13]:
# TODO: Create "dummy" variables for categorical features
X = earthquakes[["longitude", "latitude", "name", "month"]]
Y = earthquakes["magnitude"]
X = pd.get_dummies(X)
In [14]:
# TODO: Split the Earthquake data into the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)

Now, let's fit and analyze the model!

In [16]:
# TODO: Fit the model
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Mean squared error " + str(mean_squared_error(y_test, predictions)))


# Do not alter
grid = sns.relplot(x=y_test, y=predictions)
grid.set(title="Observed Magnitude v. Predicted Magnitude",
       xlabel='Observed Magnitude (test data)',
       ylabel='Predicted Magnitude (predictions)',
       yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')
Mean squared error 0.4364461865528423
Out[16]:
<matplotlib.lines.AxLine at 0x7e599a258690>
[Figure: scatter plot of predicted vs. observed magnitude for the decision tree model, with a dashed y = x reference line]

Next, let's visualize the decision tree that we used for our model! It's okay if you don't understand all of the code. The plot gives a step-by-step view of how our model makes decisions. Since the max_depth argument of plot_tree is set to 2, only the top of the tree is drawn, so we don't see all of the decisions that our model makes. That's okay!

In [17]:
plt.figure(dpi=300)
plot_tree(
    model,
    feature_names=list(X.columns),  # a plain list is accepted across sklearn versions
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False,
    max_depth=2,
    fontsize=5
) and None # Hide return value of plot_tree
[Figure: the fitted decision tree, drawn to depth 2]
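
The max_depth=2 passed to plot_tree only limits how many levels are drawn. The DecisionTreeRegressor above was created with default settings, so the fitted tree itself can be much deeper. A minimal sketch on synthetic data (the random features, targets, and seeds are assumptions made for illustration) shows how to check the size of a fitted tree and how capping the model's own max_depth differs:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-180, 180, size=(200, 2))  # two made-up numeric features
y_demo = rng.uniform(0, 7, size=200)            # made-up targets

# Default settings: the tree keeps splitting until it can no longer reduce training error
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
# max_depth on the model itself limits how far the tree is allowed to grow
capped_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

for label, tree in [("default", full_tree), ("max_depth=2", capped_tree)]:
    train_mse = mean_squared_error(y_demo, tree.predict(X_demo))
    print(label, tree.get_depth(), tree.get_n_leaves(), train_mse)
```

The default tree reaches zero training error here by giving every training row its own leaf, even though the targets are pure noise. That is worth keeping in mind for the debrief: comparing training error with test error on the real data indicates how much of the tree's fit carries over to unseen earthquakes.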

Debrief

How did our models do at predicting the magnitude of our earthquakes?

Why do you think this was the case?