Calculating Best-fit Lines in Python & Example
Note
We provide the function to calculate best-fit lines for you in the starter code. While you are encouraged to read through and understand this, it’s not required.
What is a best-fit line, and why do we want one? A large part of being a good data scientist is collecting and representing data in ways that are accessible and understandable to a broad audience. One common and important method of exploring data is calculating a line of best fit: a line through a group of data points that most closely fits the trend of the data. Once we have this line, we can plot it to make observations, interpolate between known data points, and extrapolate beyond them. Let’s look into how we can calculate these lines in Python!
Suppose we had three 2-dimensional points, (-1, -1), (0, 3), and (2, 2). We want to find a line that best fits these points. You can visualize this as finding the line that keeps the vertical distances between the points and the line as small as possible overall (formally, it minimizes the sum of the squares of those distances). As you hopefully recall from algebra, the equation for a line is typically represented as y = mx + b, where m is the slope and b is the y-intercept.
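To make "best fits" a little more concrete, here is a minimal sketch that scores a candidate line by adding up the squared vertical distances between each point and the line, which is the quantity the Least Squares method (see References) makes as small as possible. The function name sum_squared_error and the two candidate lines are made up for illustration; they are not part of the starter code.

```python
# Points from the example above, split into x values and y values.
xs = [-1, 0, 2]
ys = [-1, 3, 2]


def sum_squared_error(m, b, xs, ys):
    """Add up the squared vertical distance from each point to y = m*x + b."""
    total = 0
    for x, y in zip(xs, ys):
        predicted = m * x + b
        total += (y - predicted) ** 2
    return total


# Two hypothetical candidate lines, chosen only for comparison.
flat_line_error = sum_squared_error(0, 1, xs, ys)    # the line y = 1
sloped_line_error = sum_squared_error(1, 1, xs, ys)  # the line y = x + 1

print(flat_line_error, sloped_line_error)  # prints: 9 6
```

The sloped line has the smaller total, so it fits these three points better than the flat one. The best-fit line is the one with the smallest possible total.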
Instead of having you calculate a best-fit line by hand, we use a library to calculate it for us. numpy is a huge collection of functions and tools for doing a wide range of mathematical operations, including linear algebra. One of its functions, polyfit, takes in a list of x values and a list of y values and returns values we can use as b and m. With the example points above, we’d get those values using the following code:
from numpy.polynomial import polynomial as poly
xs = [-1, 0, 2]
ys = [-1, 3, 2]
b, m = poly.polyfit(xs, ys, 1)
A few things to note here:
- We import the polynomial module from the numpy.polynomial package and assign it to the name poly, which gives us a shorter name to type when calling its functions.
- Even though you might often be used to seeing a list of coordinates as x, y pairs (as shown a few paragraphs earlier), numpy expects those to be separated as a list of x values and a list of y values. (Conveniently, this is also what matplotlib expects.)
- The polyfit function takes three parameters. The first two are the lists of x and y values, respectively. The third parameter is the degree of the polynomial we want to fit. In this assignment, we’re keeping things relatively simple and only using a linear fit (a straight line), so we’ll set it to 1.
- The return values for polyfit are b and then m. This is not a typo. polyfit has far more capabilities than what we’re using it for in this assignment, and a side effect is that numpy orders the coefficients from the lowest power of x to the highest, so it sees the equation as y = b + mx. The equation is the same, but b comes before m.
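As a quick check of the return order described above, this short sketch runs polyfit on the three example points and prints both values. The variable names match the snippet earlier in this section; the print formatting is just one way to display the result.

```python
from numpy.polynomial import polynomial as poly

xs = [-1, 0, 2]
ys = [-1, 3, 2]

# polyfit returns coefficients from lowest degree to highest,
# so the intercept b comes first and the slope m comes second.
b, m = poly.polyfit(xs, ys, 1)

print(f"slope m = {m:.4f}, intercept b = {b:.4f}")
# prints: slope m = 0.7857, intercept b = 1.0714
```

Working the same fit out by hand gives exactly m = 11/14 and b = 15/14, which agrees with the rounded values printed here.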
Once we have the coefficients m and b for the best-fit line, we can use the equation y = mx + b to calculate the predicted consumption for any year. In doing so, we should get a plot that looks like the following:
To draw this line, we need to pass discrete values to matplotlib (unfortunately, we can’t just give it m and b and get a line out of it). We’ll do this in two steps:
- First, calculate the y values of the best-fit line for our existing x values, based on the results from numpy’s polyfit.
- Second, extend that line with the predicted values.
Using our list of x values, we can calculate the appropriate y on the best-fit line. In rough pseudocode, this looks like:
create a new list, prediction_ys
for each x in xs:
    y = m * x + b
    add y to prediction_ys
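One possible translation of that pseudocode into Python is sketched below. To keep the snippet self-contained, m and b are typed in as the exact slope and intercept for the three example points (11/14 and 15/14); in the assignment they would come from polyfit instead.

```python
xs = [-1, 0, 2]

# Slope and intercept for the example points. These are hard-coded here
# only so the snippet runs on its own; normally they come from polyfit.
m = 11 / 14
b = 15 / 14

# Calculate the y value on the best-fit line for each known x value.
prediction_ys = []
for x in xs:
    y = m * x + b
    prediction_ys.append(y)

print(prediction_ys)
```

The list ends up with one y value per x value, in the same order, which is the shape matplotlib needs for plotting.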
Then, we calculate the predictions in a similar manner. We first need to generate the x values that we want a prediction for, and then we can use the same equation we used for the best-fit line to calculate the y values for each prediction.
Remember that range excludes its end value, so to get the three x values that follow max(xs), we stop at max(xs) + 4 rather than max(xs) + 3.
prediction_xs = list(range(max(xs) + 1, max(xs) + 4))
for each x in prediction_xs:
    y = m * x + b
    add y to prediction_ys
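Putting the two steps together, a minimal runnable sketch might look like the following. As before, m and b are hard-coded to the exact values for the example points so the snippet stands alone, and the name all_xs is introduced here purely for illustration.

```python
xs = [-1, 0, 2]

# Hard-coded slope and intercept for the example points (an assumption
# for this sketch; in the assignment they come from polyfit).
m = 11 / 14
b = 15 / 14

# Step 1: y values on the best-fit line for the known x values.
prediction_ys = [m * x + b for x in xs]

# Step 2: extend the line. range stops before its second argument,
# so this produces max(xs) + 1, max(xs) + 2, and max(xs) + 3.
prediction_xs = list(range(max(xs) + 1, max(xs) + 4))
for x in prediction_xs:
    prediction_ys.append(m * x + b)

# Both lists must be the same length to be plotted against each other.
all_xs = xs + prediction_xs
print(all_xs)              # prints: [-1, 0, 2, 3, 4, 5]
print(len(prediction_ys))  # prints: 6
```

Appending to the same prediction_ys list is what lets a single plotted line cover both the known x values and the predicted ones.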
And then finally, we can plot the line with the predictions using matplotlib:
import matplotlib.pyplot as plt  # needed if plt has not been imported already
plt.plot(xs + prediction_xs, prediction_ys, label='best fit prediction')
References
The original data for this assignment was downloaded from Kaggle:
https://www.kaggle.com/datasets/sergegeukjian/fish-and-overfishing
(That Kaggle entry is itself a repost of data from the “rfordatascience” project, which has much easier to read descriptions of the data.)
The *actual* provenance of the data is a variety of sources collated by ourworldindata.org:
https://ourworldindata.org/fish-and-overfishing
Each of the charts on that site has links to detailed descriptions of the data as well as references to original sources and raw data downloads.
The site has a lot more data than what we’ve included in this assignment, some of which might be interesting. For example:
- https://ourworldindata.org/grapher/regulation-illegal-fishing
- https://ourworldindata.org/grapher/wild-fish-allocation
- https://ourworldindata.org/grapher/fish-landings-and-discards
- https://ourworldindata.org/grapher/bottom-trawling
Others are more summary-oriented and may be useful as additional “background” for the assignment. For example:
- https://ourworldindata.org/grapher/fish-stocks-within-sustainable-levels
- https://ourworldindata.org/grapher/global-aquaculture-wild-fish-feed
- https://ourworldindata.org/grapher/employed-fisheries-aquaculture-time
The “best-fit” calculation is done using a method called “Least Squares”. The result of that calculation is then used in Polynomial Regression. Setting polyfit’s “deg” parameter (the third in our usage here) to 1 effectively makes it do a Linear Regression.
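For the curious, the degree-1 case has a well-known closed-form solution that can be written in plain Python. The sketch below is not how the starter code or numpy computes the fit; it is only an illustration, and the function name least_squares_line is made up for this example.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope: how x and y vary together, divided by how much x varies.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    m = numerator / denominator

    # The least-squares line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


m, b = least_squares_line([-1, 0, 2], [-1, 3, 2])
print(f"m = {m:.4f}, b = {b:.4f}")  # prints: m = 0.7857, b = 1.0714
```

For the three example points this gives m = 11/14 and b = 15/14, the same line a degree-1 polyfit produces.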
These are rather advanced topics, so we do not expect you to read further into them. But the reading is provided here should you be interested, or at least want to know the formal terms for what we’ve done.