CSE 546 Review Session

October 8: Bias-variance trade-off and L1/L2 regularization comparison

In this review session, we'll work through an example to review the bias-variance trade-off in machine learning. We will then look at some similarities and differences between L1 and L2 regularization.

Bias-Variance Trade-Off

For this example, assume pairs of data $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$ are generated from the model $y_i = f^\ast(x_i) + \epsilon_i$. Each $\epsilon_i$ is independently drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. For these examples, we will fix $\sigma = 0.1$ and $f^\ast(x) = 3 x^3 - x$, and we assume each $x_i$ is drawn from a uniform distribution between 0 and 1.

The code below generates a scatter plot of 10 datapoints drawn from this distribution.

In [4]:
import numpy as np
import matplotlib.pyplot as plt

sigma = 0.1

def f_star(x):
    return 3 * x**3 - x

def make_data(N):
    """Generates N pairs of data according to y = f*(x) + epsilon"""
    x = np.random.rand(N)
    noise = sigma * np.random.randn(N)
    y = f_star(x) + noise
    return (x, y)

# Generate data:
exampleData = make_data(10)

# Plot f_star:
domainSamples = np.linspace(0, 1)
plt.plot(domainSamples, f_star(domainSamples), 'r')

# Plot data:
plt.scatter(exampleData[0], exampleData[1])

# Format/show plot:
plt.xlabel('x')
plt.ylabel('y')
plt.legend(('f*', 'Samples'), loc='upper left')
plt.show()

Using this model, we will now explore the bias-variance tradeoff using polynomial regression. Given some training data and an order k, we will learn a kth order polynomial function $\hat{f}$ that maps each $x_i$ to a prediction $\hat{y}_i$. The following code illustrates polynomial regression, learning a 4th degree polynomial to fit our 10 example data points.

In [5]:
def expand(x, order):
    """Returns a 2d array with (order + 1) columns.
    Column i is equal to x raised to the power i"""
    cols = []
    for i in range(order + 1):
        cols.append(x ** i)
    return np.column_stack(cols)

def polynomial_regression(data, order):
    """Returns a prediction function learned by polynomial regression of the given order"""
    x, y = data
    lsResult = np.linalg.lstsq(expand(x, order), y)
    lsCoefficients = lsResult[0]
    return lambda v: np.dot(expand(v, order), lsCoefficients)

f_hat = polynomial_regression(exampleData, order=4)

# Plot f_hat:
plt.plot(domainSamples, f_hat(domainSamples), 'g')

# Plot data:
plt.scatter(exampleData[0], exampleData[1])

# Format/show plot:
plt.xlabel('x')
plt.ylabel('y')
plt.legend(('f_hat', 'Samples'), loc='upper left')
plt.show()

Minimizing Expected Error

Now suppose that our goal is to minimize the expected squared error on some test data. That is, we want to find an $\hat{f}$ that minimizes $R(\hat{f}) = \mathbb{E}_X[(\hat{f}(x) - y)^2] = \mathbb{E}_X[(\hat{f}(x) - f^\ast(x) - \epsilon)^2]$, where the expectation is taken over the distribution of test points $X$ (and the corresponding noise $\epsilon$).

Since $\epsilon$ is independent of $x$ and has mean 0, we can expand $R$ to obtain $R(\hat{f}) = \mathbb{E}_X[(\hat{f}(x) - f^\ast(x))^2] + \sigma^2$. Because the true model is noisy, no matter what we choose for $\hat{f}$, we can never expect a loss less than $\sigma^2$.
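Writing out the square makes this explicit:

$\mathbb{E}_X[(\hat{f}(x) - f^\ast(x) - \epsilon)^2] = \mathbb{E}_X[(\hat{f}(x) - f^\ast(x))^2] - 2\,\mathbb{E}_X[(\hat{f}(x) - f^\ast(x))\,\epsilon] + \mathbb{E}[\epsilon^2]$

The cross term vanishes because $\epsilon$ is independent of $x$ and $\mathbb{E}[\epsilon] = 0$, while $\mathbb{E}[\epsilon^2] = \sigma^2$ since $\epsilon$ has mean 0 and standard deviation $\sigma$.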

In machine learning, we learn $\hat{f}$ from data, usually by minimizing the empirical risk on some training data. We assume our training dataset has been created for us by generating random i.i.d. samples $(x_i, y_i)$, and as a result, $\hat{f}$ itself is also random. Thus, the quantity we're actually interested in is $\mathbb{E}_D [\mathbb{E}_X [(\hat{f}(x) - f^*(x))^2]] + \sigma^2$, taking the expected value over a distribution on training data $D$ as well.

In class we saw that the first term can be decomposed into a bias and variance term. Letting $h(x) = \mathbb{E}_D[\hat{f}(x)]$, the function we expect to learn, we end up with

$\mathbb{E}_D [R(\hat{f})] = \mathbb{E}_D \mathbb{E}_X [(f^\ast(x) - h(x))^2] + \mathbb{E}_D \mathbb{E}_X [(\hat{f}(x) - h(x))^2] + \sigma^2$

This corresponds to a (squared) bias term, a variance term, and an irreducible noise term we cannot avoid.
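To make the decomposition concrete, here is a small sketch (not part of the original notebook) that estimates the bias and variance terms by simulation, using the make_data, polynomial_regression, and f_star functions defined above. The number of simulated training sets, the training-set size, the model order, and the evaluation grid are all arbitrary choices; averaging over a uniform grid on [0, 1] approximates $\mathbb{E}_X$ because $x$ is uniform on that interval.

In []:
# Sketch: estimate the bias^2 and variance terms by simulating many training sets.
# numFits, N, the model order, and the grid resolution are arbitrary choices.
numFits = 500
N = 10
xGrid = np.linspace(0, 1, 200)

predictions = np.zeros((numFits, len(xGrid)))
for t in range(numFits):
    f_hat_t = polynomial_regression(make_data(N), order=1)
    predictions[t] = f_hat_t(xGrid)

h = predictions.mean(axis=0)                     # h(x) = E_D[f_hat(x)]
biasSquared = np.mean((f_star(xGrid) - h) ** 2)  # ~ E_X[(f*(x) - h(x))^2]
variance = np.mean((predictions - h) ** 2)       # ~ E_D E_X[(f_hat(x) - h(x))^2]

print "Estimated bias^2:  ", biasSquared
print "Estimated variance:", variance
print "bias^2 + variance + sigma^2:", biasSquared + variance + sigma ** 2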

A Biased Hypothesis Class

Let's explore the case where our model for $\hat{f}$ is not complex enough to match the true function $f^\ast$. Specifically, we'll restrict $\hat{f}$ to the form $\hat{f}(x) = mx + b$ by learning only a first degree polynomial. We'll vary the size of the training set, and for each $\hat{f}$, estimate $R(\hat{f})$ empirically with some test data.

In [7]:
# Some test data to empirically evaluate squared error:
testData = make_data(10000)

def get_mse(predictions, testData):
    """Evaluates mean squared error between predictions and test labels"""
    yTest = testData[1]
    return sum((predictions - yTest) ** 2)/len(yTest)

# Learn a linear model with increasing amounts of training data:
N_ar = range(2, 1000)
mse_ar = []
for N in N_ar:
    # Generate N data points:
    trainData = make_data(N) 
    
    # Perform linear regression:
    f_hat = polynomial_regression(trainData, order=1)
    
    # Evaluate:
    mse = get_mse(f_hat(testData[0]), testData)
    mse_ar.append(mse)
        
plt.semilogx(N_ar, mse_ar)
plt.ylim(0, 1.1*max(mse_ar))
plt.xlabel('Size of Training Set')
plt.ylabel('MSE')
plt.title('Order-1 Polynomial Regression')
plt.show()

print "Error with", N_ar[-1], "training samples:", mse_ar[-1]
Error with 999 training samples: 0.126959275764

Note that even as the size of the training set becomes large, our expected error does not fall below 0.12. This is far from the optimal $\hat{f} = f^\ast$, where the error would approach $\sigma^2 = 0.01$. This result is due to bias. Our model is not complex enough to accurately approximate $f^\ast$.

Key idea: Even as the amount of training data becomes large, we are still far from optimal performance because of bias.
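As a quick contrast (a sketch, not from the original notebook; the training-set size is an arbitrary choice), an order-3 polynomial can represent $f^\ast$ exactly, so with plenty of data its test error should approach $\sigma^2 = 0.01$:

In []:
# Sketch: an unbiased model (order 3 matches f*) with plenty of training data
# should approach the irreducible error sigma^2 = 0.01.
f_hat_order3 = polynomial_regression(make_data(1000), order=3)
print "Order-3 MSE with 1000 training samples:", get_mse(f_hat_order3(testData[0]), testData)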

A More Complex Hypothesis Class

Next we will learn a 4th degree polynomial, but restrict our training set size to only 10 data points. Now the bias is 0 (a 4th degree polynomial can represent the cubic $f^\ast$ exactly), but we are learning 5 coefficients from only 10 data points. For comparison, we will also learn a 1st degree polynomial as before. How do you think the results will compare?

In [11]:
# Learn a kth degree polynomial with 10 data points
mse_ar_order1 = []
mse_ar_orderk = []
k = 4 # order of more complex polynomial
N = 10
numTrials = 1000
for trial in range(numTrials):
    # Generate N data points:
    trainData = make_data(N) 
    
    # Perform polynomial regression (order 1 and order k):
    f_hat_order1 = polynomial_regression(trainData, order=1)
    f_hat_orderk = polynomial_regression(trainData, order=k)
    
    # Evaluate:
    mse_order1 = get_mse(f_hat_order1(testData[0]), testData)
    mse_ar_order1.append(mse_order1)
    
    mse_orderk = get_mse(f_hat_orderk(testData[0]), testData)
    mse_ar_orderk.append(mse_orderk)
 
meanError_order1 = sum(mse_ar_order1)/numTrials
meanError_orderk = sum(mse_ar_orderk)/numTrials
 
print "Average MSE, 1st degree polynomial:", meanError_order1
print "Average MSE, kth degree polynomial:", meanError_orderk
Average MSE, 1st degree polynomial: 0.173445439906
Average MSE, kth degree polynomial: 1.20624934823

Whoa, the 1st degree polynomial is still better. Why? Even though the more complex model now has 0 bias, it has large variance. Training on only 10 data points, we still expect the learned model $\hat{f}$ to differ greatly from $f^\ast$, even more so than a simpler but biased model. Try increasing the number of training examples N to see what happens.

Key idea: Increasing the complexity of our model may decrease error due to bias. However, without enough training data, we may also be increasing error due to variance! Achieving the right balance is an important goal (and usually taken care of by cross validation).
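As a sketch of how that balance might be chosen (a single held-out validation split here stands in for full cross-validation; the split sizes, candidate orders, and random seed are arbitrary choices, not part of the original notebook):

In []:
# Sketch: pick the polynomial order using a held-out validation split.
# Split sizes, the candidate orders, and the random seed are arbitrary choices.
np.random.seed(0)
data = make_data(30)
xTrain, yTrain = data[0][:20], data[1][:20]
xVal, yVal = data[0][20:], data[1][20:]

bestOrder, bestMse = None, np.inf
for order in range(1, 8):
    f_hat_candidate = polynomial_regression((xTrain, yTrain), order)
    valMse = get_mse(f_hat_candidate(xVal), (xVal, yVal))
    if valMse < bestMse:
        bestOrder, bestMse = order, valMse

print "Selected polynomial order:", bestOrder, "with validation MSE:", bestMse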

L1 vs. L2 Regularization

Least squares with L1 regularization (Lasso): $\min_{\mathbf{w}, w_0} \sum_{i=1}^N (\mathbf{w}^T \mathbf{x}_i + w_0 - y_i)^2 + \lambda ||\mathbf{w}||_1$

Least squares with L2 regularization (Ridge Regression): $\min_{\mathbf{w}, w_0} \sum_{i=1}^N (\mathbf{w}^T \mathbf{x}_i + w_0 - y_i)^2 + \lambda ||\mathbf{w}||^2_2$
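Unlike the L1 objective, which is typically solved iteratively (e.g. by coordinate descent), the L2 objective has a closed-form solution. The sketch below (not part of the original notebook) solves it on the polynomial features from expand(); for brevity it also penalizes the intercept coefficient, which the objective above leaves unpenalized, and the choice $\lambda = 0.1$ is arbitrary.

In []:
# Sketch: ridge regression on polynomial features via the closed-form solution
# w = (Z^T Z + lambda*I)^{-1} Z^T y. Note this version also penalizes the
# intercept coefficient (the x^0 column), unlike the objective above.
def ridge_regression(data, order, lam):
    x, y = data
    Z = expand(x, order)
    w = np.linalg.solve(Z.T.dot(Z) + lam * np.eye(Z.shape[1]), Z.T.dot(y))
    return lambda v: np.dot(expand(v, order), w)

f_hat_ridge = ridge_regression(make_data(10), order=4, lam=0.1)
print "Ridge (lambda = 0.1) test MSE:", get_mse(f_hat_ridge(testData[0]), testData)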

L1, L2, or both:

Answers:

Key idea: When there are "too many" parameters to estimate, we usually assume some "structure" to direct our solution. With Lasso, this structure is sparsity. With ridge regression, the structure is small weights. In both cases, we avoid overfitting.
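To see the two kinds of structure side by side, here is a hedged sketch that fits both penalties to polynomial features, assuming scikit-learn is available (it is not imported elsewhere in this notebook); the feature order and regularization strengths are arbitrary choices.

In []:
# Sketch: contrast the learned coefficients under L1 and L2 penalties.
# Assumes scikit-learn is installed; alpha values and the feature order are arbitrary.
from sklearn.linear_model import Lasso, Ridge

data = make_data(20)
Z = expand(data[0], 6)[:, 1:]  # x^1 ... x^6; the intercept is fit separately
y = data[1]

lassoModel = Lasso(alpha=0.01).fit(Z, y)
ridgeModel = Ridge(alpha=0.01).fit(Z, y)

print "Lasso coefficients:", lassoModel.coef_  # typically contains exact zeros (sparse)
print "Ridge coefficients:", ridgeModel.coef_  # typically all nonzero, but shrunk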
