Stochastic Gradient Descent

Stochastic Gradient Descent (SGD), a simple modification applied to the gradient descent algorithm that we saw from previous week's section. SGD also computes the gradient but updates the weight matrix W on small batches of training data, instead of the entire training set itself.

While this leads to “noiser” weight updates, it also allows us to take more steps along the gradient (1 step for each batch vs. 1 step per epoch), ultimately leading to faster convergence and no negative affects to loss and classification accuracy.

Here is a quick comparison table between Stochastic Gradient Descent and gradient descent:

Algorithm Time per iter Total time to convergence
for large data in theory
Total time to convergence
for large data in practice
Sensitivity to params
Gradient
Descent
Slow for large data Slower Usually slower Moderate
Stochastic
Gradient
Always fast Faster Usually faster Very high

Now, let's consider an example where we illustrate the use of stochastic gradient descent for linear regression.

Our goal is to find the equation of the straight line $h_\theta(x) = \theta_0 + \theta_1 x$ that best fits our data points. The function that we are trying to minimize in this case is:

$J(\theta_0,\theta_1) = {1 \over 2m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)^2$

In this case, our gradient will be defined in two dimensions:

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$

Then we followed this algorithm (where $\alpha$ was a non-adapting stepsize):

    1:   Choose initial guess $x_0$
    2:   for k = 0, 1, 2, ... do
    3:       $s_k$ = -$\nabla f(x_k)$
    4:       $x_{k+1} = x_k + \alpha s_k$
    5:   end for

Here is the stochastic gradient descent algorithm for simple linear regression, where m is the size of the data set:

    1:   Randomly shuffle the data set
    2:   for k = 0, 1, 2, ... do
    3:       for i = 1 to m do
    4:            $\begin{bmatrix} \theta_0 \\ \theta_1 \\ \end{bmatrix}=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \end{bmatrix}-\alpha\begin{bmatrix} 2(h_\theta(x_i)-y_i) \\ 2x_i(h_\theta(x_i)-y_i) \\ \end{bmatrix}$
    5:       end for
    6:   end for

Typically, with stochastic gradient descent, you will run through the entire data set 1 to 10 times (see value for k in line 2 of the pseudocode above), depending on how fast the data is converging and how large the data set is.

In the example below, we'll create a set of 50,000 points around the line 𝑦 = 7+ 3𝑥 + 𝜖 , for values of x between 0 and 100:

In [3]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [4]:
np.random.seed(42)
x = np.random.random(50000) * 100
noise = np.random.randn(len(x)) * 10
f = lambda x: 7 + x * 3 + noise

y = f(x) 
m = len(y)
batch = 100 # batch size

First, let's randomly shuffle around our dataset. Note that in this example, this step isn't strictly necessary since the data is already in a random order. However, that obviously may not always be the case:

Now we'll setup our h function and our cost function, which we will use to check how the value is improving.

In [5]:
h = lambda theta_0, theta_1, x: theta_0 + theta_1 * x
cost = lambda theta_0, theta_1, x_i, y_i: 0.5 * (h(theta_0, theta_1, x_i) - y_i) ** 2

Now we'll run our stochastic gradient descent algorithm. To see it's progress, we'll take a cost measurement at every step. Every 10,000 steps, we'll get an average cost from the last 10,000 steps and then append that to our cost_list variable. We will run through the entire list 10 times here:

In [6]:
theta_old = np.array([0.,0.])
theta_new = np.array([1.,1.]) # The algorithm starts at [1,1]
lr = 0.00005 # learning rate
batch = 1 # batch size

iter_num = 0
s_k = np.array([float("inf"), float("inf")])
sum_cost = 0
cost_list = []

for i in range(m * 20):
    iter_num += 1
    # random choice of matching x and y
    idx = np.random.randint(0, m, size=batch)
    x_i = x[idx]  # random choice of x
    y_i = y[idx]  # random choice of y
    
    theta_old = theta_new
    pred_y = h(theta_old[0], theta_old[1], x_i)
    s_k[0] = pred_y - y_i
    s_k[1] = (pred_y - y_i) * x_i
    s_k = (-1) * s_k
    theta_new = theta_old + lr * s_k
    
    # saved for plot
    sum_cost += cost(theta_old[0], theta_old[1], x_i, y_i)
    if (i + 1) % 10000 == 0:
        cost_list.append(sum_cost / 10000.0)
        sum_cost = 0   
            
print("Local minimum occurs where:")
print("theta_0 =", theta_new[0])
print("theta_1 =", theta_new[1])
Local minimum occurs where:
theta_0 = 7.07789735537464
theta_1 = 3.0368460823859147

As you can see, our values for $\theta_0$ and $\theta_1$ are close to their true values of 7 and 3.

Now, we plot our cost versus the number of iterations. As you can see, the cost goes down quickly at first, but starts to level off as we go through more iterations:

In [7]:
iterations = np.arange(len(cost_list)) * 10000
plt.plot(iterations,cost_list)
plt.xlabel("iterations")
plt.ylabel("avg cost")
plt.show()

Well, that was kind of cool, but I wish that I didn't need to calculate the gradients myself before I can apply gradient descent or stochastic gradient descent. I wonder if there is some tool that can do that for whichever functions that I am interested in... Pytorch comes to help!

So let's use the same example from above and use Pytorch to help us calculate gradients automatically.

In [8]:
import torch

torch.manual_seed(42)

lr = 0.00005 # learning rate or step size

# let's get our data points x and y into the expected format
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()

# theta_0, theta_1
theta = torch.tensor([1., 1.], requires_grad=True, dtype=torch.float)

iter_num = 0
cost_list = []

# stopping condition based on iteration count
for i in range(m * 5):
    iter_num += 1
    # random choice of matching x and y
    idx = np.random.randint(0, m, size=batch)
    x_i = x_tensor[idx]
    y_i = y_tensor[idx]
    
    yhat = theta[0] + theta[1] * x_i
    error = y_i - yhat
    loss = (error ** 2).mean()
    sum_cost += float(loss.data)

    loss.backward() # this is the magic

    theta.data -= lr * theta.grad
    theta.grad.detach()
    theta.grad.zero_()
    
    # saved for plot
    if (i + 1) % 10000 == 0:
        cost_list.append(sum_cost / 10000.0)
        sum_cost = 0
            
print("Local minimum occurs where:")
print("theta_0 =", float(theta[0].data))
print("theta_1 =", float(theta[1].data))
Local minimum occurs where:
theta_0 = 7.0001702308654785
theta_1 = 2.9323341846466064
In [9]:
iterations = np.arange(len(cost_list)) * 10000
plt.plot(iterations,cost_list)
plt.xlabel("iterations")
plt.ylabel("avg cost")
plt.show()

The loss above really fluctuates (noisy) due to random choices of x and y at each iteration (epoch).

If you would like to learn about Pytorch, feel free to check out its documentation and more tutorials here. We will be discussing more Pytorch in depth in later weeks of the quarter when we talk about neural networks, so stay tuned and excited!

In [ ]: