Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple modification of the gradient descent algorithm that we saw in the previous week's section. SGD also computes the gradient, but it updates the weight matrix W on small batches of training data instead of on the entire training set at once.

While this leads to "noisier" weight updates, it also allows us to take more steps along the gradient (one step per batch vs. one step per epoch), ultimately leading to faster convergence with no negative effect on loss or classification accuracy.

Here is a quick comparison table between Stochastic Gradient Descent and gradient descent:

| Algorithm | Time per iteration | Total time to convergence for large data (in theory) | Total time to convergence for large data (in practice) | Sensitivity to parameters |
|---|---|---|---|---|
| Gradient Descent | Slow for large data | Slower | Usually slower | Moderate |
| Stochastic Gradient Descent | Always fast | Faster | Usually faster | Very high |

Now, let's consider an example that illustrates the use of stochastic gradient descent for linear regression.

Our goal is to find the equation of the straight line $h_\theta(x) = \theta_0 + \theta_1 x$ that best fits our data points. The function that we are trying to minimize in this case is:

$J(\theta_0,\theta_1) = {1 \over 2m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)^2$

In this case, our gradient will be defined in two dimensions:

$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$

$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$

Recall the (batch) gradient descent algorithm that we followed (where $\alpha$ is a non-adapting step size):

    1:   Choose initial guess $x_0$
    2:   for k = 0, 1, 2, ... do
    3:       $s_k$ = -$\nabla f(x_k)$
    4:       $x_{k+1} = x_k + \alpha s_k$
    5:   end for
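
To make this concrete for our linear regression problem, here is a minimal NumPy sketch of batch gradient descent using the two partial derivatives above; the default step size and iteration count are placeholder assumptions, not tuned values:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.0001, num_iters=1000):
    """Full-batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                    # predictions on all m points
        theta0 -= alpha * np.sum(h - y) / m        # dJ/dtheta0 from above
        theta1 -= alpha * np.sum((h - y) * x) / m  # dJ/dtheta1 from above
    return theta0, theta1
```

Note that each iteration makes one full pass over the data before taking a single step, which is exactly why batch gradient descent is slow for large data sets.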

Here is the stochastic gradient descent algorithm for simple linear regression, where m is the size of the data set:

    1:   Randomly shuffle the data set
    2:   for k = 0, 1, 2, ... do
    3:       for i = 1 to m do
    4:            $\begin{bmatrix} \theta_0 \\ \theta_1 \\ \end{bmatrix}=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \end{bmatrix}-\alpha\begin{bmatrix} (h_\theta(x_i)-y_i) \\ x_i(h_\theta(x_i)-y_i) \\ \end{bmatrix}$
    5:       end for
    6:   end for

Typically, with stochastic gradient descent, you will run through the entire data set 1 to 10 times (see the value of k in line 2 of the pseudocode above), depending on how quickly the cost is converging and how large the data set is.

In the example below, we'll create a set of 50,000 points around the line $y = 7 + 3x + \epsilon$, for values of $x$ between 0 and 100:
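
A sketch of the data generation is below; the Gaussian noise scale and the random seed are assumptions:

```python
import numpy as np

np.random.seed(0)                  # fixed seed for reproducibility (an assumption)
m = 50000
x = np.random.uniform(0, 100, m)   # x values between 0 and 100
epsilon = 10 * np.random.randn(m)  # Gaussian noise; the scale of 10 is an assumption
y = 7 + 3 * x + epsilon            # points scattered around y = 7 + 3x
```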

First, let's randomly shuffle our dataset. Note that in this example this step isn't strictly necessary, since the data is already in a random order; however, that may not always be the case:
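
Here is one way to do it (a sketch that shuffles x and y with the same permutation, reusing the arrays defined above):

```python
# Shuffle x and y together so each (x_i, y_i) pair stays matched
shuffled = np.random.permutation(m)
x = x[shuffled]
y = y[shuffled]
```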

Now we'll set up our h function and our cost function, which we will use to check how the cost is improving.
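
A minimal sketch of these two helpers, matching the definitions of $h_\theta$ and $J$ above:

```python
def h(x, theta0, theta1):
    """Hypothesis: a straight line with intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

def cost(x, y, theta0, theta1):
    """J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    return np.sum((h(x, theta0, theta1) - y) ** 2) / (2 * len(x))
```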

Now we'll run our stochastic gradient descent algorithm. To see its progress, we'll take a cost measurement at every step. Every 10,000 steps, we'll average the cost over the last 10,000 steps and append that average to our cost_list variable. We will run through the entire data set 10 times here:
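
A sketch of this loop is below; the step size $\alpha$ and the use of the single-example squared error as the per-step cost measurement are assumptions:

```python
alpha = 0.0001                          # step size; the exact value is an assumption
theta0, theta1 = 0.0, 0.0
cost_list = []                          # one averaged cost value per 10,000 steps
running = []

for epoch in range(10):                 # run through the entire data set 10 times
    for i in range(m):
        error = h(x[i], theta0, theta1) - y[i]
        theta0 -= alpha * error         # stochastic gradient step for theta0
        theta1 -= alpha * error * x[i]  # stochastic gradient step for theta1
        running.append(error ** 2 / 2)  # per-step cost measurement (single example)
        if len(running) == 10000:       # average the last 10,000 measurements
            cost_list.append(np.mean(running))
            running = []

print(theta0, theta1)                   # should end up close to 7 and 3
```

Each inner-loop update is exactly line 4 of the SGD pseudocode above.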

As you can see, our values for $\theta_0$ and $\theta_1$ are close to their true values of 7 and 3.

Now, we plot our cost versus the number of iterations. As you can see, the cost goes down quickly at first, but starts to level off as we go through more iterations:
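
A plotting sketch with matplotlib (the axis labels are our own choice):

```python
import matplotlib.pyplot as plt

steps = np.arange(1, len(cost_list) + 1) * 10000  # each point summarizes 10,000 steps
plt.plot(steps, cost_list)
plt.xlabel("iteration")
plt.ylabel("average cost over the last 10,000 steps")
plt.show()
```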

Well, that was kind of cool, but I wish I didn't need to calculate the gradients myself before applying gradient descent or stochastic gradient descent. I wonder if there is a tool that can do that for whichever function I am interested in... PyTorch comes to the rescue!

So let's use the same example from above and use PyTorch to help us calculate the gradients automatically.
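
One way this could look with PyTorch's autograd is sketched below, reusing x and y from before; the step size and the one-example-at-a-time loop are assumptions, and looping in pure Python like this is slow, so treat it as an illustration of autograd rather than an efficient implementation:

```python
import torch

# Reuse the NumPy data from the earlier cells as float tensors
x_t = torch.tensor(x, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

theta0 = torch.zeros(1, requires_grad=True)
theta1 = torch.zeros(1, requires_grad=True)
alpha = 0.0001                             # step size; an assumption

for epoch in range(10):                    # 10 passes through the data
    for i in torch.randperm(len(x_t)):     # visit the examples in random order
        loss = (theta0 + theta1 * x_t[i] - y_t[i]) ** 2 / 2
        loss.backward()                    # autograd computes d(loss)/d(theta)
        with torch.no_grad():
            theta0 -= alpha * theta0.grad  # manual SGD update
            theta1 -= alpha * theta1.grad
            theta0.grad.zero_()            # clear gradients for the next step
            theta1.grad.zero_()

print(theta0.item(), theta1.item())
```

The key difference from the NumPy version is that we never write out the gradient formulas ourselves: loss.backward() fills in theta0.grad and theta1.grad for us.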

The loss above fluctuates quite a bit (it's noisy) due to the random choice of the data point $(x_i, y_i)$ at each iteration.

If you would like to learn more about PyTorch, feel free to check out its documentation and more tutorials here. We will discuss PyTorch in more depth in later weeks of the quarter when we talk about neural networks, so stay tuned and excited!