Stochastic Gradient Descent (SGD), a simple modification applied to the gradient descent algorithm that we saw from previous week's section. SGD also computes the gradient but updates the weight matrix W on small batches of training data, instead of the entire training set itself.
While this leads to “noiser” weight updates, it also allows us to take more steps along the gradient (1 step for each batch vs. 1 step per epoch), ultimately leading to faster convergence and no negative affects to loss and classification accuracy.
Here is a quick comparison table between Stochastic Gradient Descent and gradient descent:
Algorithm | Time per iter | Total time to convergence for large data in theory |
Total time to convergence for large data in practice |
Sensitivity to params |
---|---|---|---|---|
Gradient Descent |
Slow for large data | Slower | Usually slower | Moderate |
Stochastic Gradient |
Always fast | Faster | Usually faster | Very high |
Now, let's consider an example where we illustrate the use of stochastic gradient descent for linear regression.
Our goal is to find the equation of the straight line $h_\theta(x) = \theta_0 + \theta_1 x$ that best fits our data points. The function that we are trying to minimize in this case is:
$J(\theta_0,\theta_1) = {1 \over 2m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)^2$
In this case, our gradient will be defined in two dimensions:
$\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x_i)-y_i)$
$\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m ((h_\theta(x_i)-y_i) \cdot x_i)$
Then we followed this algorithm (where $\alpha$ was a non-adapting stepsize):
1: Choose initial guess $x_0$
2: for k = 0, 1, 2, ... do
3: $s_k$ = -$\nabla f(x_k)$
4: $x_{k+1} = x_k + \alpha s_k$
5: end for
Here is the stochastic gradient descent algorithm for simple linear regression, where m is the size of the data set:
1: Randomly shuffle the data set
2: for k = 0, 1, 2, ... do
3: for i = 1 to m do
4: $\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\end{bmatrix}=\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\end{bmatrix}-\alpha\begin{bmatrix}
2(h_\theta(x_i)-y_i) \\
2x_i(h_\theta(x_i)-y_i) \\
\end{bmatrix}$
5: end for
6: end for
Typically, with stochastic gradient descent, you will run through the entire data set 1 to 10 times (see value for k in line 2 of the pseudocode above), depending on how fast the data is converging and how large the data set is.
In the example below, we'll create a set of 50,000 points around the line 𝑦 = 7+ 3𝑥 + 𝜖 , for values of x between 0 and 100:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
f = lambda x: 7 + x * 3 + np.random.randn(len(x)) * 10
x = np.random.random(50000) * 100
y = f(x)
m = len(y)
batch = 100 # batch size
First, let's randomly shuffle around our dataset. Note that in this example, this step isn't strictly necessary since the data is already in a random order. However, that obviously may not always be the case:
Now we'll setup our h function and our cost function, which we will use to check how the value is improving.
h = lambda theta_0, theta_1, x: theta_0 + theta_1 * x
cost = lambda theta_0, theta_1, x_i, y_i: 0.5 * (h(theta_0, theta_1, x_i) - y_i) ** 2
Now we'll run our stochastic gradient descent algorithm. To see it's progress, we'll take a cost measurement at every step. Every 10,000 steps, we'll get an average cost from the last 10,000 steps and then append that to our cost_list variable. We will run through the entire list 10 times here:
theta_old = np.array([0.,0.])
theta_new = np.array([1.,1.]) # The algorithm starts at [1,1]
lr = 0.00005 # learning rate
batch = 1 # batch size
iter_num = 0
s_k = np.array([float("inf"), float("inf")])
sum_cost = 0
cost_list = []
for i in range(m * 20):
iter_num += 1
# random choice of matching x and y
idx = np.random.randint(0, m, size=batch)
x_i = x[idx] # random choice of x
y_i = y[idx] # random choice of y
theta_old = theta_new
pred_y = h(theta_old[0], theta_old[1], x_i)
s_k[0] = pred_y - y_i
s_k[1] = (pred_y - y_i) * x_i
s_k = (-1) * s_k
theta_new = theta_old + lr * s_k
# saved for plot
sum_cost += cost(theta_old[0], theta_old[1], x_i, y_i)
if (i + 1) % 10000 == 0:
cost_list.append(sum_cost / 10000.0)
sum_cost = 0
print("Local minimum occurs where:")
print("theta_0 =", theta_new[0])
print("theta_1 =", theta_new[1])
Local minimum occurs where: theta_0 = 7.07789735537464 theta_1 = 3.0368460823859147
As you can see, our values for $\theta_0$ and $\theta_1$ are close to their true values of 7 and 3.
Now, we plot our cost versus the number of iterations. As you can see, the cost goes down quickly at first, but starts to level off as we go through more iterations:
iterations = np.arange(len(cost_list)) * 10000
plt.plot(iterations,cost_list)
plt.xlabel("iterations")
plt.ylabel("avg cost")
plt.show()
Well, that was kind of cool, but I wish that I didn't need to calculate the gradients myself before I can apply gradient descent or stochastic gradient descent. I wonder if there is some tool that can do that for whichever functions that I am interested in... Pytorch comes to help!
So let's use the same example from above and use Pytorch to help us calculate gradients automatically.
import torch
torch.manual_seed(42)
lr = 0.00005 # learning rate or step size
# let's get our data points x and y into the expected format
x_tensor = torch.from_numpy(x).float()
y_tensor = torch.from_numpy(y).float()
# theta_0, theta_1
theta = torch.tensor([1., 1.], requires_grad=True, dtype=torch.float)
iter_num = 0
cost_list = []
# stopping condition based on iteration count
for i in range(m * 5):
iter_num += 1
# random choice of matching x and y
idx = np.random.randint(0, m, size=batch)
x_i = x_tensor[idx]
y_i = y_tensor[idx]
yhat = theta[0] + theta[1] * x_i
error = y_i - yhat
loss = torch.mul(0.5, (error ** 2)).mean()
sum_cost += float(loss.data)
loss.backward() # this is the magic
theta.data -= lr * theta.grad
theta.grad.detach()
theta.grad.zero_()
# saved for plot
if (i + 1) % 10000 == 0:
cost_list.append(sum_cost / 10000.0)
sum_cost = 0
print("Local minimum occurs where:")
print("theta_0 =", float(theta[0].data))
print("theta_1 =", float(theta[1].data))
Local minimum occurs where: theta_0 = 6.781114101409912 theta_1 = 2.968665599822998
iterations = np.arange(len(cost_list)) * 10000
plt.plot(iterations,cost_list)
plt.xlabel("iterations")
plt.ylabel("avg cost")
plt.show()
The loss above really fluctuates (noisy) due to random choices of x and y at each iteration (epoch).
If you would like to learn about Pytorch, feel free to check out its documentation and more tutorials here. We will be discussing more Pytorch in depth in later weeks of the quarter when we talk about neural networks, so stay tuned and excited!