In this notebook we build on the previous notebook and show how to build, train, and evaluate neural networks in PyTorch.
Here we import torch, define a helper function for visualization, and create a dataset for later use.
import torch
import torch.nn as nn
import torch.nn.functional as F
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
torch.manual_seed(446)
np.random.seed(446)
def visualize_fun(w, title, num_pts=20):
    x1, x2 = np.meshgrid(np.linspace(-2, 2, num_pts), np.linspace(-2, 2, num_pts))
    X_plane = torch.tensor(np.stack([np.reshape(x1, (num_pts**2)), np.reshape(x2, (num_pts**2))], axis=1)).float()
    y_plane = np.reshape((X_plane @ w).detach().numpy(), (num_pts, num_pts))
    # add_subplot(projection='3d') replaces the deprecated figure().gca(projection='3d')
    plt3d = plt.figure().add_subplot(projection='3d')
    plt3d.plot_surface(x1, x2, y_plane, alpha=0.2)
    ax = plt.gca()
    ax.scatter(X[:, 0].numpy(), X[:, 1].numpy(), y.numpy(), c='r', marker='o')
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('$X_2$')
    ax.set_zlabel('$Y$')
    plt.title(title)
    plt.show()
# make a simple linear dataset with some noise
d = 2
n = 50
X = torch.randn(n,d)
true_w = torch.tensor([[-1.0], [2.0]])
y = X @ true_w + torch.randn(n,1) * 0.1
print('X shape', X.shape)
print('y shape', y.shape)
print('w shape', true_w.shape)
X shape torch.Size([50, 2])
y shape torch.Size([50, 1])
w shape torch.Size([2, 1])
Module is PyTorch's way of performing operations on tensors. Modules are implemented as subclasses of the torch.nn.Module class. All modules are callable and can be composed together to create complex functions.
Note: most of the functionality implemented for modules can be accessed in a functional form via torch.nn.functional, but these require you to create and manage the weight tensors yourself.
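For instance, here is a quick sketch of the functional style: we apply a linear map and a ReLU with torch.nn.functional, managing the weight and bias tensors ourselves (the shapes below are arbitrary, chosen just for illustration).
W = torch.randn(4, 3)          # weight we manage ourselves, shape (out_features, in_features)
b = torch.zeros(4)             # bias we manage ourselves
x = torch.randn(2, 3)          # a batch of 2 inputs
h = F.relu(F.linear(x, W, b))  # same computation as nn.Linear(3, 4) followed by nn.ReLU()
print(h.shape)                 # torch.Size([2, 4])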
The bread and butter of modules is the Linear module, which performs a linear transformation with a bias. It takes the input and output dimensions as parameters and creates the weights in the object.
Unlike how we initialized our $w$ manually, the Linear module automatically initializes the weights randomly. For minimizing non-convex loss functions (e.g. training neural networks), initialization is important and can affect results. If training isn't working as well as expected, one thing to try is manually initializing the weights to something different from the default. PyTorch implements some common initializations in torch.nn.init.
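As a small sketch of how that can be done (the choice of initializer here is arbitrary and only for illustration), we re-initialize a Linear layer's parameters in place with torch.nn.init:
layer = nn.Linear(3, 4)
nn.init.xavier_uniform_(layer.weight)  # in-place Xavier/Glorot uniform initialization of the weights
nn.init.zeros_(layer.bias)             # in-place zero initialization of the bias
print(layer.weight)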
d_in = 3
d_out = 4
linear_module = nn.Linear(d_in, d_out)
example_tensor = torch.tensor([[1.,2,3], [4,5,6]])
# applies a linear transformation to the data
transformed = linear_module(example_tensor)
print('example_tensor', example_tensor.shape)
print('transformed', transformed.shape)
print()
print('We can see that the weights exist in the background\n')
print('W:', linear_module.weight)
print('b:', linear_module.bias)
example_tensor torch.Size([2, 3])
transformed torch.Size([2, 4])

We can see that the weights exist in the background

W: Parameter containing:
tensor([[ 0.3270,  0.2183,  0.2269],
        [-0.5094, -0.4306,  0.2483],
        [-0.0776, -0.5372,  0.0966],
        [-0.1610,  0.2270, -0.0063]], requires_grad=True)
b: Parameter containing:
tensor([ 0.1384, -0.1959, -0.2587,  0.0353], requires_grad=True)
PyTorch implements a number of activation functions, including but not limited to ReLU, Tanh, and Sigmoid. Since they are modules, they need to be instantiated.
activation_fn = nn.ReLU() # we instantiate an instance of the ReLU module
example_tensor = torch.tensor([-1.0, 1.0, 0.0])
activated = activation_fn(example_tensor)
print('example_tensor', example_tensor)
print('activated', activated)
example_tensor tensor([-1.,  1.,  0.])
activated tensor([0., 1., 0.])
Many times, we want to compose Modules together. torch.nn.Sequential provides a good interface for composing simple modules.
d_in = 3
d_hidden = 4
d_out = 1
model = torch.nn.Sequential(
    nn.Linear(d_in, d_hidden),
    nn.Tanh(),
    nn.Linear(d_hidden, d_out),
    nn.Sigmoid()
)
example_tensor = torch.tensor([[1.,2,3],[4,5,6]])
transformed = model(example_tensor)
print('transformed', transformed.shape)
transformed torch.Size([2, 1])
Note: we can access all of the parameters (of any nn.Module) with the parameters() method.
params = model.parameters()
for param in params:
    print(param)
Parameter containing:
tensor([[ 0.5478, -0.5734,  0.2589],
        [ 0.5739, -0.4392, -0.0377],
        [ 0.2290,  0.0529,  0.4021],
        [ 0.3153, -0.4802,  0.3067]], requires_grad=True)
Parameter containing:
tensor([ 0.4905,  0.3743,  0.4069, -0.2514], requires_grad=True)
Parameter containing:
tensor([[-0.1443, -0.1406,  0.0414, -0.4699]], requires_grad=True)
Parameter containing:
tensor([0.2149], requires_grad=True)
PyTorch implements many common loss functions, including MSELoss and CrossEntropyLoss.
mse_loss_fn = nn.MSELoss()
input = torch.tensor([[0., 0, 0]])
target = torch.tensor([[1., 0, -1]])
loss = mse_loss_fn(input, target)
print(loss)
tensor(0.6667)
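As a sanity check, the value above is just the mean of the per-element squared errors, which we can also compute directly from the tensors defined in the previous cell:
# ((0 - 1)^2 + (0 - 0)^2 + (0 - (-1))^2) / 3 = 2/3
manual_mse = ((input - target) ** 2).mean()
print(manual_mse)  # tensor(0.6667)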
PyTorch implements a number of gradient-based optimization methods in torch.optim, including Gradient Descent. At a minimum, an optimizer takes in the model parameters and a learning rate.
Optimizers do not compute the gradients for you, so you must call backward() yourself. You also must call optim.zero_grad() before calling backward(), since by default PyTorch does an in-place add to the .grad member variable rather than overwriting it. zero_grad() does both the detach_() and zero_() calls on all of the parameters' grad variables.
# create a simple model
model = nn.Linear(1, 1)
# create a simple dataset
X_simple = torch.tensor([[1.]])
y_simple = torch.tensor([[2.]])
# create our optimizer
optim = torch.optim.SGD(model.parameters(), lr=1e-2)
mse_loss_fn = nn.MSELoss()
y_hat = model(X_simple)
print('model params before:', model.weight)
loss = mse_loss_fn(y_hat, y_simple)
optim.zero_grad()
loss.backward()
optim.step()
print('model params after:', model.weight)
model params before: Parameter containing:
tensor([[-0.3881]], requires_grad=True)
model params after: Parameter containing:
tensor([[-0.3603]], requires_grad=True)
As we can see, the parameter was updated in the correct direction.
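We can verify this by hand. A brief sketch, reusing the tensors from the cell above: for this one-point dataset the loss is $(w \cdot 1 + b - 2)^2$, so the gradient with respect to $w$ is $2(\hat{y} - 2)$, which is negative here, and the SGD update $w \leftarrow w - \eta \cdot \text{grad}$ therefore increases $w$:
print('gradient stored by backward():', model.weight.grad)
print('manual gradient 2 * (y_hat - y):', (2 * (y_hat - y_simple)).detach())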
Now let's combine what we've learned to solve linear regression in a "PyTorchic" way.
step_size = 0.1
linear_module = nn.Linear(d, 1, bias=False)
loss_func = nn.MSELoss()
optim = torch.optim.SGD(linear_module.parameters(), lr=step_size)
print('iter,\tloss,\tw')
for i in range(20):
    y_hat = linear_module(X)
    loss = loss_func(y_hat, y)
    optim.zero_grad()
    loss.backward()
    optim.step()
    print('{},\t{:.2f},\t{}'.format(i, loss.item(), linear_module.weight.view(2).detach().numpy()))
print('\ntrue w\t\t', true_w.view(2).numpy())
print('estimated w\t', linear_module.weight.view(2).detach().numpy())
iter,	loss,	w
0,	3.45,	[-0.6277163 0.5246437]
1,	2.23,	[-0.6903032 0.81286395]
2,	1.45,	[-0.74182737 1.0444099 ]
3,	0.94,	[-0.7842876 1.2304163]
4,	0.61,	[-0.8193146 1.3798317]
5,	0.40,	[-0.8482402 1.4998472]
6,	0.26,	[-0.8721527 1.5962417]
7,	0.17,	[-0.8919422 1.6736592]
8,	0.11,	[-0.9083374 1.7358311]
9,	0.08,	[-0.92193544 1.785756 ]
10,	0.05,	[-0.93322587 1.825843 ]
11,	0.04,	[-0.9426106 1.8580279]
12,	0.03,	[-0.95041996 1.8838661 ]
13,	0.02,	[-0.95692545 1.9046069 ]
14,	0.02,	[-0.9623507 1.9212543]
15,	0.01,	[-0.9668801 1.9346145]
16,	0.01,	[-0.9706655 1.9453354]
17,	0.01,	[-0.97383255 1.953937 ]
18,	0.01,	[-0.976485 1.9608375]
19,	0.01,	[-0.9787088 1.9663724]

true w		 [-1. 2.]
estimated w	 [-0.9787088 1.9663724]
visualize_fun(linear_module.weight.t(), 'Dataset with learned $w$ (PyTorch GD)')
In the previous examples, we computed the average gradient over the entire dataset (Gradient Descent). We can implement Stochastic Gradient Descent with a simple modification.
step_size = 0.01
linear_module = nn.Linear(d, 1)
loss_func = nn.MSELoss()
optim = torch.optim.SGD(linear_module.parameters(), lr=step_size)
print('iter,\tloss,\tw')
for i in range(200):
    rand_idx = np.random.choice(n)  # take a random point from the dataset
    x = X[rand_idx]
    y_hat = linear_module(x)
    loss = loss_func(y_hat, y[rand_idx])  # only compute the loss on the single point
    optim.zero_grad()
    loss.backward()
    optim.step()
    if i % 20 == 0:
        print('{},\t{:.2f},\t{}'.format(i, loss.item(), linear_module.weight.view(2).detach().numpy()))
print('\ntrue w\t\t', true_w.view(2).numpy())
print('estimated w\t', linear_module.weight.view(2).detach().numpy())
iter,	loss,	w
0,	0.01,	[0.02993712 0.40586257]
20,	7.11,	[-0.21203727 0.6746018 ]
40,	0.13,	[-0.5276561 1.248206 ]
60,	0.00,	[-0.64190584 1.4578575 ]
80,	0.01,	[-0.7105578 1.644668 ]
100,	0.01,	[-0.84861267 1.8285881 ]
120,	0.00,	[-0.884992 1.8483113]
140,	0.00,	[-0.8915096 1.8892854]
160,	0.01,	[-0.9286745 1.9184113]
180,	0.00,	[-0.93960834 1.9385221 ]

true w		 [-1. 2.]
estimated w	 [-0.94389266 1.9486643 ]
visualize_fun(linear_module.weight.t(), 'Dataset with learned $w$ (PyTorch SGD)')
Let's consider the dataset from hw3. We will try to fit a simple neural network to the data.
%matplotlib inline
d = 1
n = 200
X = torch.rand(n,d)
y = 4 * torch.sin(np.pi * X) * torch.cos(6*np.pi*X**2)
plt.scatter(X.numpy(), y.numpy())
plt.title('plot of $f(x)$')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()
Here we define a simple neural network with two hidden layers and Tanh activations. There are a few hyperparameters to play with to get a feel for how they change the results.
# feel free to play with these parameters
step_size = 0.05
n_epochs = 6000
n_hidden_1 = 32
n_hidden_2 = 32
d_out = 1
neural_network = nn.Sequential(
    nn.Linear(d, n_hidden_1),
    nn.Tanh(),
    nn.Linear(n_hidden_1, n_hidden_2),
    nn.Tanh(),
    nn.Linear(n_hidden_2, d_out)
)
loss_func = nn.MSELoss()
optim = torch.optim.SGD(neural_network.parameters(), lr=step_size)
print('iter,\tloss')
for i in range(n_epochs):
    y_hat = neural_network(X)
    loss = loss_func(y_hat, y)
    optim.zero_grad()
    loss.backward()
    optim.step()
    if i % (n_epochs // 10) == 0:
        print('{},\t{:.2f}'.format(i, loss.item()))
iter,	loss
0,	3.49
600,	3.40
1200,	2.95
1800,	1.60
2400,	1.37
3000,	0.82
3600,	0.70
4200,	0.48
4800,	0.32
5400,	0.24
X_grid = torch.from_numpy(np.linspace(0,1,50)).float().view(-1, d)
y_hat = neural_network(X_grid)
plt.scatter(X.numpy(), y.numpy())
plt.plot(X_grid.detach().numpy(), y_hat.detach().numpy(), 'r')
plt.title('plot of $f(x)$ and $\hat{f}(x)$')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()
There are other optimization algorithms besides stochastic gradient descent. One is a modification of SGD called momentum. We won't get into it here, but if you would like to read more, here is a good place to start.
We only change the step size and add the momentum keyword argument to the optimizer. Notice how it reduces the training loss in fewer iterations.
# feel free to play with these parameters
step_size = 0.05
momentum = 0.9
n_epochs = 1500
n_hidden_1 = 32
n_hidden_2 = 32
d_out = 1
neural_network = nn.Sequential(
    nn.Linear(d, n_hidden_1),
    nn.Tanh(),
    nn.Linear(n_hidden_1, n_hidden_2),
    nn.Tanh(),
    nn.Linear(n_hidden_2, d_out)
)
loss_func = nn.MSELoss()
optim = torch.optim.SGD(neural_network.parameters(), lr=step_size, momentum=momentum)
print('iter,\tloss')
for i in range(n_epochs):
    y_hat = neural_network(X)
    loss = loss_func(y_hat, y)
    optim.zero_grad()
    loss.backward()
    optim.step()
    if i % (n_epochs // 10) == 0:
        print('{},\t{:.2f}'.format(i, loss.item()))
iter,	loss
0,	3.47
150,	2.94
300,	0.83
450,	0.55
600,	0.12
750,	0.10
900,	0.06
1050,	0.03
1200,	0.00
1350,	0.00
X_grid = torch.from_numpy(np.linspace(0,1,50)).float().view(-1, d)
y_hat = neural_network(X_grid)
plt.scatter(X.numpy(), y.numpy())
plt.plot(X_grid.detach().numpy(), y_hat.detach().numpy(), 'r')
plt.title('plot of $f(x)$ and $\hat{f}(x)$')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.show()
Often we do not want to use a fixed learning rate throughout training. PyTorch offers learning rate schedulers to change the learning rate over time. Common strategies include multiplying the lr by a constant every epoch (e.g. 0.9) and halving the learning rate when the training loss flattens out.
See the learning rate scheduler docs for usage and examples.
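As a minimal sketch of the first strategy (the model, optimizer, and loop body here are placeholders), an ExponentialLR scheduler multiplies the learning rate by gamma after every epoch:
model = nn.Linear(1, 1)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optim, gamma=0.9)  # lr <- lr * 0.9 each epoch
for epoch in range(5):
    # ... the usual training loop for one epoch would go here ...
    optim.step()      # placeholder step so the sketch runs without warnings
    scheduler.step()  # update the learning rate once per epoch
    print(epoch, scheduler.get_last_lr())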
So far, we have been considering regression tasks and have used the MSELoss module. For the homework, we will be performing a classification task and will use the cross entropy loss.
PyTorch implements a version of the cross entropy loss in one module called CrossEntropyLoss. Its usage is slightly different from MSELoss, so we will break it down here.
Try out the loss function on the four example inputs below (uncomment one at a time); each contains three toy predictions, and the true class labels are $y=[1,1,0]$. The first two inputs are "correct" in that they give a higher raw score to the correct class, and the second is "more confident" in its predictions, leading to a smaller loss. The last two inputs are incorrect predictions, with lower and higher confidence respectively.
loss = nn.CrossEntropyLoss()
input = torch.tensor([[-1., 1],[-1, 1],[1, -1]]) # raw scores correspond to the correct class
# input = torch.tensor([[-3., 3],[-3, 3],[3, -3]]) # raw scores correspond to the correct class with higher confidence
# input = torch.tensor([[1., -1],[1, -1],[-1, 1]]) # raw scores correspond to the incorrect class
# input = torch.tensor([[3., -3],[3, -3],[-3, 3]]) # raw scores correspond to the incorrect class with incorrectly placed confidence
target = torch.tensor([1, 1, 0])
output = loss(input, target)
print(output)
tensor(0.1269)
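To see where this number comes from: CrossEntropyLoss applies a softmax to the raw scores and then averages the negative log-probability of the true class. A quick manual check for the first input (each row assigns probability exp(1) / (exp(1) + exp(-1)) ≈ 0.8808 to its correct class):
log_probs = F.log_softmax(input, dim=1)              # log-softmax over the class dimension
manual = -log_probs[torch.arange(3), target].mean()  # mean negative log-probability of the true class
print(manual)  # tensor(0.1269)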
When working with images, we often want to use convolutions to extract features. PyTorch implements this for us in the torch.nn.Conv2d module. It expects the input to have a specific shape $(N, C_{in}, H_{in}, W_{in})$, where $N$ is the batch size, $C_{in}$ is the number of channels the image has, and $H_{in}, W_{in}$ are the image height and width respectively.
We can modify the behavior of the convolution with parameters such as kernel_size, stride, and padding. These can change the output dimensions, so be careful.
See the torch.nn.Conv2d docs for more information.
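For example, a small sketch (channel counts chosen arbitrarily) of how padding and stride change the output height and width:
x = torch.randn(1, 1, 28, 28)
print(nn.Conv2d(1, 8, kernel_size=3)(x).shape)             # torch.Size([1, 8, 26, 26]) - shrinks by kernel_size - 1
print(nn.Conv2d(1, 8, kernel_size=3, padding=1)(x).shape)  # torch.Size([1, 8, 28, 28]) - padding preserves the size
print(nn.Conv2d(1, 8, kernel_size=3, stride=2)(x).shape)   # torch.Size([1, 8, 13, 13]) - stride downsamples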
To illustrate what the Conv2d module is doing, let's set the conv weights manually to a Gaussian blur kernel.
We can see that it applies the kernel to the image.
# an entire mnist digit
image = np.array([0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.3803922 , 0.37647063, 0.3019608 ,0.46274513, 0.2392157 , 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.3529412 , 0.5411765 , 0.9215687 ,0.9215687 , 0.9215687 , 0.9215687 , 0.9215687 , 0.9215687 ,0.9843138 , 0.9843138 , 0.9725491 , 0.9960785 , 0.9607844 ,0.9215687 , 0.74509805, 0.08235294, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.54901963,0.9843138 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.7411765 , 0.09019608, 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.8862746 , 0.9960785 , 0.81568635,0.7803922 , 0.7803922 , 0.7803922 , 0.7803922 , 0.54509807,0.2392157 , 0.2392157 , 0.2392157 , 0.2392157 , 0.2392157 ,0.5019608 , 0.8705883 , 0.9960785 , 0.9960785 , 0.7411765 ,0.08235294, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.14901961, 0.32156864, 0.0509804 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.13333334,0.8352942 , 0.9960785 , 0.9960785 , 0.45098042, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.32941177, 0.9960785 ,0.9960785 , 0.9176471 , 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.32941177, 0.9960785 , 0.9960785 , 0.9176471 ,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.4156863 , 0.6156863 ,0.9960785 , 0.9960785 , 0.95294124, 0.20000002, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.09803922, 0.45882356, 0.8941177 , 0.8941177 ,0.8941177 , 0.9921569 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.94117653, 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.26666668, 0.4666667 , 0.86274517,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.5568628 ,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.14509805, 0.73333335,0.9921569 , 0.9960785 , 0.9960785 , 0.9960785 , 0.8745099 ,0.8078432 , 0.8078432 , 0.29411766, 0.26666668, 0.8431373 ,0.9960785 , 0.9960785 , 0.45882356, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.4431373 , 0.8588236 , 0.9960785 , 0.9490197 , 0.89019614,0.45098042, 0.34901962, 0.12156864, 0., 0.,0., 0., 0.7843138 , 0.9960785 , 0.9450981 ,0.16078432, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.6627451 , 0.9960785 ,0.6901961 , 0.24313727, 0., 0., 0.,0., 0., 0., 0., 0.18823531,0.9058824 , 0.9960785 , 0.9176471 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.07058824, 0.48627454, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.32941177, 0.9960785 , 0.9960785 ,0.6509804 , 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.54509807, 0.9960785 , 0.9333334 , 0.22352943, 0.,0., 0., 0., 0., 0.,0., 0., 
0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.8235295 , 0.9803922 , 0.9960785 ,0.65882355, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.9490197 , 0.9960785 , 0.93725497, 0.22352943, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.34901962, 0.9843138 , 0.9450981 ,0.3372549 , 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.01960784,0.8078432 , 0.96470594, 0.6156863 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.01568628, 0.45882356, 0.27058825,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.], dtype=np.float32)
image_torch = torch.from_numpy(image).view(1, 1, 28, 28)
# a gaussian blur kernel
gaussian_kernel = torch.tensor([[1., 2, 1],[2, 4, 2],[1, 2, 1]]) / 16.0
conv = nn.Conv2d(1, 1, 3)
# manually set the conv weight
conv.weight.data[:] = gaussian_kernel
convolved = conv(image_torch)
plt.title('original image')
plt.imshow(image_torch.view(28,28).detach().numpy())
plt.show()
plt.title('blurred image')
plt.imshow(convolved.view(26,26).detach().numpy())
plt.show()
As we can see, the image is blurred as expected.
In practice, we learn many kernels at a time. In this example, we take in an RGB image (3 channels) and output a 16 channel image. After an activation function, that could be used as input to another Conv2d
module.
im_channels = 3 # if we are working with RGB images, there are 3 input channels, with black and white, 1
out_channels = 16 # this is a hyperparameter we can tune
kernel_size = 3 # this is another hyperparameter we can tune
batch_size = 4
image_width = 32
image_height = 32
im = torch.randn(batch_size, im_channels, image_width, image_height)
m = nn.Conv2d(im_channels, out_channels, kernel_size)
convolved = m(im) # it is a module so we can call it
print('im shape', im.shape)
print('convolved im shape', convolved.shape)
im shape torch.Size([4, 3, 32, 32])
convolved im shape torch.Size([4, 16, 30, 30])
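Continuing the idea above, a minimal sketch (layer sizes chosen arbitrarily, reusing im from the cell above) of stacking two Conv2d modules with a ReLU in between, so the 16-channel output of the first convolution becomes the input of the second:
conv_stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),   # RGB in, 16 feature maps out
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3),  # 16 feature maps in, 32 out
)
print(conv_stack(im).shape)  # torch.Size([4, 32, 28, 28])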
When working with text, we often want to use word embeddings and recurrent neural networks: we need to represent words as vectors, and we need to use the context in which each word appears. PyTorch implements these concepts in torch.nn.Embedding and torch.nn.RNN.
Embedding converts a list of $n$ words into an $\mathbb{R}^{n \times d}$ matrix, where each word is represented by an $\mathbb{R}^d$ vector. When creating it you need to provide the number of words in your dictionary and the embedding size $d$.
RNN allows us to use context by having the output $y$ be a function of the input $x$, which we provide, and a hidden state, which depends on the previous inputs in the sequence. The layer takes an input of shape (sequence length, batch size, $d_{in}$) and an optional hidden-state argument. Its output is a tuple containing the actual output, of shape (sequence length, batch size, $d_{out}$), and the hidden state.
For a longer explanation, take a look at this post. While it focuses on LSTMs, the beginning gives a good explanation of how RNNs work. If you really want to take a deep dive, this post goes in-depth on various applications of RNNs and LSTMs.
Lastly, if this is something you are interested in, consider CSE 447 (NLP).
# Feel free to play with parameters
embedding_size = 3
num_unique_words = 10
sentence_length = 2
batch_size = 1
hidden_size = 5
# Let's generate data of shape (sentence_length, batch_size)
data = torch.randint(high=num_unique_words, size=(sentence_length, batch_size))
embedding = nn.Embedding(num_unique_words, embedding_size)
rnn_layer = nn.RNN(embedding_size, hidden_size)
print(f"Input Data shape: {data.shape}")
embedded_vec = embedding(data)
print(f"After Embedding shape: {embedded_vec.shape}")
result, hidden = rnn_layer(embedded_vec)
print(f"After RNN output shape: {result.shape}") # (sequence length, batch size, hidden_size)
print(f"After RNN hidden shape: {hidden.shape}") # (# layers, batch size, hidden_size)
Input Data shape: torch.Size([2, 1])
After Embedding shape: torch.Size([2, 1, 3])
After RNN output shape: torch.Size([2, 1, 5])
After RNN hidden shape: torch.Size([1, 1, 5])
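Since the RNN also accepts an optional initial hidden state, here is a small sketch (reusing the layers and tensors from the cell above) that passes one in explicitly and feeds the sequence one step at a time, carrying the hidden state forward ourselves:
# all-zero initial hidden state of shape (num_layers, batch_size, hidden_size)
hidden = torch.zeros(1, batch_size, hidden_size)
for t in range(sentence_length):
    step_input = embedded_vec[t:t+1]             # one time step, shape (1, batch_size, embedding_size)
    out, hidden = rnn_layer(step_input, hidden)  # the returned hidden state carries the context forward
print(out.shape, hidden.shape)  # torch.Size([1, 1, 5]) torch.Size([1, 1, 5])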
def process_corpus(corpus, sentence_length):
    """
    Arguments:
        corpus (str) -- Continuous text. Can be anything but should be relatively long.
        sentence_length (int) -- Size of each sentence in the output.
            Does not have to be divisible by # of words in corpus, in which case the end will be padded.
    Returns:
        Tuple of size 4 containing:
        - Train Input - shape (batch, sentence) containing indexes of words for each sentence.
        - Train Truth - Same as Train Input but contains the index of the next word in a given sentence.
        - Word to Index Dictionary - Dictionary for each word containing a corresponding integer.
        - Index to Word Dictionary - Reverse of Word to Index Dictionary.
    Example:
        process_corpus("Sam likes cats", 2) outputs:
        - [[1, 2], [3, 0]]
        - [[2, 3], [0, 0]]
        - {"": 0, "Sam": 1, "likes": 2, "cats": 3}
        - {0: "", 1: "Sam", 2: "likes", 3: "cats"}
    """
    # Let's make corpus a list of words
    corpus = corpus.split()
    # QUESTION: Should we also trim/lowercase the words here? Is "You," vs. "you" very different?
    # Then split it into smaller sentences of size sentence_length
    x = []
    y = []
    for idx in range(0, len(corpus), sentence_length):
        x.append(corpus[idx: idx + sentence_length])
        # Since we are trying to predict the next word, y's are just x's shifted by one
        y.append(corpus[idx + 1: idx + sentence_length + 1])
    # Last sentences might be shorter. Let's pad them with something smaller
    x[-1] += ["" for _ in range(sentence_length - len(x[-1]))]
    y[-1] += ["" for _ in range(sentence_length - len(y[-1]))]
    # Create dictionaries from words to indices and vice-versa
    # QUESTION: Is "" a good choice for end-of-sentence tag? Maybe we should pad beginning of the sentences too?
    idx_to_word = {0: ""}
    word_to_idx = {"": 0}
    idx = 1
    for sentence in x:
        for word in sentence:
            if word not in word_to_idx:
                word_to_idx[word] = idx
                idx_to_word[idx] = word
                idx += 1
    x_idx = torch.tensor([[word_to_idx[w] for w in s] for s in x]).long()
    y_idx = torch.tensor([[word_to_idx[w] for w in s] for s in y]).long()
    return x_idx, y_idx, word_to_idx, idx_to_word
# Feel free to play with parameters
embedding_size = 10
sentence_length = 5
hidden_size = 5
n_epochs = 1000
# Dataset
corpus = "Hey, you. You’re finally awake. " \
"You were trying to cross the border, right? " \
"Walked right into that Imperial ambush, " \
"same as us, and that thief over there. " \
"Skyrim was fine until you came along. " \
"Empire was nice and lazy. " \
"If they hadn’t been looking for you, " \
"I could’ve stolen that horse and been half way to Hammerfell. " \
"You there. You and me — we should be here. " \
"It’s these Stormcloaks the Empire wants. "
x, y, _, idx_to_word = process_corpus(corpus, sentence_length)
model_rnn = nn.Sequential(
    nn.Embedding(len(idx_to_word), embedding_size),
    nn.RNN(embedding_size, hidden_size, batch_first=True),
)
# Linear model has to be separate, because we'll be using only first output of the RNN
linear = nn.Linear(hidden_size, len(idx_to_word))
print(model_rnn)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(model_rnn.parameters()) + list(linear.parameters()))
for i in range(n_epochs):
    x_mid, _ = model_rnn(x)
    y_hat = linear(x_mid).transpose(1, 2)  # This makes the shape correct for the loss
    loss = criterion(y_hat, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i % (n_epochs // 10) == 0:
        print('{},\t{:.2f}'.format(i, loss.item()))
Sequential(
  (0): Embedding(61, 10)
  (1): RNN(10, 5, batch_first=True)
)
0,	4.29
100,	3.84
200,	3.41
300,	3.00
400,	2.66
500,	2.40
600,	2.18
700,	2.00
800,	1.83
900,	1.69
# Let's see a prediction for the first sentence
with torch.no_grad():
    y_hat = linear(model_rnn(x)[0])
    y_hat = torch.argmax(y_hat, dim=2)
    sentences_hat = [[idx_to_word[int(w)] for w in s] for s in y_hat]
    sentences_true = [[idx_to_word[int(w)] for w in s] for s in y]
sentence_idx = 0
print(f"Truth: {sentences_true[sentence_idx]}")
print(f"Predict: {sentences_hat[sentence_idx]}")
Truth: ['you.', 'You’re', 'finally', 'awake.', 'You']
Predict: ['', 'Empire', 'finally', 'awake.', 'You']