PyTorch is a deep learning package for building dynamic computation graphs.
More broadly, it's a GPU-compatible replacement for NumPy. You can think of it as NumPy + auto-differentiation.
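To make that analogy concrete (a minimal sketch; the arrays here are arbitrary), PyTorch mirrors much of the NumPy API and converts cheaply in both directions:
import numpy as np
import torch
np_ones = np.ones((2, 3), dtype=np.float32)
torch_ones = torch.ones(2, 3)
# NumPy -> PyTorch and back; on CPU these share the same underlying memory
from_np = torch.from_numpy(np_ones)
back_to_np = torch_ones.numpy()
print(from_np)
print(back_to_np)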
The `Tensor` type is essentially a NumPy `ndarray`. There are several types of `Tensor`s, each of which corresponds to a NumPy `dtype`. The main ones you will probably use are:
| Data Type | Tensor Type | NumPy dtype |
|---|---|---|
| 32-bit floating point | `torch.FloatTensor` | `float32` |
| 8-bit integer (unsigned) | `torch.ByteTensor` | `uint8` |
| 64-bit integer (signed) | `torch.LongTensor` | `int64` |
In general, you want to use `FloatTensor` by default, unless your data is specifically an integer (in which case you'd use a `LongTensor`) or your data is bits (in which case you'd want a `ByteTensor`).
You can find a full list of tensor types here.
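For example (a small sketch; the particular values are arbitrary), each type can be constructed directly, and `.type()` tells you what you got:
import torch
float_t = torch.FloatTensor([1.5, 2.5, 3.5])
long_t = torch.LongTensor([1, 2, 3])
byte_t = torch.ByteTensor([0, 1, 1])
print(float_t.type())   # torch.FloatTensor
print(long_t.type())    # torch.LongTensor
print(byte_t.type())    # torch.ByteTensor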
To construct an uninitialized 4x6 matrix (think `malloc`, so not guaranteed to be all zeros), we can use:
# Import PyTorch and other libraries
import torch
import numpy as np
# Note that torch.Tensor is short for torch.FloatTensor
uninit_float = torch.Tensor(4, 6)
print(uninit_float)
print("Type of above Tensor (it's also printed when you print the tensor):")
print(type(uninit_float))
We can also create Tensors directly from (optionally nested) lists.
some_float_tensor = torch.Tensor([3.2, 4.3, 5.5])
some_float_tensor2 = torch.Tensor(np.array([[3.2], [4.3], [5.5]]))
print(some_float_tensor)
print(some_float_tensor2)
We can add tensors with the normal Python `+` operator.
rand_float = torch.rand(4, 6)
other_rand_float = torch.rand(4, 6)
# Three ways to add!
# 1. The Python + operator
print(rand_float + other_rand_float)
# 2. The torch.add function
print(torch.add(rand_float, other_rand_float))
# 3. In-place addition: the trailing underscore means rand_float is modified directly
rand_float.add_(other_rand_float)
print(rand_float)
Broadcasting is a construct in NumPy and PyTorch that lets operations apply to tensors of different shapes. When the shapes are compatible (comparing dimensions from the trailing end, each pair of sizes must either match, or one of them must be 1 or missing), the smaller tensor is "broadcast" across the bigger one. This is often desirable, since the looping happens at the C level and is very efficient in both speed and memory.
In the example below, `x` has shape `(3,)` and `y` has shape `(5, 3)`. We can still add them together --- the smaller tensor is automatically added to each row of the larger tensor.
# Random LongTensors from 0 to 9.
x = torch.LongTensor(3).random_(10)
y = torch.LongTensor(5, 3).random_(10)
print(x)
print(y)
print(x+y)
Broadcasting, if used improperly, can also lead to inadvertent bugs.
Consider this example: say you want to multiply a matrix of shape `(4, 6)` with one of shape `(6, 4)` to get something of shape `(4, 4)`. You might be tempted to use the `*` operator, but `*` is elementwise multiplication; for matrix multiplication, we use either `Tensor.mm` or the `@` operator.
However, this is where broadcasting-like behavior can bite you: the elementwise multiply may silently go through instead of raising a shape error, which makes the bug particularly nasty to detect (this behavior is thankfully being deprecated by PyTorch).
x = torch.LongTensor(4, 6).random_(10) # [4,6]
y = torch.LongTensor(6, 4).random_(10) # [6,4]
print("x: ", x)
print("y: ", y)
# Matrix multiply
print("x @ y (matrix multiply) : ", x @ y)
# USUALLY UNINTENTIONAL ELEMENTWISE MULTIPLICATION
# (newer PyTorch versions may warn about or reject this shape mismatch)
print("x * y (elementwise multiply) : ", x * y)
A big part of programming with tensors is keeping track of the expected shapes of your tensors and checking that those shapes actually show up --- doing so will dramatically reduce the number of bugs you have.
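One lightweight way to do that (a minimal sketch; the shapes here are made up) is to assert the shape you expect at key points, so a mismatch fails loudly instead of broadcasting silently:
a = torch.rand(4, 6)
W = torch.rand(6, 3)
out = a @ W
# We expect a (4, 6) @ (6, 3) product to have shape (4, 3)
assert out.size() == (4, 3), out.size()
print(out.size())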
It's often desirable to reshape a Tensor, maybe to broadcast with something else or to turn it into something that is easier to reason about. We can do that with the `.view` function.
x = torch.LongTensor(4, 4).random_(10)
print(x)
# Turn it into a Tensor of shape (2, 8)
y = x.view(2, 8)
print(y)
# Turn it into a Tensor of shape (8, ?).
# The -1 is inferred from the shape of the Tensor.
z = x.view(8, -1)
print(z)
# Turn it into a Tensor of shape (16,) (flatten it).
# This is the same as x.view(16).
flat = x.view(-1)
print(flat)
PyTorch follows the same conventions that NumPy uses for array slicing and indexing. Here's a good intro to slicing and indexing in NumPy.
x = torch.LongTensor(3, 5).random_(10)
print(x)
# Get the first row
print("First row:")
print(x[0])
# Get the last row
print("Last row:")
print(x[-1])
# Get the 3rd column
print("3rd column from left:")
print(x[:, 2])
# Get the last column
print("Last column from left:")
print(x[:, -1])
A computation graph is simply a way to define a sequence of operations to go from input to model output.
You can think of the nodes in the graph as representing operations, and the edges as representing the tensors flowing into and out of those operations.
For example, say we wanted to build a linear regression model. This has the form $\hat y = Wx + b$.
In this equation, $x$ is our input, $W$ is a learned weight matrix, $b$ is a learned bias, and $\hat y$ is the predicted output.
As a computation graph, this looks like a multiplication node taking $x$ and $W$, feeding into an addition node that takes $b$ and produces $\hat y$.
When implementing models, you're basically designing and specifying computation graphs. It's a bit like playing with Legos in that you're stringing together a bunch of blocks (the operations) to achieve a final desired output.
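As a tiny sketch of that idea (the sizes and values below are arbitrary), the linear-regression graph above is just two operations chained together: a matrix-vector multiply followed by an addition.
# y_hat = W x + b, written as two chained operations
W = torch.rand(2, 3)     # weight matrix
x = torch.rand(3)        # input vector
b = torch.rand(2)        # bias
wx = torch.mv(W, x)      # first node: matrix-vector product, shape (2,)
y_hat = wx + b           # second node: elementwise add, shape (2,)
print(y_hat)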
One of PyTorch's key features (and what makes it a deep learning library) is the ability to specify arbitrary computation graphs and compute gradients on them automatically.
The main abstraction it uses to do this is `torch.autograd.Variable`. A `Variable` wraps a tensor and stores:

- the underlying tensor data (the `.data` member)
- the gradient with respect to this Variable, once it has been computed (the `.grad` member)
- the function that created it (the `.grad_fn` member)

You can think of the `Variable` as a container, keeping track of the 3 items above.
# Variable is in the autograd module
from torch.autograd import Variable
# Variables wrap tensors
x = Variable(torch.Tensor([1, 2, 3]), requires_grad=True)
# You can access the underlying tensor with the .data attribute
print(x.data)
# Any operation you could use on Tensors, you can use on Variables
# Operations between Variables produce Variables.
y = Variable(torch.Tensor([4, 5, 6]), requires_grad=True)
z = x + y
print(z.data)
# But z also stores where it came from!
print(z.grad_fn)
A note on the `requires_grad` argument: with most NN code, you don't want to set `requires_grad=True` unless you explicitly want the gradient w.r.t. your input. In this example, however, `requires_grad=True` is necessary because otherwise there would be no gradients to compute, since there are no model parameters.
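As a quick illustration of that default (a small sketch reusing the `x` defined above), a `Variable` left with `requires_grad=False` doesn't get tracked by autograd unless it's combined with something that does:
a = Variable(torch.Tensor([1, 2, 3]))   # requires_grad defaults to False
print(a.requires_grad)                   # False
print((a + a).grad_fn)                   # None: nothing here needs a gradient
print((x + a).grad_fn)                   # tracked, because x requires a gradient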
Let's do some more operations and calculate the gradient.
z_sum = torch.sum(z)
print(z_sum)
print(z_sum.grad_fn)
Say we want to calculate the derivative of the sum w.r.t. the first element of x (in math, $\frac{\partial z_{sum}}{\partial x_0}$).
Autograd knows that: $$ z_{sum} = x_0 + y_0 + x_1 + y_1 + x_2 + y_2$$
So the derivative of $z_{sum}$ w.r.t $x_0$ is 1! Similarly, the derivative with respect to every element of $x$ is 1. Let's verify this with autograd.
# Backprop from z_sum backwards into the graph
# It'll follow the chain of computation by going from grad_fn to grad_fn
# until it reaches the input.
z_sum.backward()
print(x.grad)
Try running the block above multiple times! What do you notice happening?
The gradient in `.grad` accumulates each time we call `.backward()` --- this is convenient for some models, but in most models we build, we'll want to explicitly zero out the gradients before calling `.backward()` when training (more on this later).
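Here's a minimal sketch of both behaviors (the tensor `w` below is made up purely for illustration): each fresh forward/backward pass adds to the accumulated gradient, and zeroing `.grad` resets it.
w = Variable(torch.Tensor([1.0, 2.0]), requires_grad=True)
torch.sum(w * 3).backward()
print(w.grad)          # gradient is 3 for each element
torch.sum(w * 3).backward()
print(w.grad)          # gradients accumulate: now 6 for each element
# Explicitly zero out the accumulated gradient
w.grad.data.zero_()
print(w.grad)          # back to 0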
At the highest level, `nn.Module` defines what most would refer to as a "model". It's a convenient way of encapsulating the trainable parameters of a model (or of a component of your model), and subclassing it gives you functions for saving it, loading it, etc.
When you're building your own model, you're going to subclass `nn.Module`. Critically, you also need to override the `__init__()` and `forward()` functions.
- In `__init__()`, you should take arguments that modify how the model runs (e.g. # of layers, # of hidden units, output sizes). You'll also set up most of the layers that you use in the forward pass here.
- In `forward()`, you define the "forward pass" of your model, or the operations needed to transform input to output. You can use any of the Tensor operations in the forward pass.

The `Linear` module defines a module of the form $y = Ax + b$, where $A$ and $b$ are learnable parameters.
input_size = 1
output_size = 1
my_model = torch.nn.Linear(input_size, output_size)
# we can see A and b
print(list(my_model.parameters()))
To call the model, we treat it as a function:
x = Variable(torch.Tensor([[0],[1],[2]])) # represents putting multiple data points through the model
my_model(x)
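The built-in `Linear` module we just used could also be written by hand, following the `__init__()`/`forward()` pattern described above. Here's a minimal sketch (the class name and random initialization are made up for illustration, and this is not how `nn.Linear` is actually implemented):
import torch.nn as nn

class MyLinear(nn.Module):
    def __init__(self, input_size, output_size):
        super(MyLinear, self).__init__()
        # Register A and b as learnable parameters of the module
        self.A = nn.Parameter(torch.randn(output_size, input_size))
        self.b = nn.Parameter(torch.randn(output_size))

    def forward(self, x):
        # y = x A^T + b, applied to a batch of inputs of shape (N, input_size)
        return x @ self.A.t() + self.b

my_custom_model = MyLinear(input_size, output_size)
print(list(my_custom_model.parameters()))
print(my_custom_model(x))   # callable just like the built-in module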
Remember Gradient Descent and Stochastic Gradient Descent? :)
import matplotlib.pyplot as plt
%matplotlib inline
# let's generate some linear data
N = 10
# params we are trying to learn
A = torch.Tensor([np.pi])
b = torch.Tensor([42])
x = torch.rand(N, input_size) * 5
y = A * x + b + torch.rand(N, input_size) # add some noise to spice it up
plt.scatter(x.numpy(), y.numpy())
plt.show()
Intuitively, loss functions serve to tell your model how poorly it's doing --- the purpose of training is to adjust the weights of our model to minimize the loss.
A loss function takes a true output $y$ and a model-predicted output $\hat y$ and calculates the loss. If $y = \hat y$, our model produced the correct output and thus our loss is 0. The further our predicted $\hat y$ is from the true $y$, the higher our loss is.
PyTorch comes with a large collection of loss functions. A common loss used for regression problems is the mean squared error (`nn.MSELoss`).
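Concretely (a quick reminder rather than anything PyTorch-specific), for $N$ predictions the mean squared error is the average squared difference between predictions and targets: $$ \text{MSE}(y, \hat y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2 $$ so a loss of 0 means every prediction matches its target exactly.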
import torch.nn.functional as F
x_var = Variable(x)
y_var = Variable(y)
learning_rate = 0.05
num_epochs = 500
for idx in range(num_epochs):
    my_model.zero_grad()
    y_hat = my_model(x_var)
    loss = F.mse_loss(y_hat, y_var)  # this is the mean squared error
    loss.backward()  # it computes the gradients for us!
    # Manual gradient descent update on each parameter
    for param in my_model.parameters():
        param.data.add_(-learning_rate * param.grad.data)
    if idx % 50 == 0:
        print(idx, loss.data[0])
xs = Variable(torch.arange(0, 5, 0.01).view(-1, 1))
ys = my_model(xs)
plt.scatter(xs.data.numpy(), ys.data.numpy(), marker='.')
plt.scatter(x.numpy(), y.numpy())
plt.show()
print(list(my_model.parameters()))
Now that we can calculate the loss and backpropagate through our model (with `.backward()`), we can update the weights and try to reduce the loss!
PyTorch includes a variety of optimizers that do exactly this, from standard SGD to more advanced techniques like Adam and RMSProp.
At construction, PyTorch optimizers take the parameters they should optimize. When we run an input through our model, calculate the loss, and backpropagate, the gradients are automatically stored in those parameters (since they're all `Variable`s). With these gradients, the optimizer can then update the weights.
Optimizers live in the `torch.optim` module.
import torch.optim as optim
learning_rate = 0.05
my_model = torch.nn.Linear(input_size, output_size) # reset model
opt = optim.SGD(my_model.parameters(), lr=learning_rate)
num_epochs = 500
for idx in range(num_epochs):
    my_model.zero_grad()
    y_hat = my_model(x_var)
    loss = F.mse_loss(y_hat, y_var)  # this is the mean squared error
    loss.backward()  # it computes the gradients for us!
    opt.step()  # this does the parameter update for us!
    if idx % 50 == 0:
        print(idx, loss.data[0])
xs = Variable(torch.arange(0, 5, 0.01).view(-1, 1))
ys = my_model(xs)
plt.scatter(xs.data.numpy(), ys.data.numpy(), marker='.')
plt.scatter(x.numpy(), y.numpy())
plt.show()
print(list(my_model.parameters()))
As we can see, it worked!
Special thanks to Nelson Liu for the base materials.
Useful links: