PyTorch Introduction

Today, we will be introducing PyTorch, "an open source deep learning platform that provides a seamless path from research prototyping to production deployment".

This notebook is by no means comprehensive. If you have any questions, the documentation and Google are your friends.

Key takeaways:

Tensors and relation to numpy

By this point, we have worked with numpy quite a bit. PyTorch's basic building block, the tensor, is similar to numpy's ndarray.
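
For example, a quick sketch of the correspondence:

```python
import torch
import numpy as np

# Create a tensor directly, or from an existing numpy array.
x = torch.tensor([[1., 2.], [3., 4.]])
a = np.array([[1., 2.], [3., 4.]])
x_from_np = torch.from_numpy(a)   # shares memory with the numpy array

# Familiar numpy-style attributes and operations.
print(x.shape)        # torch.Size([2, 2])
print(x.sum(dim=0))   # column sums
print(x.numpy())      # convert back to a numpy ndarray
```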

Tensor.view

We can use the Tensor.view() function to reshape tensors similarly to numpy.reshape().

Note: An important difference between view and reshape is that view returns a tensor that shares the same underlying data as the one passed in. This means that if we modify values in the output of view, they will also change in its input. This can lead to some issues. For more information, see the PyTorch documentation.

Similarly to reshape, it can also automatically calculate the correct dimension if a -1 is passed in. This is useful if we are working with batches, but the batch size is unknown.
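
For example:

```python
import torch

x = torch.arange(12)          # tensor([0, 1, ..., 11])
y = x.view(3, 4)              # reshape to 3 x 4
z = x.view(-1, 6)             # -1 infers the first dimension (here 2)

# view shares the underlying data with the original tensor:
y[0, 0] = 100
print(x[0])                   # tensor(100) -- the original changed too
```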

PyTorch as an autograd framework

The main benefit of PyTorch is that it keeps track of gradients for us as we do our calculations. This is done through computation graphs, which you can read more about in Appendix 1 of this notebook. The example below shows how to use these gradients.

Consider the function $f(x) = (x-2)^2$.

Q: Compute $\frac{d}{dx} f(x)$ and then compute $f'(1)$.

We make a backward() call on the final variable (y) in the computation, which computes the gradients of y with respect to all of the leaf variables (here, x) at once.
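
For example, we can compute $f'(1)$ with autograd like this:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)  # leaf variable we differentiate w.r.t.
y = (x - 2) ** 2                             # y = f(x)

y.backward()                                 # populates x.grad with dy/dx
print(x.grad)                                # tensor([-2.]) since f'(1) = 2*(1-2) = -2
```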

It can also compute gradients of functions of several variables.

Let $w = [w_1, w_2]^T$

Consider $g(w) = 2w_1w_2 + w_2\cos(w_1)$

Q: Compute $\nabla_w g(w)$ and verify $\nabla_w g([\pi,1]) = [2, 2\pi - 1]^T$
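
We can check this with autograd:

```python
import math
import torch

w = torch.tensor([math.pi, 1.0], requires_grad=True)
g = 2 * w[0] * w[1] + w[1] * torch.cos(w[0])

g.backward()
print(w.grad)   # tensor([2.0000, 5.2832]), i.e. [2, 2*pi - 1]
```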

Using the gradients - Linear regression using GD with torch

Now that we have gradients, we can use our favorite optimization algorithm: gradient descent!

Note: This example is an illustration to connect ideas we have seen before to PyTorch's way of doing things. We will see how to do this in the "PyTorchic" way in the next example.

But first, let's generate synthetic data for our problem.

We will also define a helper function to visualize the results of the $\hat{w}$ we have learned.
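
For instance, a possible setup (the dimensions, true weights, and noise level here are illustrative choices, not the notebook's exact values):

```python
import torch

# Hypothetical setup: n points in d dimensions with a known "true" weight vector.
torch.manual_seed(0)
n, d = 50, 2
true_w = torch.tensor([[-1.0], [2.0]])          # d x 1
X = torch.randn(n, d)                           # design matrix
y = X @ true_w + 0.1 * torch.randn(n, 1)        # targets with a little noise
```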

Algorithm for Linear regression using GD with automatically computed derivatives

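
A sketch of the gradient descent loop, reusing the synthetic X, y, and d from above; the step size and number of iterations are arbitrary:

```python
# Gradient descent on the squared loss, using autograd for the gradient.
w = torch.zeros(d, 1, requires_grad=True)       # our estimate of true_w
step_size = 0.1

for i in range(20):
    y_hat = X @ w
    loss = torch.mean((y_hat - y) ** 2)

    loss.backward()                             # compute d(loss)/dw into w.grad
    with torch.no_grad():                       # update without tracking gradients
        w -= step_size * w.grad
    w.grad.zero_()                              # reset gradient for the next iteration

print(w.detach().view(-1))                      # should approach true_w
```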

torch.nn.Module

A Module is PyTorch's way of packaging operations on tensors together with any parameters they need. Modules are implemented as subclasses of the torch.nn.Module class. All modules are callable and can be composed together to create complex functions.

torch.nn docs

Note: most of the functionality implemented for modules can be accessed in a functional form via torch.nn.functional, but these functions require you to create and manage the weight tensors yourself.

torch.nn.functional docs.
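
To make this concrete, here is a small sketch of both forms. SquareAndShift is a made-up module used only for illustration, and F.linear is the functional counterpart of the Linear module described next:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny custom module: squares its input and adds a learnable scalar shift.
class SquareAndShift(nn.Module):
    def __init__(self):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(1))   # registered as a learnable parameter

    def forward(self, x):
        return x ** 2 + self.shift

m = SquareAndShift()
print(m(torch.tensor([1.0, 2.0, 3.0])))             # modules are callable

# Functional form: no module object, so we manage the weight and bias ourselves.
x = torch.randn(4, 3)
W = torch.randn(5, 3)
b = torch.zeros(5)
print(F.linear(x, W, b).shape)                      # torch.Size([4, 5])
```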

Linear Module

The bread and butter of modules is the Linear module, which does a linear transformation with a bias. It takes the input and output dimensions as parameters and creates the weights in the object. It is just a matrix multiplication and an addition of a bias:

$$ f(X) = XW + b, f: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times h} $$

where $X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times h}$ and $b \in \mathbb{R}^{h}$
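
For example, with arbitrary dimensions:

```python
import torch
import torch.nn as nn

d, h, n = 3, 5, 10
linear = nn.Linear(d, h)            # creates a weight (shape h x d) and a bias (shape h)

X = torch.randn(n, d)
out = linear(X)                     # computes X W^T + b
print(out.shape)                    # torch.Size([10, 5])
print(linear.weight.shape, linear.bias.shape)
```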

Unlike how we initialized our $w$ manually, the Linear module automatically initializes the weights randomly. For minimizing non-convex loss functions (e.g. training neural networks), initialization is important and can affect results. If training isn't working as well as expected, one thing to try is manually initializing the weights to something different from the default. PyTorch implements some common initializations in torch.nn.init.

torch.nn.init docs

Activation functions

PyTorch implements a number of activation functions including but not limited to ReLU, Tanh, and Sigmoid. Since they are modules, they need to be instantiated.
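
For example:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 2.0])

relu = nn.ReLU()
tanh = nn.Tanh()
sigmoid = nn.Sigmoid()

print(relu(x))      # tensor([0., 0., 2.])
print(tanh(x))
print(sigmoid(x))
```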

Sequential

Many times, we want to compose Modules together. torch.nn.Sequential provides a good interface for composing simple modules.

Note: we can access all of the parameters (of any nn.Module) with the parameters() method.
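
For example, a small two-layer composition (the dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

d, h, out = 3, 4, 1
model = nn.Sequential(
    nn.Linear(d, h),
    nn.Tanh(),
    nn.Linear(h, out),
)

x = torch.randn(5, d)
print(model(x).shape)               # torch.Size([5, 1])

# parameters() works on any nn.Module, including Sequential.
for p in model.parameters():
    print(p.shape)
```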

Loss functions

PyTorch implements many common loss functions including MSELoss and CrossEntropyLoss.
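
For example, MSELoss computes the mean squared error between a prediction and a target:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
y_hat = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([0.0, 2.0, 5.0])
print(mse(y_hat, y))    # mean of squared errors: (1 + 0 + 4) / 3
```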

torch.optim

PyTorch implements a number of gradient-based optimization methods in torch.optim, including gradient descent (torch.optim.SGD). At a minimum, an optimizer takes in the model parameters and a learning rate.

Optimizers do not compute the gradients for you, so you must call backward() yourself. You also must call the optimizer's zero_grad() method before calling backward(), since by default PyTorch does an in-place add to the .grad member variable rather than overwriting it.

This does both the detach_() and zero_() calls on the grad variables of all the parameters.

torch.optim docs
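
A minimal sketch of the full pattern, using a single made-up scalar parameter and the objective $(w-5)^2$:

```python
import torch
import torch.optim as optim

# A single parameter and the objective (w - 5)^2, minimized at w = 5.
w = torch.tensor([0.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

optimizer.zero_grad()          # clear any stale gradients
loss = (w - 5) ** 2
loss.backward()                # w.grad = 2 * (w - 5) = -10
optimizer.step()               # w <- w - lr * w.grad = 0 - 0.1 * (-10) = 1
print(w)                       # tensor([1.], requires_grad=True)
```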

As we can see, the parameter was updated in the correct direction.

Linear regression using GD with torch.nn module

Now let's combine what we've learned to solve linear regression in a "PyTorchic" way.
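
A sketch of what that might look like, reusing the synthetic X, y, and d from above (step size and iteration count are arbitrary):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for i in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(model.weight, model.bias)
```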

Linear regression using SGD

In the previous examples, we computed the average gradient over the entire dataset (Gradient Descent). We can implement Stochastic Gradient Descent with a simple modification.
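
A sketch of that modification, again reusing the synthetic X, y, n, and d from above; the batch size and number of steps are arbitrary choices:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Same as before, but each step uses a random minibatch instead of the full dataset.
model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
batch_size = 8

for i in range(200):
    idx = torch.randperm(n)[:batch_size]        # sample a minibatch
    optimizer.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx])
    loss.backward()
    optimizer.step()
```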

Neural Network Basics in PyTorch

Let's consider the dataset from hw3. We will try to fit a simple neural network to the data.

Here we define a simple neural network with two hidden layers and Tanh activations. There are a few hyperparameters to play with to get a feel for how they change the results.
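
A sketch of such a network; the input, hidden, and output sizes below are placeholders, not the actual hw3 dimensions:

```python
import torch.nn as nn

# Hypothetical sizes: d_in is the input dimension of the data,
# hidden is a hyperparameter you can play with.
d_in, hidden, d_out = 1, 32, 1

model = nn.Sequential(
    nn.Linear(d_in, hidden),
    nn.Tanh(),
    nn.Linear(hidden, hidden),
    nn.Tanh(),
    nn.Linear(hidden, d_out),
)
```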

Useful links:

Appendix 1: Computation graphs

What's special about PyTorch's tensor object is that it implicitly creates a computation graph in the background. A computation graph is a way of writing a mathematical expression as a graph. There is an algorithm to compute the gradients of all the variables of a computation graph in time on the same order as it takes to compute the function itself.

Consider the expression $e=(a+b)*(b+1)$ with values $a=2, b=1$. We can draw the evaluated computation graph as follows:

source

In PyTorch, we can write this as
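
For example, something along these lines:

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)

c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6

e.backward()
print(a.grad)      # tensor([2.]) = de/da = b + 1
print(b.grad)      # tensor([5.]) = de/db = (a + b) + (b + 1)
```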

We can see that PyTorch kept track of the computation graph for us.

Appendix 2: Things that might help on the homework

Momentum

There are other optimization algorithms besides stochastic gradient descent. One is a modification of SGD called momentum. We won't get into it here, but if you would like to read more, here is a good place to start.

We only change the step size and add the momentum keyword argument to the optimizer. Notice how it reduces the training loss in fewer iterations.
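
For example (assuming a model like the one defined earlier):

```python
import torch.optim as optim

# Same optimizer call as before, with a smaller step size and a momentum term.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```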

CrossEntropyLoss

So far, we have been considering regression tasks and have used the MSELoss module. For the homework, we will be performing a classification task and will use the cross entropy loss.

PyTorch implements a version of the cross entropy loss in one module called CrossEntropyLoss. Its usage is slightly different from that of MSELoss, so we will break it down here.

Try out the loss function on three toy predictions. The true class labels are $y=[1,1,0]$. The first two examples correspond to predictions that are "correct" in that they have higher raw scores for the correct class. The second example is "more confident" in the prediction, leading to a smaller loss. The last example is an incorrect prediction, leading to a larger loss.
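
A sketch of what those toy predictions could look like; the raw scores below are made up for illustration:

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Raw, unnormalized scores (logits) for 3 examples and 2 classes.
# CrossEntropyLoss applies the softmax internally, so do NOT softmax first.
scores = torch.tensor([[-1.0,  1.0],    # correct (true class 1 has the higher score)
                       [-3.0,  3.0],    # correct and more confident -> smaller loss
                       [-1.0,  1.0]])   # incorrect (true class is 0) -> larger loss
labels = torch.tensor([1, 1, 0])        # true class indices

print(loss_fn(scores, labels))                              # mean loss over the 3 examples
print(nn.CrossEntropyLoss(reduction='none')(scores, labels))  # per-example losses
```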

Learning rate schedulers

Often we do not want to use a fixed learning rate throughout all training. PyTorch offers learning rate schedulers to change the learning rate over time. Common strategies include multiplying the lr by a constant every epoch (e.g. 0.9) and halving the learning rate when the training loss flattens out.

See the learning rate scheduler docs for usage and examples.
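
For example, a sketch using ExponentialLR to multiply the learning rate by 0.9 each epoch (assuming a model as before):

```python
import torch.optim as optim
from torch.optim import lr_scheduler

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    # ... one epoch of training (forward, backward, optimizer.step()) would go here ...
    scheduler.step()                       # decay the learning rate
    print(scheduler.get_last_lr())
```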

Appendix 3: Beyond Linear Layers

Convolutions

When working with images, we often want to use convolutions to extract features. PyTorch implements this for us in the torch.nn.Conv2d module. It expects the input to have a specific shape $(N, C_{in}, H_{in}, W_{in})$, where $N$ is the batch size, $C_{in}$ is the number of channels the image has, and $H_{in}, W_{in}$ are the image height and width respectively.

We can modify the convolution to have different properties with parameters such as kernel_size, stride, and padding. These can change the output dimensions, so be careful.

See the torch.nn.Conv2d docs for more information.

To illustrate what the Conv2d module is doing, let's set the conv weights manually to a Gaussian blur kernel.
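
A sketch of that idea; the kernel size and sigma below are illustrative choices, and the random image stands in for a real one:

```python
import torch
import torch.nn as nn

# A 1-channel convolution whose 5x5 kernel we overwrite with a Gaussian blur.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, padding=2, bias=False)

coords = torch.arange(5, dtype=torch.float32) - 2          # [-2, -1, 0, 1, 2]
g1d = torch.exp(-coords ** 2 / (2 * 1.0 ** 2))             # 1-D Gaussian, sigma = 1
kernel = torch.outer(g1d, g1d)
kernel = kernel / kernel.sum()                             # normalize so weights sum to 1

with torch.no_grad():
    conv.weight.copy_(kernel.view(1, 1, 5, 5))             # (out_ch, in_ch, H, W)

image = torch.rand(1, 1, 28, 28)                           # a (N, C, H, W) image batch
blurred = conv(image)
print(blurred.shape)                                       # torch.Size([1, 1, 28, 28])
```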

We can see that it applies the kernel to the image.

As we can see, the image is blurred as expected.

In practice, we learn many kernels at a time. In this example, we take in an RGB image (3 channels) and output a 16 channel image. After an activation function, that could be used as input to another Conv2d module.
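
For example, with an arbitrary batch of images:

```python
import torch
import torch.nn as nn

# 3 input channels (RGB) -> 16 output channels, one learned 3x3 kernel per output channel.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

rgb_batch = torch.rand(8, 3, 32, 32)        # (N, C_in, H, W)
features = torch.relu(conv(rgb_batch))      # activation after the convolution
print(features.shape)                       # torch.Size([8, 16, 32, 32])
```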

Recurrent Cells (or Recurrent Neural Networks)

Useful links: