PyTorch Introduction - Basics

Today, we will be introducing PyTorch, "an open source deep learning platform that provides a seamless path from research prototyping to production deployment".

This notebook is by no means comprehensive. If you have any questions, the documentation and Google are your friends.

Goal takeaways:

Installation

Before we even import torch packages, a quick note on how to install PyTorch with as little headache as possible. Go to the PyTorch website and scroll down until you see the "INSTALL PYTORCH" section.

There you will be able to choose a specific version for each OS and package manager (pip vs. conda), but most importantly which version of CUDA to use. For this class we advise you to choose the "None" option, as our models will not be too big.

However, if you have an NVIDIA GPU (or want to use lab machines with GPUs) and/or are curious, you should run the nvidia-smi command on that machine and look in the top-right corner of the output for the CUDA version. Then choose that version in the installation.

Tensors and relation to numpy

By this point, we have worked with numpy quite a bit. PyTorch's basic building block, the tensor, is similar to numpy's ndarray.
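As a minimal sketch of this similarity (the values here are arbitrary), we can create a tensor and convert back and forth to a numpy array:

```python
import torch
import numpy as np

# Create a tensor directly, much like np.array
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

a = t.numpy()             # tensor -> ndarray (shares the same memory)
t2 = torch.from_numpy(a)  # ndarray -> tensor (also shares memory)

print(t.shape)   # torch.Size([2, 2])
print(type(a))   # <class 'numpy.ndarray'>
```

Note that `numpy()` and `from_numpy()` share the underlying memory, so modifying one will modify the other.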

Tensor.view

We can use the Tensor.view() function to reshape tensors, similarly to numpy.reshape().

It can also automatically calculate the correct dimension if a -1 is passed in. This is useful if we are working with batches, but the batch size is unknown.
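A short sketch of both uses (the 12-element tensor is just an arbitrary example):

```python
import torch

x = torch.arange(12)           # shape: torch.Size([12])

# Explicit reshape, like numpy.reshape
print(x.view(3, 4).shape)      # torch.Size([3, 4])

# -1 tells PyTorch to infer that dimension from the others
print(x.view(-1, 4).shape)     # torch.Size([3, 4])
```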

PyTorch as an auto grad framework

The special thing about PyTorch's tensors is that they can automatically compute gradients for us. This is done through computation graphs, which are explained at the end of this notebook.

Consider the function $f(x) = (x-2)^2$.

Q: Compute $\frac{d}{dx} f(x)$ and then compute $f'(1)$.

We make a backward() call on the final output variable (y) in the computation, which computes the gradients of y with respect to all the leaf variables (here, x) at once.
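A minimal sketch of this for $f(x) = (x-2)^2$ at $x = 1$; since $f'(x) = 2(x-2)$, we expect $f'(1) = -2$:

```python
import torch

# requires_grad=True tells PyTorch to track gradients for this tensor
x = torch.tensor(1.0, requires_grad=True)
y = (x - 2) ** 2

y.backward()     # populates x.grad with dy/dx evaluated at x = 1
print(x.grad)    # tensor(-2.)
```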

It can also find gradients of functions.

Let $w = [w_1, w_2]^T$

Consider $g(w) = 2w_1w_2 + w_2\cos(w_1)$

Q: Compute $\nabla_w g(w)$ and verify $\nabla_w g([\pi,1]) = [2, 2\pi - 1]^T$
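Since $\nabla_w g(w) = [2w_2 - w_2\sin(w_1),\; 2w_1 + \cos(w_1)]^T$, evaluating at $[\pi, 1]^T$ gives $[2, 2\pi - 1]^T$. One way to check this with autograd:

```python
import math
import torch

w = torch.tensor([math.pi, 1.0], requires_grad=True)

g = 2 * w[0] * w[1] + w[1] * torch.cos(w[0])
g.backward()

print(w.grad)  # approximately [2, 2*pi - 1]
```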

Using the gradients

Now that we have gradients, we can use our favorite optimization algorithm: gradient descent!

Let $f$ be the same function we defined above.

Q: What is the value of $x$ that minimizes $f$?
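A minimal gradient descent sketch for $f(x) = (x-2)^2$; the starting point ($x = 5$), step size, and iteration count here are arbitrary choices:

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
step_size = 0.25

for _ in range(25):
    y = (x - 2) ** 2
    y.backward()                    # compute f'(x)
    with torch.no_grad():
        x -= step_size * x.grad     # gradient descent step
    x.grad.zero_()                  # clear the gradient before the next iteration

print(x.item())  # converges toward the minimizer x = 2
```

Note the `x.grad.zero_()` call: backward() *accumulates* gradients, so we must reset them between steps.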

Linear Regression

Now, instead of minimizing a made-up function, let's minimize a loss function on some made-up data.

We will implement Gradient Descent in order to solve the task of linear regression.
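The original data-generation cell is not shown here, so as an illustration we can fabricate a dataset ourselves; the seed, sizes, true weights, and noise level below are all arbitrary choices:

```python
import torch

torch.manual_seed(446)
n, d = 50, 2

# Made-up data: y = X w_true + noise
X = torch.randn(n, d)
w_true = torch.tensor([[-1.0], [2.0]])
y = X @ w_true + 0.1 * torch.randn(n, 1)

print(X.shape, y.shape)  # torch.Size([50, 2]) torch.Size([50, 1])
```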

Note: dimensions

PyTorch does a lot of operations on batches of data. The convention is to have your data be of size $(N, d)$, where $N$ is the size of the batch and $d$ is the dimension of a single example.
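For instance (with arbitrary sizes), a batch of 5 examples with 3 features each:

```python
import torch

N, d = 5, 3
X = torch.randn(N, d)  # each row is one example, each column one feature
print(X.shape)         # torch.Size([5, 3])
```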

Sanity check

To verify PyTorch is computing the gradients correctly, let's recall the gradient for the RSS objective:

$$\nabla_w \mathcal{L}_{RSS}(w; X) = \nabla_w\frac{1}{n} ||y - Xw||_2^2 = -\frac{2}{n}X^T(y-Xw)$$

Let's see if they match up:
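A sketch of this sanity check on made-up data (the seed, sizes, and true weights are arbitrary choices), comparing the autograd gradient to the analytic formula above:

```python
import torch

torch.manual_seed(446)
n, d = 50, 2
X = torch.randn(n, d)
w_true = torch.tensor([[-1.0], [2.0]])
y = X @ w_true + 0.1 * torch.randn(n, 1)

# Autograd gradient of the RSS objective at w = 0
w = torch.zeros(d, 1, requires_grad=True)
loss = (1.0 / n) * torch.sum((y - X @ w) ** 2)
loss.backward()

# Analytic gradient: -(2/n) X^T (y - Xw)
manual_grad = -(2.0 / n) * X.T @ (y - X @ w.detach())

print(torch.allclose(w.grad, manual_grad, atol=1e-5))  # → True
```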

Now that we've seen PyTorch is doing the right thing, let's use the gradients!

Linear regression using GD with automatically computed derivatives

We will now use the gradients to run the gradient descent algorithm.

Note: This example is an illustration to connect ideas we have seen before to PyTorch's way of doing things. We will see how to do this in the "PyTorchic" way in the next example.
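A self-contained sketch of this, reusing the same made-up data setup (the seed, sizes, step size, and iteration count are arbitrary choices):

```python
import torch

torch.manual_seed(446)
n, d = 50, 2
X = torch.randn(n, d)
w_true = torch.tensor([[-1.0], [2.0]])
y = X @ w_true + 0.1 * torch.randn(n, 1)

step_size = 0.1
w = torch.zeros(d, 1, requires_grad=True)

for _ in range(500):
    loss = (1.0 / n) * torch.sum((y - X @ w) ** 2)
    loss.backward()                 # compute the gradient of the loss w.r.t. w
    with torch.no_grad():
        w -= step_size * w.grad     # gradient descent step
    w.grad.zero_()                  # reset accumulated gradients

print(w.detach().flatten())  # close to w_true
```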

Appendix - Computation graphs

What's special about PyTorch's tensor object is that it implicitly creates a computation graph in the background. A computation graph is a way of writing a mathematical expression as a graph. There is an algorithm to compute the gradients of all the variables of a computation graph in time of the same order as it takes to compute the function itself.

Consider the expression $e=(a+b)*(b+1)$ with values $a=2, b=1$. We can draw the evaluated computation graph as

(figure: the evaluated computation graph for this expression)

In PyTorch, we can write this as
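A minimal sketch of this expression in PyTorch; here $\partial e/\partial a = b+1 = 2$ and $\partial e/\partial b = (a+b) + (b+1) = 5$:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

e = (a + b) * (b + 1)  # PyTorch builds the computation graph as we go
e.backward()

print(e.item())  # 6.0
print(a.grad)    # tensor(2.)  = de/da
print(b.grad)    # tensor(5.)  = de/db
```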

We can see that PyTorch kept track of the computation graph for us.

Useful links: