{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PyTorch Introduction\n",
    "\n",
    "Today, we will be intoducing PyTorch, \"an open source deep learning platform that provides a seamless path from research prototyping to production deployment\".\n",
    "\n",
    "This notebook is by no means comprehensive. If you have any questions the documentation and Google are your friends.\n",
    "\n",
    "Goal takeaways:\n",
    "- Automatic differentiation is a powerful tool\n",
    "- PyTorch implements common functions used in deep learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "torch.manual_seed(446)\n",
    "np.random.seed(446)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tensors and relation to numpy\n",
    "\n",
    "By this point, we have worked with numpy quite a bit. PyTorch's basic building block, the `tensor` is similar to numpy's `ndarray`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we create tensors in a similar way to numpy nd arrays\n",
    "x_numpy = np.array([0.1, 0.2, 0.3])\n",
    "x_torch = torch.tensor([0.1, 0.2, 0.3])\n",
    "print('x_numpy, x_torch')\n",
    "print(x_numpy, x_torch)\n",
    "print()\n",
    "\n",
    "# to and from numpy, pytorch\n",
    "print('to and from numpy and pytorch')\n",
    "print(torch.from_numpy(x_numpy), x_torch.numpy())\n",
    "print()\n",
    "\n",
    "# we can do basic operations like +-*/\n",
    "y_numpy = np.array([3,4,5.])\n",
    "y_torch = torch.tensor([3,4,5.])\n",
    "print(\"x+y\")\n",
    "print(x_numpy + y_numpy, x_torch + y_torch)\n",
    "print()\n",
    "\n",
    "# many functions that are in numpy are also in pytorch\n",
    "print(\"norm\")\n",
    "print(np.linalg.norm(x_numpy), torch.norm(x_torch))\n",
    "print()\n",
    "\n",
    "# to apply an operation along a dimension,\n",
    "# we use the dim keyword argument instead of axis\n",
    "print(\"mean along the 0th dimension\")\n",
    "x_numpy = np.array([[1,2],[3,4.]])\n",
    "x_torch = torch.tensor([[1,2],[3,4.]])\n",
    "print(np.mean(x_numpy, axis=0), torch.mean(x_torch, dim=0))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `Tensor.view`\n",
    "We can use the `Tensor.view()` function to reshape tensors similarly to `numpy.reshape()`\n",
    "\n",
    "It can also automatically calculate the correct dimension if a `-1` is passed in. This is useful if we are working with batches, but the batch size is unknown."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# \"MNIST\"\n",
    "N, C, W, H = 10000, 3, 28, 28\n",
    "X = torch.randn((N, C, W, H))\n",
    "\n",
    "print(X.shape)\n",
    "print(X.view(N, C, 784).shape)\n",
    "print(X.view(-1, C, 784).shape) # automatically choose the 0th dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computation graphs\n",
    "\n",
    "What's special about PyTorch's `tensor` object is that it implicitly creates a computation graph in the background. A computation graph is a a way of writing a mathematical expression as a graph. There is an algorithm to compute the gradients of all the variables of a computation graph in time on the same order it is to compute the function itself.\n",
    "\n",
    "Consider the expression $e=(a+b)*(b+1)$ with values $a=2, b=1$. We can draw the evaluated computation graph as\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "![tree-img](img/tree-eval.png)\n",
    "\n",
    "[source](https://colah.github.io/posts/2015-08-Backprop/)\n",
    "\n",
    "In PyTorch, we can write this as"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = torch.tensor(2.0, requires_grad=True) # we set requires_grad=True to let PyTorch know to keep the graph\n",
    "b = torch.tensor(1.0, requires_grad=True)\n",
    "c = a + b\n",
    "d = b + 1\n",
    "e = c * d\n",
    "print('c', c)\n",
    "print('d', d)\n",
    "print('e', e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that PyTorch kept track of the computation graph for us."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyTorch as an auto grad framework\n",
    "\n",
    "Now that we have seen that PyTorch keeps the graph around for us, let's use it to compute some gradients for us.\n",
    "\n",
    "Consider the function $f(x) = (x-2)^2$.\n",
    "\n",
    "Q: Compute $\\frac{d}{dx} f(x)$ and then compute $f'(1)$.\n",
    "\n",
    "We make a `backward()` call on the leaf variable (`y`) in the computation, computing all the gradients of `y` at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def f(x):\n",
    "    return (x-2)**2\n",
    "\n",
    "def fp(x):\n",
    "    return 2*(x-2)\n",
    "\n",
    "x = torch.tensor([1.0], requires_grad=True)\n",
    "\n",
    "y = f(x)\n",
    "y.backward()\n",
    "\n",
    "print('Analytical f\\'(x):', fp(x))\n",
    "print('PyTorch\\'s f\\'(x):', x.grad)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can also find gradients of functions.\n",
    "\n",
    "Let $w = [w_1, w_2]^T$\n",
    "\n",
    "Consider $g(w) = 2w_1w_2 + w_2\\cos(w_1)$\n",
    "\n",
    "Q: Compute $\\nabla_w g(w)$ and verify $\\nabla_w g([\\pi,1]) = [2, \\pi - 1]^T$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def g(w):\n",
    "    return 2*w[0]*w[1] + w[1]*torch.cos(w[0])\n",
    "\n",
    "def grad_g(w):\n",
    "    return torch.tensor([2*w[1] - w[1]*torch.sin(w[0]), 2*w[0] + torch.cos(w[0])])\n",
    "\n",
    "w = torch.tensor([np.pi, 1], requires_grad=True)\n",
    "\n",
    "z = g(w)\n",
    "z.backward()\n",
    "\n",
    "print('Analytical grad g(w)', grad_g(w))\n",
    "print('PyTorch\\'s grad g(w)', w.grad)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the gradients\n",
    "Now that we have gradients, we can use our favorite optimization algorithm: gradient descent!\n",
    "\n",
    "Let $f$ the same function we defined above.\n",
    "\n",
    "Q: What is the value of $x$ that minimizes $f$?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = torch.tensor([5.0], requires_grad=True)\n",
    "step_size = 0.25\n",
    "\n",
    "print('iter,\\tx,\\tf(x),\\tf\\'(x),\\tf\\'(x) pytorch')\n",
    "for i in range(15):\n",
    "    y = f(x)\n",
    "    y.backward() # compute the gradient\n",
    "    \n",
    "    print('{},\\t{:.3f},\\t{:.3f},\\t{:.3f},\\t{:.3f}'.format(i, x.item(), f(x).item(), fp(x).item(), x.grad.item()))\n",
    "    \n",
    "    x.data = x.data - step_size * x.grad # perform a GD update step\n",
    "    \n",
    "    # We need to zero the grad variable since the backward()\n",
    "    # call accumulates the gradients in .grad instead of overwriting.\n",
    "    # The detach_() is for efficiency. You do not need to worry too much about it.\n",
    "    x.grad.detach_()\n",
    "    x.grad.zero_()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression\n",
    "\n",
    "Now, instead of minimizing a made-up function, lets minimize a loss function on some made-up data.\n",
    "\n",
    "We will implement Gradient Descent in order to solve the task of linear regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a simple linear dataset with some noise\n",
    "\n",
    "d = 2\n",
    "n = 50\n",
    "X = torch.randn(n,d)\n",
    "true_w = torch.tensor([[-1.0], [2.0]])\n",
    "y = X @ true_w + torch.randn(n,1) * 0.1\n",
    "print('X shape', X.shape)\n",
    "print('y shape', y.shape)\n",
    "print('w shape', true_w.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note: dimensions\n",
    "PyTorch does a lot of operations on batches of data. The convention is to have your data be of size $(N, d)$ where $N$ is the size of the batch of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# visualize the dataset\n",
    "%matplotlib notebook\n",
    "fig = plt.figure()\n",
    "ax = fig.add_subplot(111, projection='3d')\n",
    "\n",
    "ax.scatter(X[:,0].numpy(), X[:,1].numpy(), y.numpy(), c='r', marker='o')\n",
    "\n",
    "ax.set_xlabel('$X_1$')\n",
    "ax.set_ylabel('$X_2$')\n",
    "ax.set_zlabel('$Y$')\n",
    "\n",
    "plt.title('Dataset')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def visualize_fun(w, title, num_pts=20):\n",
    "    \n",
    "    x1, x2 = np.meshgrid(np.linspace(-2,2, num_pts), np.linspace(-2,2, num_pts))\n",
    "    X_plane = torch.tensor(np.stack([np.reshape(x1, (num_pts**2)), np.reshape(x2, (num_pts**2))], axis=1)).float()\n",
    "    y_plane = np.reshape((X_plane @ w).detach().numpy(), (num_pts, num_pts))\n",
    "    \n",
    "    plt3d = plt.figure().gca(projection='3d')\n",
    "    plt3d.plot_surface(x1, x2, y_plane, alpha=0.2)\n",
    "\n",
    "    ax = plt.gca()\n",
    "    ax.scatter(X[:,0].numpy(), X[:,1].numpy(), y.numpy(), c='r', marker='o')\n",
    "\n",
    "    ax.set_xlabel('$X_1$')\n",
    "    ax.set_ylabel('$X_2$')\n",
    "    ax.set_zlabel('$Y$')\n",
    "    \n",
    "    plt.title(title)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_fun(true_w, 'Dataset and true $w$')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sanity check\n",
    "To verify PyTorch is computing the gradients correctly, let's recall the gradient for the RSS objective:\n",
    "\n",
    "$$\\nabla_w \\mathcal{L}_{RSS}(w; X) = \\nabla_w\\frac{1}{n} ||y - Xw||_2^2 = -\\frac{2}{n}X^T(y-Xw)$$\n",
    "\n",
    "Let's see if the match up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define a linear model with no bias\n",
    "def model(X, w):\n",
    "    return X @ w\n",
    "\n",
    "# the residual sum of squares loss function\n",
    "def rss(y, y_hat):\n",
    "    return torch.norm(y - y_hat)**2 / n\n",
    "\n",
    "# analytical expression for the gradient\n",
    "def grad_rss(X, y, w):\n",
    "    return -2*X.t() @ (y - X @ w) / n\n",
    "\n",
    "w = torch.tensor([[1.], [0]], requires_grad=True)\n",
    "y_hat = model(X, w)\n",
    "\n",
    "loss = rss(y, y_hat)\n",
    "loss.backward()\n",
    "\n",
    "print('Analytical gradient', grad_rss(X, y, w).detach().view(2).numpy())\n",
    "print('PyTorch\\'s gradient', w.grad.view(2).numpy())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've seen PyTorch is doing the right think, let's use the gradients!\n",
    "\n",
    "## Linear regression using GD with automatically computed derivatives\n",
    "\n",
    "We will now use the gradients to run the gradient descent algorithm.\n",
    "\n",
    "Note: This example is an illustration to connect ideas we have seen before to PyTorch's way of doing things. We will see how to do this in the \"PyTorchic\" way in the next example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_size = 0.1\n",
    "\n",
    "print('iter,\\tloss,\\tw')\n",
    "for i in range(20):\n",
    "    y_hat = model(X, w)\n",
    "    loss = rss(y, y_hat)\n",
    "    \n",
    "    loss.backward() # compute the gradient of the loss\n",
    "    \n",
    "    w.data = w.data - step_size * w.grad # do a gradient descent step\n",
    "    \n",
    "    print('{},\\t{:.2f},\\t{}'.format(i, loss.item(), w.view(2).detach().numpy()))\n",
    "    \n",
    "    # We need to zero the grad variable since the backward()\n",
    "    # call accumulates the gradients in .grad instead of overwriting.\n",
    "    # The detach_() is for efficiency. You do not need to worry too much about it.\n",
    "    w.grad.detach()\n",
    "    w.grad.zero_()\n",
    "\n",
    "print('\\ntrue w\\t\\t', true_w.view(2).numpy())\n",
    "print('estimated w\\t', w.view(2).detach().numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_fun(w, 'Dataset with learned $w$ (Manual GD)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## torch.nn.Module\n",
    "\n",
    "`Module` is PyTorch's way of performing operations on tensors. Modules are implemented as subclasses of the `torch.nn.Module` class. All modules are callable and can be composed together to create complex functions.\n",
    "\n",
    "[`torch.nn` docs](https://pytorch.org/docs/stable/nn.html)\n",
    "\n",
    "Note: most of the functionality implemented for modules can be accessed in a functional form via `torch.nn.functional`, but these require you to create and manage the weight tensors yourself.\n",
    "\n",
    "[`torch.nn.functional` docs](https://pytorch.org/docs/stable/nn.html#torch-nn-functional).\n",
    "\n",
    "### Linear Module\n",
    "The bread and butter of modules is the Linear module which does a linear transformation with a bias. It takes the input and output dimensions as parameters, and creates the weights in the object.\n",
    "\n",
    "Unlike how we initialized our $w$ manually, the Linear module automatically initializes the weights randomly. For minimizing non convex loss functions (e.g. training neural networks), initialization is important and can affect results. If training isn't working as well as expected, one thing to try is manually initializing the weights to something different from the default. PyTorch implements some common initializations in `torch.nn.init`.\n",
    "\n",
    "[`torch.nn.init` docs](https://pytorch.org/docs/stable/nn.html#torch-nn-init)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d_in = 3\n",
    "d_out = 4\n",
    "linear_module = nn.Linear(d_in, d_out)\n",
    "\n",
    "example_tensor = torch.tensor([[1.,2,3], [4,5,6]])\n",
    "# applys a linear transformation to the data\n",
    "transformed = linear_module(example_tensor)\n",
    "print('example_tensor', example_tensor.shape)\n",
    "print('transormed', transformed.shape)\n",
    "print()\n",
    "print('We can see that the weights exist in the background\\n')\n",
    "print('W:', linear_module.weight)\n",
    "print('b:', linear_module.bias)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Activation functions\n",
    "PyTorch implements a number of activation functions including but not limited to `ReLU`, `Tanh`, and `Sigmoid`. Since they are modules, they need to be instantiated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "activation_fn = nn.ReLU() # we instantiate an instance of the ReLU module\n",
    "example_tensor = torch.tensor([-1.0, 1.0, 0.0])\n",
    "activated = activation_fn(example_tensor)\n",
    "print('example_tensor', example_tensor)\n",
    "print('activated', activated)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sequential\n",
    "\n",
    "Many times, we want to compose Modules together. `torch.nn.Sequential` provides a good interface for composing simple modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d_in = 3\n",
    "d_hidden = 4\n",
    "d_out = 1\n",
    "model = torch.nn.Sequential(\n",
    "                            nn.Linear(d_in, d_hidden),\n",
    "                            nn.Tanh(),\n",
    "                            nn.Linear(d_hidden, d_out),\n",
    "                            nn.Sigmoid()\n",
    "                           )\n",
    "\n",
    "example_tensor = torch.tensor([[1.,2,3],[4,5,6]])\n",
    "transformed = model(example_tensor)\n",
    "print('transformed', transformed.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: we can access *all* of the parameters (of any `nn.Module`) with the `parameters()` method. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = model.parameters()\n",
    "\n",
    "for param in params:\n",
    "    print(param)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loss functions\n",
    "PyTorch implements many common loss functions including `MSELoss` and `CrossEntropyLoss`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mse_loss_fn = nn.MSELoss()\n",
    "\n",
    "input = torch.tensor([[0., 0, 0]])\n",
    "target = torch.tensor([[1., 0, -1]])\n",
    "\n",
    "loss = mse_loss_fn(input, target)\n",
    "\n",
    "print(loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## torch.optim\n",
    "PyTorch implements a number of gradient-based optimization methods in `torch.optim`, including Gradient Descent. At the minimum, it takes in the model parameters and a learning rate.\n",
    "\n",
    "Optimizers do not compute the gradients for you, so you must call `backward()` yourself. You also must call the `optim.zero_grad()` function before calling `backward()` since by default PyTorch does and inplace add to the `.grad` member variable rather than overwriting it.\n",
    "\n",
    "This does both the `detach_()` and `zero_()` calls on all tensor's `grad` variables.\n",
    "\n",
    "[`torch.optim` docs](https://pytorch.org/docs/stable/optim.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a simple model\n",
    "model = nn.Linear(1, 1)\n",
    "\n",
    "# create a simple dataset\n",
    "X_simple = torch.tensor([[1.]])\n",
    "y_simple = torch.tensor([[2.]])\n",
    "\n",
    "# create our optimizer\n",
    "optim = torch.optim.SGD(model.parameters(), lr=1e-2)\n",
    "mse_loss_fn = nn.MSELoss()\n",
    "\n",
    "y_hat = model(X_simple)\n",
    "print('model params before:', model.weight)\n",
    "loss = mse_loss_fn(y_hat, y_simple)\n",
    "optim.zero_grad()\n",
    "loss.backward()\n",
    "optim.step()\n",
    "print('model params after:', model.weight)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the parameter was updated in the correct direction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression using GD with automatically computed derivatives and PyTorch's Modules\n",
    "\n",
    "Now let's combine what we've learned to solve linear regression in a \"PyTorchic\" way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_size = 0.1\n",
    "\n",
    "linear_module = nn.Linear(d, 1, bias=False)\n",
    "\n",
    "loss_func = nn.MSELoss()\n",
    "\n",
    "optim = torch.optim.SGD(linear_module.parameters(), lr=step_size)\n",
    "\n",
    "print('iter,\\tloss,\\tw')\n",
    "\n",
    "for i in range(20):\n",
    "    y_hat = linear_module(X)\n",
    "    loss = loss_func(y_hat, y)\n",
    "    optim.zero_grad()\n",
    "    loss.backward()\n",
    "    optim.step()\n",
    "    \n",
    "    print('{},\\t{:.2f},\\t{}'.format(i, loss.item(), linear_module.weight.view(2).detach().numpy()))\n",
    "\n",
    "print('\\ntrue w\\t\\t', true_w.view(2).numpy())\n",
    "print('estimated w\\t', linear_module.weight.view(2).detach().numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_fun(linear_module.weight.t(), 'Dataset with learned $w$ (PyTorch GD)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression using SGD \n",
    "In the previous examples, we computed the average gradient over the entire dataset (Gradient Descent). We can implement Stochastic Gradient Descent with a simple modification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_size = 0.01\n",
    "\n",
    "linear_module = nn.Linear(d, 1)\n",
    "loss_func = nn.MSELoss()\n",
    "optim = torch.optim.SGD(linear_module.parameters(), lr=step_size)\n",
    "print('iter,\\tloss,\\tw')\n",
    "for i in range(200):\n",
    "    rand_idx = np.random.choice(n) # take a random point from the dataset\n",
    "    x = X[rand_idx] \n",
    "    y_hat = linear_module(x)\n",
    "    loss = loss_func(y_hat, y[rand_idx]) # only compute the loss on the single point\n",
    "    optim.zero_grad()\n",
    "    loss.backward()\n",
    "    optim.step()\n",
    "    \n",
    "    if i % 20 == 0:\n",
    "        print('{},\\t{:.2f},\\t{}'.format(i, loss.item(), linear_module.weight.view(2).detach().numpy()))\n",
    "\n",
    "print('\\ntrue w\\t\\t', true_w.view(2).numpy())\n",
    "print('estimated w\\t', linear_module.weight.view(2).detach().numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_fun(linear_module.weight.t(), 'Dataset with learned $w$ (PyTorch SGD)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Network Basics in PyTorch\n",
    "\n",
    "Let's consider the dataset from hw3. We will try and fit a simple neural network to the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "d = 1\n",
    "n = 200\n",
    "X = torch.rand(n,d)\n",
    "y = 4 * torch.sin(np.pi * X) * torch.cos(6*np.pi*X**2)\n",
    "\n",
    "plt.scatter(X.numpy(), y.numpy())\n",
    "plt.title('plot of $f(x)$')\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we define a simple two hidden layer neural network with Tanh activations. There are a few hyper parameters to play with to get a feel for how they change the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feel free to play with these parameters\n",
    "\n",
    "step_size = 0.05\n",
    "n_epochs = 6000\n",
    "n_hidden_1 = 32\n",
    "n_hidden_2 = 32\n",
    "d_out = 1\n",
    "\n",
    "neural_network = nn.Sequential(\n",
    "                            nn.Linear(d, n_hidden_1), \n",
    "                            nn.Tanh(),\n",
    "                            nn.Linear(n_hidden_1, n_hidden_2),\n",
    "                            nn.Tanh(),\n",
    "                            nn.Linear(n_hidden_2, d_out)\n",
    "                            )\n",
    "\n",
    "loss_func = nn.MSELoss()\n",
    "\n",
    "optim = torch.optim.SGD(neural_network.parameters(), lr=step_size)\n",
    "print('iter,\\tloss')\n",
    "for i in range(n_epochs):\n",
    "    y_hat = neural_network(X)\n",
    "    loss = loss_func(y_hat, y)\n",
    "    optim.zero_grad()\n",
    "    loss.backward()\n",
    "    optim.step()\n",
    "    \n",
    "    if i % (n_epochs // 10) == 0:\n",
    "        print('{},\\t{:.2f}'.format(i, loss.item()))\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_grid = torch.from_numpy(np.linspace(0,1,50)).float().view(-1, d)\n",
    "y_hat = neural_network(X_grid)\n",
    "plt.scatter(X.numpy(), y.numpy())\n",
    "plt.plot(X_grid.detach().numpy(), y_hat.detach().numpy(), 'r')\n",
    "plt.title('plot of $f(x)$ and $\\hat{f}(x)$')\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Things that might help on the homework\n",
    "\n",
    "## Brief Sidenote: Momentum\n",
    "\n",
    "There are other optimization algorithms besides stochastic gradient descent. One is a modification of SGD called momentum. We won't get into it here, but if you would like to read more [here](https://distill.pub/2017/momentum/) is a good place to start.\n",
    "\n",
    "We only change the step size and add the momentum keyword argument to the optimizer. Notice how it reduces the training loss in fewer iterations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feel free to play with these parameters\n",
    "\n",
    "step_size = 0.05\n",
    "momentum = 0.9\n",
    "n_epochs = 1500\n",
    "n_hidden_1 = 32\n",
    "n_hidden_2 = 32\n",
    "d_out = 1\n",
    "\n",
    "neural_network = nn.Sequential(\n",
    "                            nn.Linear(d, n_hidden_1), \n",
    "                            nn.Tanh(),\n",
    "                            nn.Linear(n_hidden_1, n_hidden_2),\n",
    "                            nn.Tanh(),\n",
    "                            nn.Linear(n_hidden_2, d_out)\n",
    "                            )\n",
    "\n",
    "loss_func = nn.MSELoss()\n",
    "\n",
    "optim = torch.optim.SGD(neural_network.parameters(), lr=step_size, momentum=momentum)\n",
    "print('iter,\\tloss')\n",
    "for i in range(n_epochs):\n",
    "    y_hat = neural_network(X)\n",
    "    loss = loss_func(y_hat, y)\n",
    "    optim.zero_grad()\n",
    "    loss.backward()\n",
    "    optim.step()\n",
    "    \n",
    "    if i % (n_epochs // 10) == 0:\n",
    "        print('{},\\t{:.2f}'.format(i, loss.item()))\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_grid = torch.from_numpy(np.linspace(0,1,50)).float().view(-1, d)\n",
    "y_hat = neural_network(X_grid)\n",
    "plt.scatter(X.numpy(), y.numpy())\n",
    "plt.plot(X_grid.detach().numpy(), y_hat.detach().numpy(), 'r')\n",
    "plt.title('plot of $f(x)$ and $\\hat{f}(x)$')\n",
    "plt.xlabel('$x$')\n",
    "plt.ylabel('$y$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CrossEntropyLoss\n",
    "So far, we have been considering regression tasks and have used the [MSELoss](https://pytorch.org/docs/stable/nn.html#torch.nn.MSELoss) module. For the homework, we will be performing a classification task and will use the cross entropy loss.\n",
    "\n",
    "PyTorch implements a version of the cross entropy loss in one module called [CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss). Its usage is slightly different than MSE, so we will break it down here. \n",
    "\n",
    "- input: The first parameter to CrossEntropyLoss is the output of our network. It expects a *real valued* tensor of dimensions $(N,C)$ where $N$ is the minibatch size and $C$ is the number of classes. In our case $N=3$ and $C=2$. The values along the second dimension correspond to raw unnormalized scores for each class. The CrossEntropyLoss module does the softmax calculation for us, so we do not need to apply our own softmax to the output of our neural network.\n",
    "- output: The second parameter to CrossEntropyLoss is the true label. It expects an *integer valued* tensor of dimension $(N)$. The integer at each element corresponds to the correct class. In our case, the \"correct\" class labels are class 0, class 1, and class 1.\n",
    "\n",
    "Try out the loss function on three toy predictions. The true class labels are $y=[1,1,0]$. The first two examples correspond to predictions that are \"correct\" in that they have higher raw scores for the correct class. The second example is \"more confident\" in the prediction, leading to a smaller loss. The last two examples are incorrect predictions with lower and higher confidence respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss = nn.CrossEntropyLoss()\n",
    "\n",
    "input = torch.tensor([[-1., 1],[-1, 1],[1, -1]]) # raw scores correspond to the correct class\n",
    "# input = torch.tensor([[-3., 3],[-3, 3],[3, -3]]) # raw scores correspond to the correct class with higher confidence\n",
    "# input = torch.tensor([[1., -1],[1, -1],[-1, 1]]) # raw scores correspond to the incorrect class\n",
    "# input = torch.tensor([[3., -3],[3, -3],[-3, 3]]) # raw scores correspond to the incorrect class with incorrectly placed confidence\n",
    "\n",
    "target = torch.tensor([1, 1, 0])\n",
    "output = loss(input, target)\n",
    "print(output)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning rate schedulers\n",
    "\n",
    "Often we do not want to use a fixed learning rate throughout all training. PyTorch offers learning rate schedulers to change the learning rate over time. Common strategies include multiplying the lr by a constant every epoch (e.g. 0.9) and halving the learning rate when the training loss flattens out.\n",
    "\n",
    "See the [learning rate scheduler docs](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) for usage and examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convolutions\n",
    "When working with images, we often want to use convolutions to extract features using convolutions. PyTorch implments this for us in the `torch.nn.Conv2d` module. It expects the input to have a specific dimension $(N, C_{in}, H_{in}, W_{in})$ where $N$ is batch size, $C_{in}$ is the number of channels the image has, and $H_{in}, W_{in}$ are the image height and width respectively.\n",
    "\n",
    "We can modify the convolution to have different properties with the parameters:\n",
    "- kernel_size\n",
    "- stride\n",
    "- padding\n",
    "\n",
    "They can change the output dimension so be careful.\n",
    "\n",
    "See the [`torch.nn.Conv2d` docs](https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d) for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate what the `Conv2d` module is doing, let's set the conv weights manually to a Gaussian blur kernel.\n",
    "\n",
    "We can see that it applies the kernel to the image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# an entire mnist digit\n",
    "image = np.array([0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.3803922 , 0.37647063, 0.3019608 ,0.46274513, 0.2392157 , 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.3529412 , 0.5411765 , 0.9215687 ,0.9215687 , 0.9215687 , 0.9215687 , 0.9215687 , 0.9215687 ,0.9843138 , 0.9843138 , 0.9725491 , 0.9960785 , 0.9607844 ,0.9215687 , 0.74509805, 0.08235294, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.54901963,0.9843138 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.7411765 , 0.09019608, 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.8862746 , 0.9960785 , 0.81568635,0.7803922 , 0.7803922 , 0.7803922 , 0.7803922 , 0.54509807,0.2392157 , 0.2392157 , 0.2392157 , 0.2392157 , 0.2392157 ,0.5019608 , 0.8705883 , 0.9960785 , 0.9960785 , 0.7411765 ,0.08235294, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.14901961, 0.32156864, 0.0509804 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.13333334,0.8352942 , 0.9960785 , 0.9960785 , 0.45098042, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.32941177, 0.9960785 ,0.9960785 , 0.9176471 , 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.32941177, 0.9960785 , 0.9960785 , 0.9176471 ,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.4156863 , 0.6156863 ,0.9960785 , 0.9960785 , 0.95294124, 0.20000002, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.09803922, 0.45882356, 0.8941177 , 0.8941177 ,0.8941177 , 0.9921569 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.94117653, 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.26666668, 0.4666667 , 0.86274517,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 ,0.9960785 , 0.9960785 , 0.9960785 , 0.9960785 , 0.5568628 ,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.14509805, 0.73333335,0.9921569 , 0.9960785 , 0.9960785 , 0.9960785 , 0.8745099 ,0.8078432 , 0.8078432 , 0.29411766, 0.26666668, 0.8431373 ,0.9960785 , 0.9960785 , 0.45882356, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.4431373 , 0.8588236 , 0.9960785 , 0.9490197 , 0.89019614,0.45098042, 0.34901962, 0.12156864, 0., 0.,0., 0., 0.7843138 , 0.9960785 , 0.9450981 ,0.16078432, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.6627451 , 0.9960785 ,0.6901961 , 0.24313727, 0., 0., 0.,0., 0., 0., 0., 0.18823531,0.9058824 , 0.9960785 , 0.9176471 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0.07058824, 0.48627454, 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.32941177, 0.9960785 , 0.9960785 ,0.6509804 , 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.54509807, 0.9960785 , 0.9333334 , 0.22352943, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.8235295 , 0.9803922 , 0.9960785 ,0.65882355, 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0.9490197 , 0.9960785 , 0.93725497, 0.22352943, 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.34901962, 0.9843138 , 0.9450981 ,0.3372549 , 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.01960784,0.8078432 , 0.96470594, 0.6156863 , 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0.01568628, 0.45882356, 0.27058825,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0., 0.,0., 0., 0., 0.], dtype=np.float32)\n",
    "image_torch = torch.from_numpy(image).view(1, 1, 28, 28)\n",
    "\n",
    "# a gaussian blur kernel\n",
    "gaussian_kernel = torch.tensor([[1., 2, 1],[2, 4, 2],[1, 2, 1]]) / 16.0\n",
    "\n",
    "conv = nn.Conv2d(1, 1, 3)\n",
    "# manually set the conv weight\n",
    "conv.weight.data[:] = gaussian_kernel\n",
    "\n",
    "convolved = conv(image_torch)\n",
    "\n",
    "plt.title('original image')\n",
    "plt.imshow(image_torch.view(28,28).detach().numpy())\n",
    "plt.show()\n",
    "\n",
    "plt.title('blurred image')\n",
    "plt.imshow(convolved.view(26,26).detach().numpy())\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the image is blurred as expected. \n",
    "\n",
    "In practice, we learn many kernels at a time. In this example, we take in an RGB image (3 channels) and output a 16 channel image. After an activation function, that could be used as input to another `Conv2d` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "im_channels = 3 # if we are working with RGB images, there are 3 input channels, with black and white, 1\n",
    "out_channels = 16 # this is a hyperparameter we can tune\n",
    "kernel_size = 3 # this is another hyperparameter we can tune\n",
    "batch_size = 4\n",
    "image_width = 32\n",
    "image_height = 32\n",
    "\n",
    "im = torch.randn(batch_size, im_channels, image_width, image_height)\n",
    "\n",
    "m = nn.Conv2d(im_channels, out_channels, kernel_size)\n",
    "convolved = m(im) # it is a module so we can call it\n",
    "\n",
    "print('im shape', im.shape)\n",
    "print('convolved im shape', convolved.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Useful links:\n",
    "- [60 minute PyTorch Tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)\n",
    "- [PyTorch Docs](https://pytorch.org/docs/stable/index.html)\n",
    "- [Lecture notes on Auto-Diff](https://courses.cs.washington.edu/courses/cse446/19wi/notes/auto-diff.pdf)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}