{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Where does error come from?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In lecture, we discussed how the source of error comes from one of three places:\n", "* **Irreducible Error:** The noise from the dataset. In our model, `y_i = f(x_i) + epsilon_i`, and it is hopeless to learn the `epsilon_i` since it's random and does not depend on the input.\n", "* **Bias:** How much our expected learned model (expectation is over all possible training sets) differs from the underlying model.\n", "* **Variance:** How dependent the learned model is on the particular dataset it was learned on. If slightly changing the dataset leads to a huge change in the learned model, then variance is high.\n", "\n", "There is a fundamental tradeoff between bias and variance that depends on how complex your model is. Very simple models (constant functions) have high bias since your true function is usually not constant, but have low variance since they generally don't have the complexity to fit the noise of the specific dataset you got. Very complex models (high degree polynomials) have low bias since you can get a decent approximation of the true function **in expectation**, but have high variance since they have the capacity to fit the noise in the data.\n", "\n", "This notebook has some code examples to demonstrate how this bias-variance tradeoff occurs with different model complexities using synthetic data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Are our datasets fixed?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by defining functions to generate synthetic data from some true function `f`. Remember that as machine learning experts, our goal is to learn the parameters for the function `f` (the weights for the polynomial) when we don't have access to the true function. 
All we have access to is a single random training set drawn from the distribution of inputs (some house square footages are more likely than others), with associated values that come from the true function `f` plus some added noise." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import math\n", "import random\n", "\n", "import turicreate as tc\n", "\n", "# You don't need to know how to use these libraries, they just make code easier to write\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we define functions to help us generate our data and to generate our features from the inputs. The `generate_data` function will generate `x` values uniformly at random in [0, 1] and then assign the values using the function `f` and Gaussian noise with mean 0 and standard deviation 0.1. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def f(x):\n", " \"\"\"\n", " True function f(x) (unknown to us)\n", " \"\"\"\n", " return 2 * x - 1.4 * x ** 2\n", "\n", "\n", "def generate_data(n):\n", " \"\"\"\n", " Creates and returns an SFrame with n randomly drawn examples\n", " \"\"\"\n", " xs = [random.uniform(0, 1) for _ in range(n)] # generate n numbers uniformly at random from [0, 1]\n", " ys = [f(x) + random.gauss(0, 0.1) for x in xs] # evaluate f at each x and add Gaussian noise (mean=0, std dev=0.1)\n", " return tc.SFrame({'x': xs, 'y': ys})\n", "\n", "\n", "def polynomial_features(data, col, deg):\n", " \"\"\"\n", " Given a dataset, creates a polynomial expansion of the input column with \n", " the given name, up to the given degree.\n", " \n", " Returns the dataset and the list of column names\n", " \"\"\"\n", " data_copy = data.copy()\n", " if deg == 0:\n", " data_copy[col + '0'] = 0 # all-zero column: the model can then only learn its intercept (a constant)\n", " columns = [col + '0']\n", " else:\n", " columns = []\n", " for i in range(1, deg + 1): # +1 to include deg\n", " col_name = col + str(i)\n", " 
data_copy[col_name] = data_copy[col] ** i\n", " columns.append(col_name)\n", " \n", " return data_copy, columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example dataset we might observe. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['x1', 'x2', 'x3']\n" ] }, { "data": { "text/html": [ "
| x | y | x1 | x2 | x3 |
|---|---|---|---|---|
| 0.965456186496 | 0.673184962247 | 0.965456186496 | 0.932105648042 | 0.89990716437 |
| 0.33925171336 | 0.586799151196 | 0.33925171336 | 0.115091725018 | 0.0390450649059 |
| 0.642249328449 | 0.591406436346 | 0.642249328449 | 0.412484199894 | 0.264917700378 |
| 0.550551028176 | 0.793143265582 | 0.550551028176 | 0.303106434625 | 0.16687555923 |
| 0.964080485265 | 0.613866674785 | 0.964080485265 | 0.929451182069 | 0.89606574664 |
| 0.452910554595 | 0.608035392843 | 0.452910554595 | 0.205127970464 | 0.0929046228658 |
| 0.440052852927 | 0.732238948239 | 0.440052852927 | 0.193646513369 | 0.0852147006672 |
| 0.979798162556 | 0.790497632504 | 0.979798162556 | 0.960004439348 | 0.940610585719 |
| 0.023278071278 | -0.0879068202359 | 0.023278071278 | 0.000541868602424 | 1.26136559505e-05 |
| 0.377110874102 | 0.430799538281 | 0.377110874102 | 0.142212611366 | 0.0536299221805 |