{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overview:\n",
    "We will try to convince you the importance that our assumed model is the same as the \"true\" model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "np.random.seed(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a set of $d$-dimensional data (ignore bias term) $\\{x_i \\in \\mathbb{R}^d \\}_{i=1}^n$ and its label $y$ are generated by the \"true\" function $f(x) = w^T x $. For the purpose of simplicity, we use $d = 1$\n",
    "\n",
    "However each $y$ are observed after it was **corrupted** by a noise $\\epsilon$, due to the imperfection of \"channel\" through which the data is transmitted to us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define weight w and a linear function (model)\n",
    "w = np.array([[3]]) # w.shape = (d, 1)\n",
    "# print(w)\n",
    "\n",
    "f = lambda x, epsilon: x @ w + epsilon\n",
    "\n",
    "# we stack all x_i together and call it X\n",
    "X_train = np.array([np.arange(0, 40, 1)]).T  # create a bunch of data points of dimension 1\n",
    "X_test = np.array([np.arange(-20, 0, 1)]).T  # create a bunch of data points of dimension 1\n",
    "# print(X_train)  # a 40x1 vector\n",
    "# print(X_test)  # a 20x1 vector\n",
    "\n",
    "n_train = len(X_train)\n",
    "n_test = len(X_test)\n",
    "# print(n_train)  # 40\n",
    "# print(n_test)  # 20\n",
    "\n",
    "# define Mean Absolute Error (MAE) loss function\n",
    "mae_loss = lambda y, y_hat: np.mean(np.absolute(y - y_hat))\n",
    "\n",
    "# define Mean Square Error (MSE) loss function\n",
    "mse_loss = lambda y, y_hat: np.linalg.norm(y - y_hat) / len(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, assume the noise are from normal and laplacian distribution and then we can generate the true label from the true function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# epsilon ~ N(0, 1)  (mu = 0, simga = 1)\n",
    "# x^T w + epsilon ~ N(x^T w, 1)\n",
    "gaussian_noise_train = np.random.standard_normal((n_train, 1)) \n",
    "gaussian_noise_test = np.random.standard_normal((n_test, 1)) \n",
    "# print(gaussian_noise_train)  # 40 values\n",
    "# print(gaussian_noise_test)  # 20 values\n",
    "\n",
    "# epsilon ~ Laplace(0, 1)  (mu = 0, b/sigma = 1)\n",
    "# x^T w + epsilon ~ Laplace(x^T w, 1)\n",
    "laplacian_noise_train = np.random.laplace(0, 1, (n_train, 1)) \n",
    "laplacian_noise_test = np.random.laplace(0, 1, (n_test, 1)) \n",
    "# print(laplacian_noise_train)\n",
    "# print(laplacian_noise_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate the true label for gaussian data\n",
    "y_gauss_train = f(X_train, gaussian_noise_train)\n",
    "y_gauss_test = f(X_test, gaussian_noise_test)\n",
    "# print(y_gauss_train)\n",
    "# print(y_gauss_test)\n",
    "\n",
    "# calculate the true label for laplacian data\n",
    "y_laplace_train = f(X_train, laplacian_noise_train)\n",
    "y_laplace_test = f(X_test, laplacian_noise_test)\n",
    "# print(y_laplace_train)\n",
    "# print(y_laplace_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define formula for calculating weights for two models  \n",
    "\n",
    "**P.S.** (beyond scope of the course) regression with MAE loss happens to be a special case of Quatile Regression. Don't worry about it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.regression.quantile_regression import QuantReg\n",
    "gaussian_w = lambda X, y: np.linalg.inv(X.T @ X) @ X.T @ y\n",
    "laplace_w = lambda X, y: QuantReg(exog=X, endog=y).fit(q=0.5)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit the **gaussian model** on data with **gaussian noise** and calculate for errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train MAE: 0.6589226177090796\n",
      "Train MSE: 0.12636001244139547\n",
      "\n",
      "Train MAE: 0.6589226177090796\n",
      "Test MSE: 0.20344548891619837\n"
     ]
    }
   ],
   "source": [
    "# solve for weight\n",
    "w_hat = gaussian_w(X_train, y_gauss_train)\n",
    "\n",
    "# make prediction\n",
    "y_hat_train = X_train @ w_hat\n",
    "\n",
    "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n",
    "print(f\"Train MSE: {mse_loss(y_gauss_train, y_hat_train)}\")\n",
    "print()\n",
    "\n",
    "y_hat_test = X_test @ w_hat\n",
    "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n",
    "print(f\"Test MSE: {mse_loss(y_gauss_test, y_hat_test)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit the **gaussian model** on data with **laplacian noise** and calculate for errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train MAE: 1.0413590095438434\n",
      "Train MSE: 0.287969371414439\n",
      "\n",
      "Test MAE: 0.9884717643878697\n",
      "Test MSE: 0.2674524753762927\n"
     ]
    }
   ],
   "source": [
    "# solve for weight\n",
    "w_hat = gaussian_w(X_train, y_laplace_train) # notice the difference from the previous code block\n",
    "\n",
    "# make prediction\n",
    "y_hat_train = X_train @ w_hat\n",
    "\n",
    "print(f\"Train MAE: {mae_loss(y_laplace_train, y_hat_train)}\")\n",
    "print(f\"Train MSE: {mse_loss(y_laplace_train, y_hat_train)}\")\n",
    "print()\n",
    "\n",
    "y_hat_test = X_test @ w_hat\n",
    "print(f\"Test MAE: {mae_loss(y_laplace_test, y_hat_test)}\")\n",
    "print(f\"Test MSE: {mse_loss(y_laplace_test, y_hat_test)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit the **laplacian model** on data with **laplacian noise** and calculate for errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train MAE: 39.8637892658077\n",
      "Train MSE: 48.82401497379753\n",
      "\n",
      "Test MAE: 19.883051640534287\n",
      "Test MSE: 24.351358328904148\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n",
      "  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n"
     ]
    }
   ],
   "source": [
    "from statsmodels.regression.quantile_regression import QuantReg\n",
    "model = laplace_w(X_train, y_laplace_train)\n",
    "\n",
    "y_hat_train = model.predict(X_train)\n",
    "print(f\"Train MAE: {mae_loss(y_laplace_train, y_hat_train)}\")\n",
    "print(f\"Train MSE: {mse_loss(y_laplace_train, y_hat_train)}\")\n",
    "print()\n",
    "\n",
    "y_hat_test = model.predict(X_test)\n",
    "print(f\"Test MAE: {mae_loss(y_laplace_test, y_hat_test)}\")\n",
    "print(f\"Test MSE: {mse_loss(y_laplace_test, y_hat_test)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we fit the **laplacian model** on data with **gaussian noise** and calculate for errors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train MAE: 39.94617969512774\n",
      "Train MSE: 48.92509187140014\n",
      "\n",
      "Test MAE: 20.14784875618783\n",
      "Test MSE: 24.674018268826366\n"
     ]
    }
   ],
   "source": [
    "model = laplace_w(X_train, y_gauss_train)\n",
    "\n",
    "y_hat_train = model.predict(X_train)\n",
    "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n",
    "print(f\"Train MSE: {mse_loss(y_gauss_train, y_hat_train)}\")\n",
    "print()\n",
    "\n",
    "y_hat_test = model.predict(X_test)\n",
    "print(f\"Test MAE: {mae_loss(y_gauss_test, y_hat_test)}\")\n",
    "print(f\"Test MSE: {mse_loss(y_gauss_test, y_hat_test)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** we may observe non-obvious difference between test error or model based on wrong assumption may seems better, some arguments are:\n",
    "- the variance are not differed by much\n",
    "- the data sample is not big either.\n",
    "- the data are just one dimensional; once we are in higher dimension, we will have the \"curse of dimensionality\"\n",
    "- etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}