{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Overview:\n", "We will try to convince you the importance that our assumed model is the same as the \"true\" model" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from matplotlib import pyplot as plt\n", "np.random.seed(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a set of $d$-dimensional data (ignore bias term) $\\{x_i \\in \\mathbb{R}^d \\}_{i=1}^n$ and its label $y$ are generated by the \"true\" function $f(x) = w^T x $. For the purpose of simplicity, we use $d = 1$\n", "\n", "However each $y$ are observed after it was **corrupted** by a noise $\\epsilon$, due to the imperfection of \"channel\" through which the data is transmitted to us." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# define weight w and a linear function (model)\n", "w = np.array([[3]]) # w.shape = (d, 1)\n", "# print(w)\n", "\n", "f = lambda x, epsilon: x @ w + epsilon\n", "\n", "# we stack all x_i together and call it X\n", "X_train = np.array([np.arange(0, 40, 1)]).T # create a bunch of data points of dimension 1\n", "X_test = np.array([np.arange(-20, 0, 1)]).T # create a bunch of data points of dimension 1\n", "# print(X_train) # a 40x1 vector\n", "# print(X_test) # a 20x1 vector\n", "\n", "n_train = len(X_train)\n", "n_test = len(X_test)\n", "# print(n_train) # 40\n", "# print(n_test) # 20\n", "\n", "# define Mean Absolute Error (MAE) loss function\n", "mae_loss = lambda y, y_hat: np.mean(np.absolute(y - y_hat))\n", "\n", "# define Mean Square Error (MSE) loss function\n", "mse_loss = lambda y, y_hat: np.linalg.norm(y - y_hat) / len(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, assume the noise are from normal and laplacian distribution and then we can generate the true label from the true function" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# epsilon ~ N(0, 1) (mu = 0, simga = 1)\n", "# x^T w + epsilon ~ N(x^T w, 1)\n", "gaussian_noise_train = np.random.standard_normal((n_train, 1)) \n", "gaussian_noise_test = np.random.standard_normal((n_test, 1)) \n", "# print(gaussian_noise_train) # 40 values\n", "# print(gaussian_noise_test) # 20 values\n", "\n", "# epsilon ~ Laplace(0, 1) (mu = 0, b/sigma = 1)\n", "# x^T w + epsilon ~ Laplace(x^T w, 1)\n", "laplacian_noise_train = np.random.laplace(0, 1, (n_train, 1)) \n", "laplacian_noise_test = np.random.laplace(0, 1, (n_test, 1)) \n", "# print(laplacian_noise_train)\n", "# print(laplacian_noise_test)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# calculate the true label for gaussian data\n", "y_gauss_train = f(X_train, gaussian_noise_train)\n", "y_gauss_test = f(X_test, gaussian_noise_test)\n", "# print(y_gauss_train)\n", "# print(y_gauss_test)\n", "\n", "# calculate the true label for laplacian data\n", "y_laplace_train = f(X_train, laplacian_noise_train)\n", "y_laplace_test = f(X_test, laplacian_noise_test)\n", "# print(y_laplace_train)\n", "# print(y_laplace_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define formula for calculating weights for two models \n", "\n", "**P.S.** (beyond scope of the course) regression with MAE loss happens to be a special case of Quatile Regression. Don't worry about it." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from statsmodels.regression.quantile_regression import QuantReg\n", "gaussian_w = lambda X, y: np.linalg.inv(X.T @ X) @ X.T @ y\n", "laplace_w = lambda X, y: QuantReg(exog=X, endog=y).fit(q=0.5) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we fit the **gaussian model** on data with **gaussian noise** and calculate for errors" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train MAE: 0.6589226177090796\n", "Train MSE: 0.12636001244139547\n", "\n", "Train MAE: 0.6589226177090796\n", "Test MSE: 0.20344548891619837\n" ] } ], "source": [ "# solve for weight\n", "w_hat = gaussian_w(X_train, y_gauss_train)\n", "\n", "# make prediction\n", "y_hat_train = X_train @ w_hat\n", "\n", "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n", "print(f\"Train MSE: {mse_loss(y_gauss_train, y_hat_train)}\")\n", "print()\n", "\n", "y_hat_test = X_test @ w_hat\n", "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n", "print(f\"Test MSE: {mse_loss(y_gauss_test, y_hat_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we fit the **gaussian model** on data with **laplacian noise** and calculate for errors" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train MAE: 1.0413590095438434\n", "Train MSE: 0.287969371414439\n", "\n", "Test MAE: 0.9884717643878697\n", "Test MSE: 0.2674524753762927\n" ] } ], "source": [ "# solve for weight\n", "w_hat = gaussian_w(X_train, y_laplace_train) # notice the difference from the previous code block\n", "\n", "# make prediction\n", "y_hat_train = X_train @ w_hat\n", "\n", "print(f\"Train MAE: {mae_loss(y_laplace_train, y_hat_train)}\")\n", "print(f\"Train MSE: {mse_loss(y_laplace_train, y_hat_train)}\")\n", "print()\n", "\n", "y_hat_test = X_test @ w_hat\n", "print(f\"Test MAE: {mae_loss(y_laplace_test, y_hat_test)}\")\n", "print(f\"Test MSE: {mse_loss(y_laplace_test, y_hat_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we fit the **laplacian model** on data with **laplacian noise** and calculate for errors" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train MAE: 39.8637892658077\n", "Train MSE: 48.82401497379753\n", "\n", "Test MAE: 19.883051640534287\n", "Test MSE: 24.351358328904148\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n", " return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n" ] } ], "source": [ "from statsmodels.regression.quantile_regression import QuantReg\n", "model = laplace_w(X_train, y_laplace_train)\n", "\n", "y_hat_train = model.predict(X_train)\n", "print(f\"Train MAE: {mae_loss(y_laplace_train, y_hat_train)}\")\n", "print(f\"Train MSE: {mse_loss(y_laplace_train, y_hat_train)}\")\n", "print()\n", "\n", "y_hat_test = model.predict(X_test)\n", "print(f\"Test MAE: {mae_loss(y_laplace_test, y_hat_test)}\")\n", "print(f\"Test MSE: {mse_loss(y_laplace_test, y_hat_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we fit the **laplacian model** on data with **gaussian noise** and calculate for errors" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train MAE: 39.94617969512774\n", "Train MSE: 48.92509187140014\n", "\n", "Test MAE: 20.14784875618783\n", "Test MSE: 24.674018268826366\n" ] } ], "source": [ "model = laplace_w(X_train, y_gauss_train)\n", "\n", "y_hat_train = model.predict(X_train)\n", "print(f\"Train MAE: {mae_loss(y_gauss_train, y_hat_train)}\")\n", "print(f\"Train MSE: {mse_loss(y_gauss_train, y_hat_train)}\")\n", "print()\n", "\n", "y_hat_test = model.predict(X_test)\n", "print(f\"Test MAE: {mae_loss(y_gauss_test, y_hat_test)}\")\n", "print(f\"Test MSE: {mse_loss(y_gauss_test, y_hat_test)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** we may observe non-obvious difference between test error or model based on wrong assumption may seems better, some arguments are:\n", "- the variance are not differed by much\n", "- the data sample is not big either.\n", "- the data are just one dimensional; once we are in higher dimension, we will have the \"curse of dimensionality\"\n", "- etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }