We want to solve the linear regression problem $${\rm minimize}_w \;\|{\bf X}w-{\bf y}\|_2^2.$$ First, we generate input/output data examples.
There is a subtlety (mainly in notation) in how we handle the constant term.
Given data $x=(x[1],x[2],x[3],\cdots,x[d])$ in $d$ dimensions, one option is to use a linear predictor of the form $$f(x) = w^Tx = w_1x[1] + w_2x[2] + \cdots + w_dx[d].$$
Another option is to use an affine predictor, obtained by appending a one to the data, i.e., $x=(1,x[1],x[2],x[3],\cdots,x[d])$: $$f(x) = w^Tx = w_0 + w_1x[1] + \cdots + w_d x[d].$$
We will use these notations interchangeably, as it should be clear from the context which one we are using
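As a small illustration of the two conventions (a minimal sketch with made-up numbers, not part of the dataset used below):
import numpy as np
x_point = np.array([0.5, -1.2, 2.0])              # a data point in d=3 dimensions
w_lin = np.array([1.0, 0.0, -0.5])                # linear predictor: f(x) = w^T x
f_lin = np.dot(w_lin, x_point)
w_aff = np.array([0.3, 1.0, 0.0, -0.5])           # affine predictor: f(x) = w_0 + w_1 x[1] + ...
f_aff = np.dot(w_aff, np.concatenate(([1.0], x_point)))   # append a one to the data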
# generate training data
import numpy as np
n = 100 # sample size
x = np.random.uniform(-1,1,n) #sample input data
x = np.sort(x)
x[0]=-1 # pin the endpoints so the samples span the full interval [-1,1]
x[n-1]=1
# ground truth is a 5th-order polynomial, with Gaussian noise added
y = (x-.99)*(x-.4)*(x-.25)*(x+.6)*(x+.8) + .03*np.random.randn(n)
# plot the samples and the ground truth
t = np.linspace(-1,1,100)
y0 = (t-.99)*(t-.4)*(t-.25)*(t+.6)*(t+.8)
import matplotlib.pyplot as plt
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.show()
# generate test data
n_ = 100
x_ = np.random.uniform(-1,1,n_)
y_ = (x_-.99)*(x_-.4)*(x_-.25)*(x_+.6)*(x_+.8) + .03*np.random.randn(n_)
Next, we create the data matrix ${\bf X}$, whose $i$-th row is the data point $x_i$ transposed: $${\bf X} = \begin{bmatrix} (x_1)^T\\ (x_2)^T\\ \vdots \\ (x_n)^T \end{bmatrix}$$
We first ignore the input data entirely and fit a constant function $f(x)=w_1$.
To do this, we create a data matrix with $x_i=1$ for every sample; the data matrix is then a single column of ones:
# create a data matrix X
X = np.vstack([np.ones(len(x))]).T
display(X)
Recall that the least squares solution is \begin{eqnarray} \hat{w}_{\rm LS} &=& (X^TX)^{-1}X^T y\\ &=& n^{-1} 1^T y \\ &=& \frac{1}{n}\sum_{i=1}^n y_i \end{eqnarray}
(here $X^TX = n$ and $X^Ty = 1^Ty$ because $X$ is a single column of ones), which is just the average of the sample outcomes. The prediction is therefore $\hat{y} = \hat{w}_{\rm LS} = \frac{1}{n}\sum_{i=1}^n y_i$ for every data point $x$:
$$ \hat{y} = \begin{bmatrix} \hat{y}_1\\ \vdots\\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \hat{w}_{\rm LS}^Tx_1 \\ \vdots \\ \hat{w}_{\rm LS}^T x_n\end{bmatrix} = \hat{w}_{\rm LS} \begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix}$$
Hence, the average is the best constant fit under squared loss,
and the MSE is the variance of the outcome:
$$ \frac1n \sum_{i=1}^n ({\rm ave}(y)-y_i)^2$$
# linear regression with constant feature
w = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh = np.matmul(X,w)  # predicted outputs, shape (n,); using w*X would give an (n,1) array that silently broadcasts to (n,n) in the MSE below
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(x, yh, 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh)**2)
display(MSE_train)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_))]).T
yh_ = np.matmul(X_,w)  # same fix as above: keep the predictions a length-n_ vector
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
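As a quick numerical sanity check of the two claims above (a minimal sketch reusing w, y, and MSE_train computed in this cell):
# the constant least-squares fit equals the sample mean of y
print(np.allclose(w[0], np.mean(y)))
# and its training MSE equals the (biased) sample variance of y
print(np.allclose(MSE_train, np.var(y)))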
For the affine fit, we use $(1,x_i)$ as the input data, and the data matrix ${\bf X}$ is
# create a data matrix X
X = np.vstack([np.ones(len(x)),x]).T
display(X)
The prediction has the form $$\hat{y} = w_1+w_2x.$$ This is called a straight-line fit; if $x$ is time, it is called a trend line.
# linear regression
p=1
w1 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh1= np.matmul(X,w1)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(x, yh1, 'g')
plt.plot(x, yh, 'g-.')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh1)**2)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_)),x_]).T
yh_= np.matmul(X_,w1)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.array([[p,MSE_train,MSE_test]])  # rows: (polynomial degree, train MSE, test MSE)
Polynomial functions are useful features: by Taylor expansion, any smooth function can be approximated by a polynomial of sufficiently large degree.
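As an aside (a sketch only; the helper name poly_features is our own and is not used in the cells below), the monomial features $1, x, x^2, \ldots, x^p$ can be stacked into a data matrix in one line:
def poly_features(u, p):
    # columns are u**0, u**1, ..., u**p, i.e. a Vandermonde matrix of shape (len(u), p+1)
    return np.vstack([u**i for i in range(p+1)]).T
# np.vander(u, p+1, increasing=True) produces the same matrix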
For a degree-2 polynomial, we use $(1,x,x^2)$ as the feature vector.
# create a data matrix X
p=2
X = np.vstack([np.ones(len(x)),x,x**2]).T
display(X)
# degree-2 polynomial linear regression
w2 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh2= np.matmul(X,w2)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(x, yh2, 'g')
plt.plot(x, yh1, 'g-.')
plt.plot(x, yh, 'g-.')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh2)**2)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_)),x_,x_**2]).T
yh_= np.matmul(X_,w2)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
For a degree-3 polynomial, we use the feature vector $(1,x,x^2,x^3)$.
# create a data matrix X
p=3
X = np.vstack([np.ones(len(x)),x,x**2,x**3]).T
# degree-p polynomial linear regression
w3 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh3= np.matmul(X,w3)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(x, yh3, 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh3)**2)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_)),x_,x_**2,x_**3]).T
yh_= np.matmul(X_,w3)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
# create a data matrix X
p=4
X = np.vstack([np.ones(len(x)),x,x**2,x**3,x**4]).T
# degree-p polynomial linear regression
w4 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh4= np.matmul(X,w4)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(t, np.matmul(np.vstack([np.ones(len(t)),t,t**2,t**3,t**4]).T,w4), 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh4)**2)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_)),x_,x_**2,x_**3,x_**4]).T
yh_= np.matmul(X_,w4)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
# create a data matrix X
p=5
X = np.vstack([np.ones(len(x)),x,x**2,x**3,x**4,x**5]).T
# degree-p polynomial linear regression
w5 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh5= np.matmul(X,w5)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(t, np.matmul(np.vstack([np.ones(len(t)),t,t**2,t**3,t**4,t**5]).T,w5), 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh5)**2)
# MSE on fresh data not used in training
X_ = np.vstack([np.ones(len(x_)),x_,x_**2,x_**3,x_**4,x_**5]).T
yh_= np.matmul(X_,w5)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
# create a data matrix X
p=10
X = np.vstack([x**i for i in range(p+1)]).T
# degree-p polynomial linear regression
w10 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh10= np.matmul(X,w10)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(t, np.matmul(np.vstack([t**i for i in range(p+1)]).T, w10), 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh10)**2)
# MSE on fresh data not used in training
X_ = np.vstack([x_**i for i in range(p+1)]).T
yh_= np.matmul(X_,w10)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
# create a data matrix X
p=15
X = np.vstack([x**i for i in range(p+1)]).T
# degree-p polynomial linear regression
w15 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh15= np.matmul(X,w15)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(t, np.matmul(np.vstack([t**i for i in range(p+1)]).T, w15), 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh15)**2)
# MSE on fresh data not used in training
X_ = np.vstack([x_**i for i in range(p+1)]).T
yh_= np.matmul(X_,w15)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
# create a data matrix X
p=20
X = np.vstack([x**i for i in range(p+1)]).T
# degree-p polynomial linear regression
w20 = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T,X)),X.T),y)
yh20= np.matmul(X,w20)
plt.plot(x,y,'o')
plt.plot(t,y0,'k-')
plt.plot(t, np.matmul(np.vstack([t**i for i in range(p+1)]).T, w20), 'g')
plt.show()
MSE_train = (1/float(n))*np.sum((y - yh20)**2)
# MSE on fresh data not used in training
X_ = np.vstack([x_**i for i in range(p+1)]).T
yh_= np.matmul(X_,w20)
MSE_test = (1/float(n_))*np.sum((y_ - yh_)**2)
MSE = np.append(MSE,[[p,MSE_train,MSE_test]],axis=0)
Feature engineering requires domain knowledge about which features are good for the application at hand.
plt.plot(MSE.T[0],MSE.T[1],'o-',label='train error')
plt.plot(MSE.T[0],MSE.T[2],'o-',label='test error')
plt.legend()
plt.show()
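One practical caveat, independent of this particular dataset: for higher-degree polynomial features the matrix $X^TX$ becomes very ill-conditioned, so forming and inverting it explicitly (as the cells above do) can be numerically unstable. np.linalg.lstsq solves the same least-squares problem through a more stable factorization. A minimal sketch, assuming X and y still hold the degree-20 training data from the last fit above:
w20_ls, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.cond(np.matmul(X.T, X)))                  # condition number of X^T X (large means ill-conditioned)
print((1/float(n))*np.sum((y - np.matmul(X, w20_ls))**2)) # training MSE of the more stable solution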