Linear regression

we want to solve the linear regression problem $${\rm minimize}_w \|{\bf X}w-{\bf y}\|_2^2$$ first, we generate input-output data examples
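as a minimal sketch, here is one way to generate synthetic input-output pairs with numpy (the dimensions, the "true" weights, and the noise level below are our own illustrative choices, not fixed by the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                                # number of examples, input dimension
X = rng.normal(size=(n, d))                  # each row is a data point x_i
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)    # outputs = linear model + small noise
```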

there is a subtlety (mainly in notation) in how we handle the constant term

given data $x=(x[1],x[2],x[3],\cdots,x[d])$ in $d$-dimensions, one option is to use a linear predictor of the form $$f(x) = w^Tx = w_1x[1] + w_2x[2] + \cdots + w_dx[d]$$

another option is to use an affine predictor, by appending a one to the data, i.e. $x=(1,x[1],x[2],x[3],\cdots,x[d])$ $$f(x) = w^Tx = w_0 + w_1x[1] + \cdots + w_d x[d]$$
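in code, appending the one is a single concatenation; a small sketch (the numbers below are arbitrary):

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0])           # a data point in d = 3 dimensions
x_aug = np.concatenate(([1.0], x))       # append a one: (1, x[1], x[2], x[3])

w = np.array([0.5, 1.0, 0.0, -2.0])      # weights (w_0, w_1, w_2, w_3)
f_x = w @ x_aug                          # w_0 + w_1 x[1] + w_2 x[2] + w_3 x[3]
```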

We will use these notations interchangeably, as it should be clear from the context which one we are using.

next, we create the data matrix ${\bf X}$ by stacking the data points as rows $${\bf X} = \begin{bmatrix} (x_1)^T\\ (x_2)^T\\ \vdots \\ (x_n)^T \end{bmatrix}$$
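a sketch of building ${\bf X}$ by stacking points as rows (the values below are arbitrary):

```python
import numpy as np

# three data points x_i in d = 2 dimensions
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

X = np.vstack([x1, x2, x3])   # stack the points as rows: shape (n, d) = (3, 2)
```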

Constant fit example

we are going to ignore the input data and try to fit a constant function $f(x)=w_1$

we create a data matrix with $x_i=1$ for all $i$, so the data matrix is $${\bf X} = \begin{bmatrix} 1\\ \vdots\\ 1\end{bmatrix}$$

Recall that the least squares solution is \begin{eqnarray} \hat{w}_{\rm LS} &=& (X^TX)^{-1}X^T y\\ &=& (1^T1)^{-1}1^Ty\\ &=& n^{-1} 1^T y \\ &=& \frac{1}{n}\sum_{i=1}^n y_i \end{eqnarray}

which is just the average of the sample outcomes, and the prediction is $\hat{y}=\hat{w}_{\rm LS} = \frac{1}{n}\sum_{i=1}^n y_i$ for every data point $x$,

$$ \hat{y} = \begin{bmatrix} \hat{y}_1\\ \vdots\\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} \hat{w}_{\rm LS}^Tx_1 \\ \vdots \\ \hat{w}_{\rm LS}^T x_n\end{bmatrix} = \hat{w}_{\rm LS} \begin{bmatrix}1\\ \vdots\\ 1\end{bmatrix}$$

hence, the average is the best constant fit under squared loss
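we can check numerically that least squares on a column of ones recovers the sample average (the data below are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, size=50)                 # sample outcome data

X = np.ones((50, 1))                             # constant-fit data matrix: x_i = 1
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares solution

# w_hat[0] coincides with the sample average of y
```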

and the MSE is the variance of the outcome:

$$ \frac1n \sum_{i=1}^n ({\rm ave}(y)-y_i)^2$$
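a quick numerical check that the MSE of the average fit equals the (population) variance of the outcomes (the values below are arbitrary):

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 5.0])      # arbitrary outcome data
mse = np.mean((y.mean() - y) ** 2)      # MSE of the constant (average) fit

# this equals the population variance of y
```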

for an affine fit, we use $(1,x_i)$ as the input data, and the data matrix ${\bf X}$ is $${\bf X} = \begin{bmatrix} 1 & x_1\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix}$$

the prediction has the form $$\hat{y} = w_1+w_2x$$ this is called a straight-line fit, and if $x$ is time, it is called a trend line
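a sketch of the straight-line fit, assuming scalar inputs (the line parameters and noise level below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)                    # scalar inputs (e.g. time)
y = 1.5 + 0.8 * x + 0.2 * rng.normal(size=30)     # noisy straight line

X = np.column_stack([np.ones_like(x), x])         # rows are (1, x_i)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # (intercept w_1, slope w_2)

y_hat = X @ w_hat                                 # the fitted trend line
```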

Polynomial features

polynomial functions are useful features, as any smooth function can be approximated arbitrarily well by a polynomial of large enough degree, e.g. via its Taylor expansion.

for a degree 2 polynomial, we use $(1,x,x^2)$ as the feature vector

for a degree $p=3$ polynomial, we use the feature vector $(1,x,x^2,x^3)$
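a sketch of fitting a degree $p=3$ polynomial with least squares, assuming scalar inputs (the target function and noise level below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(2.0 * x) + 0.05 * rng.normal(size=40)   # smooth target + noise

p = 3                                              # polynomial degree
X = np.vander(x, p + 1, increasing=True)           # rows are (1, x, x^2, x^3)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)      # least squares coefficients
y_hat = X @ w_hat                                  # polynomial fit
```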

feature engineering requires domain knowledge about which features are good for the application at hand