data from https://web.stanford.edu/~hastie/Papers/LARS/diabetes.sdata.txt

  • 10 explanatory variables for each patient (age, sex, bmi, bp, s1,s2,s3,s4,s5,s6)
  • outcome $y$ is a measure of diabetes progression over 1 year
  • we want to predict $y$ given those 10 features
  • we add a constant feature $x_1=1$, making each input 11-dimensional
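Adding the constant feature amounts to stacking a row of ones onto the data matrix. A minimal sketch with synthetic stand-in data (the real matrix is 10 × 442):

```python
import numpy as np

# synthetic stand-in: 10 features, 5 samples
X = np.random.default_rng(0).normal(size=(10, 5))

# append a constant row of ones, making each input 11-dimensional
X_aug = np.vstack([X, np.ones(X.shape[1])])
print(X_aug.shape)  # (11, 5)
```

With this augmentation the intercept of the linear predictor becomes just another weight.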
In [8]:
import numpy as np
filename = "diabetes.txt"
data = np.loadtxt(filename)
X = data.T[0:10]
Y = data.T[10]
X.shape
Out[8]:
(10, 442)

we first train 10 linear predictors, each one using only a single feature

In [11]:
import matplotlib.pyplot as plt
for i in range(10):
    plt.subplot(1,2,2-(i+1)%2)
    X_ = np.vstack([X[i], np.ones(len(X[i]))]).T
    w1, w0 = np.linalg.lstsq(X_, Y, rcond=None)[0]
    t = np.linspace(np.min(X[i]),np.max(X[i]),20)
    plt.plot(X[i],Y,'.')
    plt.plot(t, w1*t + w0, 'r')
    if ((i+1)%2==0):
        plt.show()

BMI is the best (single) predictor

  • the best single-feature predictor achieves MSE = 3890
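The MSE of a single-feature fit is just the mean of the squared residuals. A sketch on synthetic data (the diabetes file is not bundled here, so the numbers differ from 3890):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # stand-in for the bmi column
y = 3.0 * x + rng.normal(scale=0.5, size=200)  # stand-in outcome

# least-squares fit of y ~ w1*x + w0
A = np.vstack([x, np.ones(len(x))]).T
w1, w0 = np.linalg.lstsq(A, y, rcond=None)[0]

# mean squared error of the fitted line on the training data
mse = np.mean((y - (w1 * x + w0)) ** 2)
print(mse)
```

Running the same computation with `x = X[2]` and `y = Y` on the real data reproduces the figure above.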
In [13]:
# refit on the bmi column (index 2) before plotting; the loop above left
# w1, w0 and t set for the last feature (s6), not bmi
X_ = np.vstack([X[2], np.ones(len(X[2]))]).T
w1, w0 = np.linalg.lstsq(X_, Y, rcond=None)[0]
t = np.linspace(np.min(X[2]), np.max(X[2]), 20)
plt.plot(X[2], Y, '.')
plt.plot(t, w1 * t + w0, 'r')
plt.show()

the best linear predictor using all 10 features

  • we plot $y$ on the horizontal axis and $\hat{y}$ on the vertical axis
  • the closer the points lie to a straight line, the better the predictor
  • the best linear predictor using all 10 features is better than the best single-feature predictor (its scatter plot lies closer to the line)
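This comparison can also be made quantitative: on the training data, the all-feature least-squares fit never has a higher MSE than any single-feature fit, since it could always set the extra weights to zero. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 200))                        # 10 features, 200 samples
y = X.T @ rng.normal(size=10) + rng.normal(size=200)  # outcome uses all features

ones = np.ones(X.shape[1])

# least-squares fit using feature 0 plus an intercept
A1 = np.vstack([X[0], ones]).T
r1 = y - A1 @ np.linalg.lstsq(A1, y, rcond=None)[0]

# least-squares fit using all 10 features plus an intercept
A = np.vstack([X, ones]).T
r = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]

mse_single, mse_all = np.mean(r1 ** 2), np.mean(r ** 2)
print(mse_single, mse_all)  # the all-feature MSE is no larger
```

On held-out data this guarantee disappears, but for this dataset the full model also predicts visibly better.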
In [14]:
# design matrix: all 10 features plus a constant column
A = np.vstack([X, np.ones(X.shape[1])]).T
w = np.linalg.lstsq(A, Y, rcond=None)[0]
Y_ = A @ w
plt.plot(Y, Y_, '.')
plt.plot([-130, 200], [-130, 200], 'r-')  # the line y = y_hat
plt.show()