For this problem you will first implement a decision stump learner, which will later be used as a weak learner for boosting. A decision stump is the best decision tree with a single split. Make sure your classifier properly handles weighted data, so that it can serve as a base learner for boosting. Unlike Assignment 1, the attributes are not binary: they are continuous variables on the interval [0, 1].
The target attribute ("republican") is still binary, but it now takes on values {-1, 1}, not {0, 1}.
Question 1 Briefly describe your strategy for learning on weighted data.
Question 2 What metric did you use to identify the best possible attribute?
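One standard way to handle weighted data (relevant to Questions 1 and 2) is to score every candidate split by its weighted classification error: the total weight of the misclassified examples. A minimal sketch, assuming examples are the dicts produced by generateData below and that the weights sum to 1; the function names and the exhaustive threshold search are illustrative, not required:

```python
def stump_error(data, weights, attr, threshold, polarity):
    # Weighted error of the stump that predicts `polarity` when
    # attr > threshold and -polarity otherwise.
    err = 0.0
    for point, w in zip(data, weights):
        pred = polarity if point[attr] > threshold else -polarity
        if pred != point["republican"]:
            err += w
    return err

def learn_stump(data, weights, attrs=("salary", "age")):
    # Exhaustive search over attributes, candidate thresholds (the
    # observed attribute values), and both polarities; keep the stump
    # with the lowest weighted error.
    best, best_err = None, float("inf")
    for attr in attrs:
        for threshold in sorted(p[attr] for p in data):
            for polarity in (1, -1):
                err = stump_error(data, weights, attr, threshold, polarity)
                if err < best_err:
                    best, best_err = (attr, threshold, polarity), err
    return best, best_err
```

Because only the weighted error enters the search, reweighting the data (as boosting does each round) changes which stump is chosen without any change to the learner itself.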
import random

def generateData(n, AorBorC):
    data = []
    for i in range(n):
        datapoint = {}
        datapoint["salary"] = random.random()
        datapoint["age"] = random.random()
        if AorBorC == "A":
            datapoint["republican"] = -1
            if datapoint["salary"] > 0.75 or datapoint["salary"] < 0.25:
                datapoint["republican"] = 1
        if AorBorC == "B":
            if (datapoint["salary"] > 0.75 or datapoint["salary"] < 0.25) and \
               (datapoint["age"] > 0.75 or datapoint["age"] < 0.25):
                datapoint["republican"] = 1 if random.random() > .05 else -1
            else:
                datapoint["republican"] = -1 if random.random() > .05 else 1
        if AorBorC == "C":
            datapoint["republican"] = 1
            if (datapoint["salary"] - 0.5)**2 + (datapoint["age"] - 0.5)**2 < (.25)**2:
                datapoint["republican"] = -1
        data.append(datapoint)
    return data
Question 3 For each dataset, report the training and testing error for T = 5, 10, and 30 rounds of boosting.
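For reference, the boosting loop itself can be sketched as standard AdaBoost over decision stumps. This is one possible shape, assuming the dict-based data format above; best_stump here is a deliberately minimal inline search (it does not handle the degenerate err > 0.5 case), and all names are illustrative:

```python
import math

def best_stump(data, weights, attrs=("salary", "age")):
    # Minimal stump search: weighted 0/1 error over all
    # (attribute, threshold, polarity) candidates.
    best, best_err = None, float("inf")
    for attr in attrs:
        for thr in set(p[attr] for p in data):
            for pol in (1, -1):
                err = sum(w for p, w in zip(data, weights)
                          if (pol if p[attr] > thr else -pol) != p["republican"])
                if err < best_err:
                    best, best_err = (attr, thr, pol), err
    return best, best_err

def adaboost(data, T):
    n = len(data)
    weights = [1.0 / n] * n          # start with uniform weights
    ensemble = []                    # list of (alpha, stump) pairs
    for _ in range(T):
        stump, err = best_stump(data, weights)
        err = max(err, 1e-10)        # guard against log(0) / division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        attr, thr, pol = stump
        # Reweight: misclassified points go up, correct ones down; normalize.
        new_w = [w * math.exp(-alpha * (pol if p[attr] > thr else -pol)
                              * p["republican"])
                 for p, w in zip(data, weights)]
        z = sum(new_w)
        weights = [w / z for w in new_w]
    return ensemble

def predict(ensemble, point):
    # Sign of the alpha-weighted vote of all stumps.
    s = sum(alpha * (pol if point[attr] > thr else -pol)
            for alpha, (attr, thr, pol) in ensemble)
    return 1 if s >= 0 else -1
```

Training and testing error for Question 3 are then just the fraction of points where predict disagrees with the "republican" label, computed on freshly generated train and test sets for each dataset.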
import random

def generateData(n):
    # Samples for the EM part: a 1-D mixture of two Gaussians.
    mean1 = 2.5
    sigma1 = 1.0
    mean2 = 7.5
    sigma2 = 1.0
    data = []
    for i in range(n):
        data.append(random.gauss(mean1, sigma1))
        data.append(random.gauss(mean2, sigma2))
    return data
What to run:
For your write-up:
Question 6 Describe your initialization strategy. What variables and values did you set (and to what values) to make EM work on this dataset? Did you need to do random restarts?
Question 7 What were your resulting parameter estimates for the clusters?
Question 8 If possible, describe two settings that did not perform well. What were the estimated parameters? Briefly speculate about what may have gone wrong. (We are looking for two different reasons.)
Hints:
Good luck! Have fun!