Project 3: Boosting, Clustering, EM


What to turn in: your code and a write-up of your answers to the questions in this document.
Due Fri Feb 17 at 5pm (the files listed below go in the DropBox; a hard copy of the write-up goes in Lydia's mailbox)

Boosting

For this problem you will first implement a decision stump learner, which will later be used as a weak learner for boosting.

Decision Stump Learner

Implement a decision stump learner, i.e., learn the best decision tree with a single split. Make sure your classifier properly handles weighted data, so that it can be used as a base learner for boosting. Unlike Assignment 1, the attributes are not binary: they are now continuous variables on the interval [0, 1]. The target attribute ("republican") is still binary, but now takes on values {-1, 1} rather than {0, 1}.
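
To make the weighted-data requirement concrete, here is a minimal sketch of one possible stump learner. The exhaustive threshold search and the use of weighted classification error as the split metric are illustrative choices, not the required design:

def learn_stump(data, weights):
    # Hypothetical sketch: try every observed value of each attribute as a
    # threshold and keep the split with the lowest weighted error.
    best_err, best_rule = float("inf"), None
    for attr in ("salary", "age"):
        for thresh in sorted(set(x[attr] for x in data)):
            for sign in (1, -1):  # which side of the threshold predicts +1
                # Weighted error: total weight of the misclassified points.
                err = sum(w for x, w in zip(data, weights)
                          if (sign if x[attr] > thresh else -sign) != x["republican"])
                if err < best_err:
                    best_err, best_rule = err, (attr, thresh, sign)
    attr, thresh, sign = best_rule
    return (lambda x: sign if x[attr] > thresh else -sign), best_err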

Question 1 Briefly describe your strategy for learning on weighted data.
Question 2 What metric did you use to identify the best possible split (attribute and threshold)?

Boosting

Implement boosting over the decision stump learner; a sketch of the standard AdaBoost loop appears after the data-generation code below. Evaluate your implementation on three datasets (download here):

republicansTrainA.py
republicansTestA.py

republicansTrainB.py
republicansTestB.py

republicansTrainC.py
republicansTestC.py

These datasets were generated from the following distributions:

import random

def generateData(n, AorBorC):
    data = []
    for i in range(n):
        datapoint = {}

        # Both attributes are continuous, uniform on [0, 1].
        datapoint["salary"] = random.random()
        datapoint["age"] = random.random()

        if AorBorC == "A":
            # Dataset A: label depends on salary alone (two outer bands).
            datapoint["republican"] = -1
            if datapoint["salary"] > 0.75 or datapoint["salary"] < 0.25:
                datapoint["republican"] = 1

        if AorBorC == "B":
            # Dataset B: label depends on both attributes, with 5% label noise.
            if (datapoint["salary"] > 0.75 or datapoint["salary"] < 0.25) and (datapoint["age"] > 0.75 or datapoint["age"] < 0.25):
                datapoint["republican"] = 1 if random.random() > .05 else -1
            else:
                datapoint["republican"] = -1 if random.random() > .05 else 1

        if AorBorC == "C":
            # Dataset C: label is -1 inside a circle of radius 0.25
            # centered at (0.5, 0.5), and 1 outside it.
            datapoint["republican"] = 1
            if (datapoint["salary"] - 0.5)**2 + (datapoint["age"] - 0.5)**2 < (.25)**2:
                datapoint["republican"] = -1

        data.append(datapoint)

    return data
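
For reference, here is a minimal sketch of the standard AdaBoost loop. It assumes a learn_stump(data, weights) helper with the interface from the stump sketch above; that interface is our assumption, not part of the assignment:

import math

def adaboost(data, T):
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump) pairs
    for t in range(T):
        stump, err = learn_stump(data, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep alpha finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Upweight misclassified points, downweight correct ones, renormalize.
        weights = [w * math.exp(-alpha * x["republican"] * stump(x))
                   for x, w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, stump))
    # Final classifier: weighted vote of the stumps.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
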
Question 3 For each dataset, report the training and testing error for T = 5, 10, and 30 rounds of boosting.

Question 4 Can you get your boosted decision stump learner to overfit on any of the datasets? Briefly justify why, or why not.

Question 5 For each dataset, how many iterations of boosting does it take until training error is 0? (If you think the number is very, very high, just say so and briefly justify your answer. You don't have to actually try millions of iterations!) What is the test error for this case (when possible)? Will test error decrease further if you keep training?

Clustering and EM

Implement EM using a Gaussian Mixture Model to estimate the mean, variance, and mixture probability of each cluster in this dataset. Turn in your code and answer the questions below in your write-up.

The data was generated with the following code:

import random

# Draws n points from each of two Gaussians (so len(data) == 2n),
# interleaving the two clusters.
def generateData(n):
    mean1 = 2.5
    sigma1 = 1.0
    mean2 = 7.5
    sigma2 = 1.0
    data = []
    for i in range(n):
        data.append(random.gauss(mean1, sigma1))
        data.append(random.gauss(mean2, sigma2))
    return data
What to run:
For all experiments, you can fix the number of clusters to 2. You should experiment with two versions of EM. First, set the variances and cluster probabilities by hand, so that you only have to learn the means. Next, learn all of the parameters. For each version, answer the questions below.
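
For concreteness, here is a sketch of one EM iteration for a two-component 1-D GMM; the parameter names and structure are our assumptions, not a required design. For the first version of the experiment you would keep sigmas and priors at their hand-set values and update only the means:

import math

def em_step(data, means, sigmas, priors):
    def gauss_pdf(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    # E step: responsibility of each cluster for each point, normalized
    # so the two responsibilities for a point sum to 1 (see hint 6).
    resp = []
    for x in data:
        r = [priors[k] * gauss_pdf(x, means[k], sigmas[k]) for k in range(2)]
        total = sum(r) or 1e-300  # crude underflow guard (see hint 1)
        resp.append([rk / total for rk in r])

    # M step: re-estimate means, variances, and cluster probabilities
    # from the responsibility-weighted data.
    new_means, new_sigmas, new_priors = [], [], []
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu)**2 for r, x in zip(resp, data)) / nk
        new_means.append(mu)
        new_sigmas.append(math.sqrt(max(var, 1e-6)))  # keep sigma away from 0
        new_priors.append(nk / len(data))
    return new_means, new_sigmas, new_priors

Iterating em_step from a reasonable initialization until the parameters stop changing is one way to structure the run.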

For your write up:

Question 6 Describe your initialization strategy. What variables did you set (and to what values) to make EM work on this dataset? Did you need to do random restarts?

Question 7 What were your resulting parameter estimates for the clusters?

Question 8 If possible, describe two settings that did not perform well. What were the estimated parameters? Briefly speculate about what may have gone wrong. (We are looking for two different reasons.)

Hints:

  1. In general, GMM EM can suffer from overflow and underflow problems. A careful implementation would do the math in log space, but we recommend avoiding this complication for the homework. In the dataset we gave you, these problems should be limited, but it is very possible you will encounter them. You are allowed to hack around the underflow problem; one method is to check whether certain important values are 0 and, if so, set them to something small but non-zero.
  2. You don't necessarily have to use all the data. To take a representative subset of the data, you can use a command such as data = data[:100]
  3. To test the EM portion of your algorithm, seed the parameters you are estimating to "very reasonable" values. Remember, EM can be very sensitive to initial conditions.
  4. Only after you get EM working on hand-chosen seeds should you consider using randomized restarts.
  5. When picking a window to select random seeds from, don't be too generous. EM can be sensitive to initial conditions! (A tiny seeding sketch follows these hints.)
  6. Don't forget to normalize the estimated probabilities in the E step.
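
As a small illustration of hints 4 and 5, here is one way to draw random seeds from a modest window; all names here are hypothetical:

import random

def random_seed(data):
    # Draw initial means from a window no wider than the data itself.
    lo, hi = min(data), max(data)
    means = [random.uniform(lo, hi) for _ in range(2)]
    sigmas = [1.0, 1.0]   # hand-set for the first version of the experiment
    priors = [0.5, 0.5]
    return means, sigmas, priors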

Good luck! Have fun!