Problem Set 4 - Ensembles

Due Mar 9 11:59pm

Your goal for this homework is to implement a decision stump learner, which will later be used as a weak learner for ensembles. Our data was provided by the UCI Machine Learning Repository and can be found in the Molecular Biology (Promoter Gene Sequences) Data Set. It has 106 instances and 57 features. We randomly split the data set into training (71 instances) and validation (35 instances) sets, both of which are well balanced.

You can download the data set, which contains both the training and validation sets. Each DNA sequence is represented by one line, with the first 57 characters (each one of 'a', 'g', 'c', and 't') representing the sequence, and the last character (separated by a space from the sequence) indicating the class ('+' for promoter, '-' for non-promoter).

We provide support code (ensemble.java / ensemble.py) for this problem set. We have implemented the uninteresting functions (e.g., I/O and computing accuracy) for you. Your job is to follow the instructions below and finish the ensemble algorithms.

  • Stump: a decision stump is a decision tree of depth one (i.e., it branches on one attribute and then makes a decision). You can find more details on the Wikipedia page.
    1. (1 point) Implement class Stump by finishing two functions: learning is done in Stump.learn, and classification is done in Stump.classify. Please use misclassification impurity as the evaluation criterion.
    2. (1 point) Question: draw the stump you get when you train on the training set (71 instances), and tell us the accuracy when you apply this stump to the validation set (35 instances).
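To make the criterion concrete, here is a minimal sketch of a stump trained with misclassification impurity. It is illustrative only: the (features, label) instance representation and the function names are assumptions, and your implementation must follow the Stump.learn/Stump.classify interfaces in the support code.

```python
# Illustrative sketch only; the support code's Stump interface takes precedence.
# Assumed representation: an instance is (features, label), where features is
# a list of characters ('a'/'g'/'c'/'t') and label is a boolean (True for '+').

def misclassification_impurity(labels):
    """Misclassification impurity of a node: min(p, 1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return min(p, 1.0 - p)

def learn_stump(instances):
    """Pick the (feature, value) test minimizing the size-weighted impurity
    of the two branches; each branch predicts its majority label."""
    best = None
    n = len(instances)
    for i in range(len(instances[0][0])):
        for v in {feats[i] for feats, _ in instances}:
            match = [lab for feats, lab in instances if feats[i] == v]
            rest = [lab for feats, lab in instances if feats[i] != v]
            score = (len(match) * misclassification_impurity(match)
                     + len(rest) * misclassification_impurity(rest)) / n
            if best is None or score < best[0]:
                best = (score, i, v,
                        2 * sum(match) >= len(match),   # majority label, branch ==
                        2 * sum(rest) >= len(rest))     # majority label, branch !=
    _, i, v, pred_match, pred_rest = best
    return (i, v, pred_match, pred_rest)

def classify_stump(stump, features):
    i, v, pred_match, pred_rest = stump
    return pred_match if features[i] == v else pred_rest
```

This sketch branches on "attribute i equals value v"; since each attribute here takes four values, a four-way split is an equally valid design, as long as the impurity criterion is applied consistently.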

  • Bagging: Given a standard training set D of size N, bagging generates M new training sets Di, each of size L, by sampling examples from D uniformly and with replacement. By sampling with replacement, it is likely that some examples will be repeated in each Di. This kind of sample is known as a bootstrap sample. The M models are trained on the M bootstrap samples and combined by voting. You can find more details on the Wikipedia page.

    1. (1 point) Uniform resampling with replacement: implement the function Ensemble.sampleUniform. The return value of this function is a list of L elements sampled uniformly, with replacement, from the parameter training.
    2. (1 point) Question: Wikipedia tells us that if L = N, then for large N the set Di is expected to contain 63.2% of the unique examples of D, the rest being duplicates (the probability that any given example appears at least once is 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632).
    3. You need to empirically validate this conclusion. Finish Ensemble.uniqueExamples: sample 71 instances from the training set and count how many unique examples there are. Repeat this process 10 times and report the average.
    4. (2 points) Bagging: implement the function Ensemble.baggingClassify, using the Ensemble.sampleUniform you implemented in the previous step. The return value of Ensemble.baggingClassify is a list of predictions (boolean values: positive for '+' and negative for '-') for the instances in the parameter validation.
    5. (0.5 point) Question: tell us your accuracy on the validation instances with L = 71 and M = 20.
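The three pieces above (bootstrap sampling, the unique-example count, and majority-vote bagging) can be sketched as below. The names and the callable learn/classify parameters are assumptions for illustration; your code must use the Ensemble.sampleUniform, Ensemble.uniqueExamples, and Ensemble.baggingClassify signatures from the support code.

```python
import random

def sample_uniform(training, L, rng=random):
    """Bootstrap sample: L draws, uniform and with replacement."""
    return [rng.choice(training) for _ in range(L)]

def average_unique(n, L, trials=10, rng=random):
    """Average number of distinct examples in a size-L bootstrap sample
    drawn from n examples; for L = n = 71 this lands near 0.632 * 71 (about 45)."""
    indices = list(range(n))
    return sum(len(set(sample_uniform(indices, L, rng)))
               for _ in range(trials)) / trials

def bagging_classify(training, validation, M, L, learn, classify, rng=random):
    """Train M models on independent bootstrap samples; predict by majority vote."""
    models = [learn(sample_uniform(training, L, rng)) for _ in range(M)]
    return [sum(1 if classify(m, x) else -1 for m in models) > 0
            for x in validation]
```

Sampling indices rather than instances in average_unique sidesteps the question of whether instances are hashable.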
  • Adaboosting: You have learned about boosting in class. You should follow the AdaBoost pseudocode:

    pseudo code

    Here, the "Learner" above is Stump.

    1. (1 point) Resampling by weight: Learner(S, weights) above means the training set is sampled from the original set according to the weights; a higher weight means an instance has a higher probability of being chosen. Compared to "Uniform resampling with replacement", the only difference is that the distribution is given by the weights instead of being uniform. Finish Ensemble.sampleWeight. Note that the weights do not necessarily sum to 1.
    2. (2 points) Adaboosting: finish Ensemble.adaboostClassify by following the pseudocode and our support code.
    3. (0.5 point) Question: the intuition behind AdaBoost is that the weights of misclassified examples are increased. Do you observe this phenomenon? Give us some examples. (The support code already assigns each instance a unique ID, so you can refer to instances by ID; for example, "the weight of instance XX increased in round X from XX to XX because XXX".)
    4. (1 point) Question: tell us the accuracy with k = 20 rounds, always resampling L = 71 instances each round. Compare it with the accuracy of bagging, and tell us why.
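Putting the pieces together, a resampling-style boosting round looks roughly like the sketch below. It uses the standard AdaBoost weight update (alpha = 0.5 ln((1 - err)/err)); the pseudocode handed out in class takes precedence if it differs, and all names here are illustrative, not the support code's.

```python
import math
import random

def sample_weight(items, weights, L, rng=random):
    """L draws with replacement, probability proportional to weight
    (the weights need not sum to 1)."""
    return rng.choices(items, weights=weights, k=L)

def adaboost(training, labels, rounds, L, learn, classify, rng=random):
    """Returns a list of (model, alpha) pairs for weighted voting."""
    n = len(training)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = sample_weight(list(range(n)), weights, L, rng)
        model = learn([(training[i], labels[i]) for i in idx])
        # weighted error of this round's model on the *original* training set
        err = sum(w for x, y, w in zip(training, labels, weights)
                  if classify(model, x) != y) / sum(weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        # misclassified examples get heavier; correctly classified get lighter
        weights = [w * math.exp(alpha if classify(model, x) != y else -alpha)
                   for x, y, w in zip(training, labels, weights)]
        ensemble.append((model, alpha))
    return ensemble

def adaboost_classify(ensemble, instance, classify):
    score = sum(a if classify(m, instance) else -a for m, a in ensemble)
    return score > 0
```

Logging the per-round weight changes by instance ID while you run this loop gives you exactly the material question 3 above asks for.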
  • Code Grading

    To grade your code, we will replace your Ensemble.main with our own main (which may train/test on a new dataset, print accuracy, check variables, etc.). Our main function is similar to the one we gave you in the support code, so make sure your submitted code works properly without changing *anything* in the given Ensemble.main. (If your code works with the support-code main but fails with the new main, we will manually check the reason, so don't worry.) Of course, you are free to write anything while debugging your code or answering our questions. Try to avoid defining new classes or global variables unless you are very sure they won't break your code. Note: do *NOT* change the given name/type/return/arguments of any variable/function/class in the support code, or you may lose all your points. Since Python does not declare object types explicitly, Python writers may find it useful to check our Java support code for object type information.

    Extra Credit

    In PS 1-3, you implemented three learners: decision tree, Naive Bayes, and logistic regression. Let us call them "strong learners". Pick one of them, replace Stump with this strong learner (you might need to change your previous implementation), and try bagging/adaboosting.

  • (0.5 point) Question: compare the accuracy of strongLearner with stump.
  • (1 point) Question: compare bagging+strongLearner with bagging+stump. Discuss.
  • (1 point) Question: compare adaboosting+strongLearner with adaboosting+stump. Discuss.
  • You don't have support code for the extra credit, so you are free to do anything. You can copy/paste the code in ensemble.java/ensemble.py, but conversely, don't put your extra-credit implementations into ensemble.java/ensemble.py. Create a directory "extra", put everything you want us to see in it, and zip it as extra.zip.

    Submission
  • writeup: You should focus on answering the questions above, and make sure we can quickly locate your answers. We suggest you follow this template: writeup.doc. At most 4 pages.
  • ensemble.java / ensemble.py: finish our support code by following the instructions above. You don't need to create other classes.
  • extra.zip: if you do extra credit
  • Don't submit anything else.