Your goal for this homework is to implement a decision stump learner, which will later be used as a weak learner for ensembles. Our data was provided by the UCI Machine Learning Repository and can be found in the Molecular Biology (Promoter Gene Sequences) Data Set. It has 106 instances and 57 features. We randomly split the data set into training (71 instances) and validation (35 instances) sets, both of which are well balanced.
You can download the data set, which contains the training and validation sets. Each DNA sequence is represented by one line, with the first 57 characters (each one of 'a', 'g', 'c', and 't') representing the sequence, and the last character (separated from the sequence by a space) indicating the class ('+' for promoter, '-' for non-promoter).
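As a concrete illustration of this file format, here is a minimal parsing sketch (the function name `parse_line` is illustrative, not part of the support code):

```python
def parse_line(line):
    """Split one data line into (sequence, label).

    Assumes the format described above: 57 sequence characters,
    then a space, then '+' or '-'.
    """
    seq, label = line.strip().split()
    return seq, label

# Example with a placeholder sequence of 57 'a' characters:
seq, label = parse_line("a" * 57 + " +")
```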
We provide support code (ensemble.java, ensemble.py) for this problem set. We have implemented the uninteresting functions (e.g., I/O and computing accuracy) for you. Your job is to follow the instructions below and finish the ensemble algorithms.
(1 point) Question: draw the stump you get when you train on the training set (71 instances), and report the accuracy when you apply this stump to the validation set (35 instances).
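One way a decision stump over categorical features can work: for each position, predict the majority class per character value, and keep the position with the fewest training mistakes. The sketch below shows this idea; the function names and the (index, rule, default) representation are illustrative assumptions, not the support-code API.

```python
from collections import Counter

def train_stump(X, y):
    """One-level decision tree over categorical features.

    X: list of strings/tuples of equal length; y: list of labels.
    Returns (feature_index, value -> label map, default label).
    """
    default = Counter(y).most_common(1)[0][0]  # overall majority class
    best = None
    for j in range(len(X[0])):
        # Tally labels per value of feature j.
        votes = {}
        for xi, yi in zip(X, y):
            votes.setdefault(xi[j], Counter())[yi] += 1
        # Majority label for each observed value.
        rule = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        correct = sum(rule[xi[j]] == yi for xi, yi in zip(X, y))
        if best is None or correct > best[0]:
            best = (correct, j, rule)
    _, j, rule = best
    return j, rule, default

def stump_predict(stump, x):
    j, rule, default = stump
    return rule.get(x[j], default)  # fall back for unseen values
```

On a toy set like `X = ["aa", "ab", "ba", "bb"]`, `y = ["+", "+", "-", "-"]`, the stump splits on position 0, since that character alone classifies every example.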
Bagging: Given a standard training set D of size N, bagging generates M new training sets Di, each of size L (commonly L = N), by sampling examples from D uniformly and with replacement. By sampling with replacement, it is likely that some examples will be repeated in each Di. This kind of sample is known as a bootstrap sample. The M models are trained on the M bootstrap samples and combined by voting. You can find more details on the Wikipedia page for bootstrap aggregating.
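The bagging procedure above can be sketched generically as follows; `train_learner` and `predict` stand for any learner's train/predict functions (these names, and the list-of-models representation, are assumptions for illustration, not the support-code API):

```python
import random
from collections import Counter

def bagging_train(X, y, M, train_learner, L=None):
    """Train M models, each on a bootstrap sample of size L (default N)."""
    N = len(X)
    L = L if L is not None else N
    models = []
    for _ in range(M):
        # Sample L indices uniformly with replacement.
        idx = [random.randrange(N) for _ in range(L)]
        models.append(train_learner([X[i] for i in idx],
                                    [y[i] for i in idx]))
    return models

def bagging_predict(models, predict, x):
    """Combine the M models by majority vote."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```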
AdaBoost: You have learned boosting in class. You should follow the AdaBoost pseudocode:
Here, the "Learner" above is the Stump.
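A generic sketch of the standard AdaBoost loop, with any weighted weak learner plugged in. `train_weighted` must accept per-example weights; the function names, the '+'/'-' label convention, and the early-stopping choices are illustrative assumptions, so follow the pseudocode and support-code API where they differ.

```python
import math

def adaboost(X, y, M, train_weighted, predict):
    """Run M rounds of boosting; return (models, alphas)."""
    N = len(X)
    w = [1.0 / N] * N                    # uniform initial weights
    models, alphas = [], []
    for _ in range(M):
        h = train_weighted(X, y, w)
        # Weighted training error of this round's weak hypothesis.
        err = sum(wi for wi, xi, yi in zip(w, X, y) if predict(h, xi) != yi)
        if err == 0:                     # perfect weak learner: keep it, stop
            models.append(h)
            alphas.append(1.0)
            break
        if err >= 0.5:                   # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(alpha if predict(h, xi) != yi else -alpha)
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]         # renormalize to a distribution
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, predict, x):
    # Weighted vote, mapping the labels '+'/'-' to +1/-1.
    score = sum(a * (1 if predict(h, x) == "+" else -1)
                for h, a in zip(models, alphas))
    return "+" if score >= 0 else "-"
```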
To grade your code, we will replace your Ensemble.main with our own main (which may train/test on a new dataset, print accuracy, check variables, etc.). Our main function is similar to the one we gave you in the support code, so make sure your submitted code works properly without changing *anything* in the given Ensemble.main. (If your code works with the support-code main but fails with the new main, we will check the reason manually; don't worry.) Of course, you are free to change anything while debugging your code or answering our questions. Try to avoid defining new classes or global variables unless you are very sure they won't interfere with your code. Note: do *NOT* change the given names/types/return values/arguments of any variable/function/class in the support code, or you may lose all your points. Since Python does not declare object types explicitly, Python writers may find it useful to check the Java support code for object-type information.
Extra Credit
In ps 1-3, you implemented three learners: decision tree, naive Bayes, and logistic regression. Let us call them "strong learners". Pick one of them, replace the Stump with this strong learner (you may need to change your previous implementation), and try bagging/AdaBoost.
There is no support code for the extra credit, so you are free to do anything. You may copy/paste code from ensemble.java/ensemble.py, but conversely, do not put your extra-credit implementations into ensemble.java/ensemble.py. Create a directory "extra", put everything you want us to see in it, and zip it.
Submission