CSEP 546: Data Mining (Spring 2017)
Assignment 4: Learning Theory and SVMs

Please submit both code and writeup online by 12:30am PST on Monday, June 5, 2017. Please provide all code (sufficiently commented) so that we can run it ourselves. Submit your writeup as a PDF and limit it to 4 pages with reasonable fonts and margins. Please do not forget to write your name on the first page of your writeup.

Note about the dataset:

In this homework you will use the Bank Default dataset for your experiments in Problems 1 and 4. You can find the dataset in XLS format here. It has also been exported to CSV format (training, testing). A description of the dataset is available at the UCI Machine Learning Repository.

The original 30,000 entries were split into 25,500 (85%) training and 4,500 (15%) testing examples.

Problem 1: Bias and Variance

We will be using your ID3 learner (normal and bagged) on the bank defaulting dataset (training, testing). If you suspect your implementation is faulty, you are free to use an existing implementation of these algorithms (e.g., the J48 learner in the Weka Java libraries). Do not use pruning for this problem.

1.1 Modify your ID3 to accept a maximum-depth parameter, beyond which no further splitting is allowed. Considering the ID3 learner alone, do you expect its bias to increase or decrease with this parameter? What about its variance? Why? Verify your conclusions by actually measuring bias and variance at different maximum depths. At a minimum, try maximum depths of 1, 2, and 3.
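One way the depth cap might be threaded through the learner is sketched below. This is a minimal illustration in Python, not the course's reference code: the function names (id3, predict), the dict-based tree representation, and the example format (dicts keyed by attribute name) are all our own assumptions, and your HW implementation will differ.

```python
# Minimal sketch of a depth-limited ID3 (hypothetical interfaces).
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes, max_depth):
    """Grow a tree; stop splitting once the depth budget hits zero."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or not attributes or len(set(labels)) == 1:
        return majority  # leaf: predict the majority class
    def gain(a):  # information gain of splitting on attribute a
        remainder = 0.0
        for v in set(ex[a] for ex in examples):
            sub = [l for ex, l in zip(examples, labels) if ex[a] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder
    best = max(attributes, key=gain)
    node = {'attr': best, 'default': majority, 'children': {}}
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        node['children'][v] = id3([examples[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attributes if a != best],
                                  max_depth - 1)  # depth budget shrinks
    return node

def predict(node, example):
    """Walk the tree; fall back to the stored majority on unseen values."""
    while isinstance(node, dict):
        node = node['children'].get(example[node['attr']], node['default'])
    return node
```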

1.2 How does the accuracy obtained by ID3 with bagging change as the maximum depth varies? Why? Measure bias and variance in each case. What do the measurements say about your hypothesis? In the previous question you measured the bias and variance of the bare ID3 learner; here you are measuring the bias and variance of the bagged learner. This requires two-layered sampling from the dataset: estimating bias and variance requires drawing training sets in a manner similar to bagging, and each bagged learner in turn draws its own bootstrap samples from its training set.

To learn more about the bias-variance decomposition, see Pedro's ICML paper. Pedro also has a reference implementation. In this homework, generate at least 10 bootstrap samples (30 is recommended if you can afford it) from the original training set and use them when you compute the bias and variance. A sketch of the estimation procedure follows.
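The sketch below shows one way to estimate bias and variance for 0-1 loss in the spirit of the Domingos decomposition: the main prediction at a test point is the majority vote over models trained on bootstrap replicates, bias is the loss of that main prediction, and variance is the average disagreement with it. The names (estimate_bias_variance, learner, model.predict) are assumed interfaces, not from the reference implementation; plug in your bare or bagged ID3 as the learner.

```python
# Hedged sketch: bootstrap bias-variance estimation for 0-1 loss.
import random
from collections import Counter

def estimate_bias_variance(learner, train_set, test_set, n_boot=30, seed=0):
    rng = random.Random(seed)
    # Outer sampling layer: one model per bootstrap replicate of the
    # training set. (For 1.2, each call to `learner` would itself draw
    # inner bootstrap samples and vote -- the two-layered sampling.)
    models = []
    for _ in range(n_boot):
        sample = [rng.choice(train_set) for _ in train_set]
        models.append(learner(sample))
    bias_sum = var_sum = 0.0
    for x, y_true in test_set:
        preds = [m.predict(x) for m in models]
        y_main = Counter(preds).most_common(1)[0][0]   # main prediction
        bias_sum += (y_main != y_true)                 # 0-1 bias at x
        var_sum += sum(p != y_main for p in preds) / len(preds)
    n = len(test_set)
    return bias_sum / n, var_sum / n  # average bias, average variance
```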

Problem 2: PAC Learning

2.0 Mitchell: 7.2.

Problem 3: VC Dimension

3.0 Mitchell: 7.5.

Problem 4: SVMs

4.0 Download LIBSVM, currently the most widely used SVM implementation. Peruse the documentation to understand how to use it.

4.1 Download the bank defaulting dataset (training, testing).

4.2 Run LIBSVM to detect credit defaults. Try the different kernels (0-3) and use the default parameters for everything else. How does the accuracy vary with the choice of kernel?
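A sketch of the kernel sweep using the Python bindings (svmutil) that ship with LIBSVM is shown below. The file names are placeholders for the CSV data after you convert it to LIBSVM's sparse format, and the exact import path depends on how you installed the bindings.

```python
# Sweep LIBSVM's four built-in kernels with default parameters.
from svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('bank_train.libsvm')  # placeholder paths
y_test, x_test = svm_read_problem('bank_test.libsvm')

kernels = {0: 'linear', 1: 'polynomial', 2: 'RBF', 3: 'sigmoid'}
for t, name in kernels.items():
    # -t selects the kernel type; -q suppresses training output;
    # all other parameters are left at their defaults.
    model = svm_train(y_train, x_train, '-t %d -q' % t)
    _, (acc, _, _), _ = svm_predict(y_test, x_test, model)
    print('%s kernel: %.2f%% accuracy' % (name, acc))
```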

4.3 Modify your Naive Bayes classifier from Homework 3 to apply to this dataset. How does its accuracy compare to what can be obtained using LIBSVM? Why?
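One possible adaptation to this dataset's continuous attributes is a Gaussian Naive Bayes, sketched below. The Gaussian assumption is ours, not the assignment's; discretizing features into bins and reusing your HW3 counting code is an equally valid route. Here X is a list of feature vectors and y the matching list of class labels.

```python
# Minimal Gaussian Naive Bayes sketch (assumed adaptation, not HW3 code).
import math

def train_gnb(X, y):
    """Per-class mean/variance for each feature, plus log priors."""
    stats, priors = {}, {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        priors[c] = math.log(len(rows) / len(X))
        feats = []
        for col in zip(*rows):  # one column of values per feature
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # smooth
            feats.append((mu, var))
        stats[c] = feats
    return stats, priors

def predict_gnb(stats, priors, x):
    """Pick the class with the highest log posterior under the Gaussians."""
    def log_post(c):
        return priors[c] + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, stats[c]))
    return max(stats, key=log_post)
```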

4.4 How do your accuracies compare with that of the bagged decision tree from Problem 1? Also measure the bias and variance of SVMs on the bank default dataset and compare them with bagging's. Explain the difference (or lack thereof).