CSE 546 - Data Mining - Spring 2008

CSE 546: Data Mining (Spring 2008)

Assignment 1: Decision tree learning for drug design

Due Date: Monday, April 21, 2008 in class and submit code online (the fourth week)

In this mini-project, you will implement a decision-tree algorithm and apply it to drug design. Thrombin is an enzyme that plays an important rule in coagulation (i.e., blood clotting). Inappropriate coagulation in blood vessels can cause deep vein thrombosis, pulmonary embolism, myocardial infarctions (a.k.a. heart attacks) and strokes. Recently, drug companies start developing medications that deactivate thrombin by binding small molecules to it. The question is what molecules bind well to thrombin. Drug companies have a long list of candidate molecules but it is expensive and time-consuming to test them all in turn. Now, machine learning can help. Chemists and pharmacologists have identified a number of attributes potentially relevant to binding, and their values are known for the candidate molecules. Also, drug companies have tested a few candidates and know whether they bind well, which can serve as the labeled data. The task for you is to develop a decision tree algorithm, learn from data, and predict for unseen molecules whether they could bind well to thrombin.

Our data was provided by DuPont and can be found in KDD Cup 2001 (thrombin task). The original dataset is extremely challenging. It has 139,351 features, which can be difficult to handle. Moreover, the training set contains too few positive examples and is highly imbalanced. So we are providing you with a simplified version: it has only 635 features, and the training and validation sets are better balanced. (We mixed the original training and test sets and randomly split into the new training and validation sets; we then conduct feature selection by filtering out the ones with low mutual information with the class in the training set.)

Download the dataset which contains the training data and validation set. Each molecule is represented by one line, with columns separated by commas: the last one is the class (A for active binding, I for inactive), the rest are the attribute values (0 or 1).
Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. You can assume boolean classes and attributes. You may program in C/C++ or Java. Your program should assume input in the above format. For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.
Implement both accuracy (misclassification impurity) and information gain (entropy impurity) for evaluation criterion. Also, implement split stopping using chi-square test (see Duda et al. 8.3.3 for details). (If you forgot all you learned in Stat 101, check out Appendix A.6 in Duda et al.)
Use your algorithm to train a decision tree classifier and report accuracy on validation. Compare accuracies by varying the evaluation criteria and confidence level in determining split stopping. For the latter, use 1%, 5% and 1 (i.e., you always grow the full tree).
Turn in the following:
- Your code. Submit your files here. The due date is 12PM, April 21, 2008. Your code should contain appropriate comments to facilitate understanding.
- A report of at most 5 pages (letter size, 1 inch margins, 12pt font) that describes:
  - A high-level description on how your code works.
  - The accuracies you obtain under various settings.
  - Explain which options work well and why.
  - If all your accuracies are low, tell us what you have tried to improve the accuracies and what you suspect is failing.