Due Date: Monday, April 21, 2008, in class; submit your code online (the fourth week)
In this mini-project, you will implement a decision-tree algorithm and apply it to drug design. Thrombin is an enzyme that plays an important role in coagulation (i.e., blood clotting). Inappropriate coagulation in blood vessels can cause deep vein thrombosis, pulmonary embolism, myocardial infarctions (a.k.a. heart attacks), and strokes. Recently, drug companies have started developing medications that deactivate thrombin by binding small molecules to it. The question is which molecules bind well to thrombin. Drug companies have a long list of candidate molecules, but testing them all in turn is expensive and time-consuming. This is where machine learning can help. Chemists and pharmacologists have identified a number of attributes potentially relevant to binding, and the values of these attributes are known for the candidate molecules. In addition, drug companies have tested a few candidates and know whether they bind well; these tested candidates can serve as labeled data. Your task is to develop a decision-tree algorithm, learn a tree from the labeled data, and predict whether unseen molecules bind well to thrombin.
Our data was provided by DuPont and is available from KDD Cup 2001 (thrombin task). The original dataset is extremely challenging. It has 139,351 features, which can be difficult to handle. Moreover, the training set is highly imbalanced and contains very few positive examples. So we are providing you with a simplified version: it has only 635 features, and the training and validation sets are better balanced. (We mixed the original training and test sets and randomly split them into new training and validation sets; we then performed feature selection by filtering out the features that have low mutual information with the class in the training set.)
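To make the feature-selection step concrete, here is a minimal sketch of filtering features by their mutual information with the class. It is an illustration only, not the exact procedure used to prepare the data: it assumes discrete feature values, and the function names `mutual_information` and `select_features`, the toy data, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (x, y) pairs
    px = Counter(feature)                  # counts of feature values
    py = Counter(labels)                   # counts of class values
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def select_features(rows, labels, threshold):
    """Indices of columns whose mutual information with the class
    is at least `threshold` (an assumed, illustrative cutoff)."""
    keep = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if mutual_information(column, labels) >= threshold:
            keep.append(j)
    return keep


# Toy example: feature 0 matches the class exactly, feature 1 is independent of it.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
print(select_features(rows, labels, threshold=0.5))
```

In this toy example only the first feature carries information about the class, so it is the only one kept. Note that the scores are computed from the training set alone, so the validation set does not influence which features survive.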
Good luck, and have fun!