In this mini-project, you will implement a decision-tree algorithm and apply it to molecular biology. Promoter is a region of DNA that facilitates the transcription of a particular gene. Promoter compilations and analyses have led to computer programs which predict the location of promoter sequences on the basis of homology either to the consensus sequence or to a reference list of promoters. Such programs are of practical significance in searching new sequences. The task for you is to develop a decision tree algorithm, learn from data, and predict for unseen DNA sequences whether they are promoters or non-promoters.
Our data was provided by UCI Machine Learning Repository and can be found in Molecular Biology (Promoter Gene Sequences) Data Set. It has 106 instances and 57 features. We randomly split the data set into the training (71 instances) and validation (35 instances) sets, which are both well balanced.
|
- Download the data set which contains the training data and validation sets. Each DNA sequence is represented by one line, with the first 57 characters (one of 'a', 'g', 'c' and 't') representing the sequence, and the last character (separated by a space from the sequence) indicating the class ('+' for promoter, '-' for non-promoter).
- Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. Another very good description of the algorithm can be found in the original ID3 paper [1], which is pretty readable. (Download pdf here. UW IP address may be required.) You may program in C/C++ or Java. Your program should assume input in the above format. For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.
- Implement both accuracy (misclassification impurity) and information gain (entropy impurity) for evaluation criterion. Also, implement split stopping using chi-square test. The test-statistic is:
where pi is the number of promoters in the i-th child, p'i is the expected number of promoters in the i-th child, ni is the number of non-promoters in the i-th child and n'i is the expected number of non-promoters in the i-th child. The degree of freedom is v-1. For more detailed explanation, see 'page 93' of the original ID3 paper [1]. Duda et al. 8.3.3 also provides some details on chi-square test. (If you forgot all you learned in Stat 101, check out Appendix A.6 in Duda et al.)
- Use your algorithm to train a decision tree classifier and report accuracy on validation. Compare accuracies by varying the evaluation criteria and confidence level in determining split stopping. For the latter, use 99%, 95% and 0% (i.e., you always grow the full tree).
-
Turn in the following:
- Your code. Submit your files here. The due date is 12PM, Jan 21, 2010. Your code should contain appropriate comments to facilitate understanding.
- A report of at most 4 pages (letter size, 1 inch margins, 12pt font) that
describes:
- A high-level description on how your code works.
- The accuracies you obtain under various settings.
- Explain which options work well and why.
- If all your accuracies are low, tell us what you have tried to improve the accuracies and what you suspect is failing.
Reference
[1] J. R. Quinlan, 'Induction of decision trees.' Machine Learning 1 (1) (1986): 81-106
Good luck, and have fun!
|