Homework 4B
Machine Learning Assignment
Assigned 11/28/01
Notes:
You may work in teams of up to three people for this assignment. You may
*not* work with anyone you have already worked with in this class.
Due Date: Monday December 17th
Assignment:
1. Write a Naive Bayes classifier. Look at the format of the data sets
before you start programming.
2. Download and get C4.5 working.
(http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz)
3. Download the following datasets from the UCI machine learning website
(http://www1.ics.uci.edu/~mlearn/MLSummary.html):
Chess Endgames (king-rook vs king-pawn)
Congressional Voting Records
Echocardiogram
Hepatitis
Horse Colic
Hypothyroid
Labor Relations
Lung Cancer
Post-operative Patient
Promoter Gene Sequences
Sonar
Soybean
4. Perform a 10-fold cross-validation accuracy study of your Naive Bayes
classifier and Quinlan's C4.5 decision tree implementation. For each of
the datasets above, report the mean and standard deviation of each
algorithm's accuracy across the ten folds. Do not differentiate between
types of errors (for example, false positives and false negatives).
Discretize any continuous attributes into 10 equal-sized bins before
training/testing your machine learners. Make sure you include the -s
option when you run C4.5.
5. Create two additional datasets with at least 100 examples each. Using
the same procedures as above, demonstrate that your Naive Bayes
implementation is more accurate on the first dataset and that C4.5 is more
accurate on the second dataset. If the ranges of mean accuracy plus or
minus one standard deviation overlap for the two machine learners on a
dataset, then in addition to the mean and standard deviation, demonstrate
the statistical significance of your results using an appropriate test
(for example, a paired t-test or a paired Wilcoxon signed-rank test).
6. Turn in a hard copy of the table of results from part 4, a discussion
of the characteristics of the datasets in part 5, and a cover sheet
identifying the group members and their email addresses, with signatures
acknowledging that this is your group's independent work.
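
The evaluation bookkeeping in parts 4 and 5 (binning, fold assignment,
per-fold summary statistics, and a paired t statistic) might be sketched
as below. This is only an illustration under stated assumptions, not a
required design: all function names are hypothetical, "equal sized" is
read here as equal width (equal frequency is another possible reading),
missing values such as "?" are not handled, and the classifiers
themselves are left out.

```python
import math
import random
import statistics


def equal_width_bins(values, n_bins=10):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width bins.

    Assumption: "equal sized" means equal width over [min, max].
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: everything falls into bin 0
        return [0 for _ in values]
    span = hi - lo
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) * n_bins / span), n_bins - 1) for v in values]


def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle example indices and deal them into k disjoint test folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_and_std(fold_accuracies):
    """Mean and sample standard deviation of the per-fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over matching folds (degrees of freedom: folds - 1)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all fold differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

Each fold from k_fold_indices serves once as the test set while the
remaining examples form the training set, and both learners should be
scored on the same folds so that the per-fold accuracies are paired. With
10 folds the t statistic has 9 degrees of freedom; a two-sided test at the
0.05 level compares its absolute value against roughly 2.262.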