For this project we'll be using WEKA, a machine learning package written in Java. It implements a large variety of learners and can automatically evaluate them using cross-validation.

You can install WEKA and the Java runtime environment from this location.

WEKA is easy to run from the command line. Here's an example:

java -cp weka.jar weka.classifiers.neural.NeuralNetwork -t data\iris.arff

This runs the neural network classifier on the "data\iris.arff" file, prints the learned model, and evaluates it using cross-validation.
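If you prefer to drive WEKA from Java code rather than the command line, the same evaluation can be done through its programmatic API. Below is a minimal sketch assuming a recent WEKA release, in which the classifier packages differ from the commands in this handout (for example, J48 lives under weka.classifiers.trees); adjust the imports to match your version.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; by convention the last attribute is the class.
        Instances data = new Instances(new BufferedReader(new FileReader("data/iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Ten-fold cross-validation, matching the command-line default;
        // on the command line the -x flag changes the number of folds.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}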

You can also download extra datasets from the UCI Machine Learning Repository in WEKA's ARFF file format (the datasets are described here).
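For reference, ARFF is a plain-text format: a header that names the relation and declares each attribute, followed by the data rows. Here is a tiny hypothetical example (for illustration only, not one of the repository datasets):

% toy weather data, for illustration only
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, 83, yes
rainy, 65, yes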

Run the following three classifiers on the labor data included with WEKA.

This is the J48 decision tree classifier, WEKA's implementation of C4.5.

java -cp weka.jar weka.classifiers.j48.J48 -t data\labor.arff

This is the neural network classifier.

java -cp weka.jar weka.classifiers.neural.NeuralNetwork -t data\labor.arff

This is a simple Naive Bayes classifier.

java -cp weka.jar weka.classifiers.NaiveBayes -t data\labor.arff
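The "naive" in Naive Bayes refers to the assumption that attributes are conditionally independent given the class, so the class posterior factors as

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

and the classifier predicts the class that maximizes this product.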

Note the model, training set accuracy, and cross-validation accuracy in the output of each run.

  1. Rank the classifiers in terms of accuracy on the training set.
  2. Rank the classifiers in terms of accuracy on the cross-validation test sets.
  3. Rank the classifiers in terms of model size (select a measurement of model size that you feel is reasonable).
  4. Rank the classifiers in terms of learning time.
  5. Which classifier has the greatest discrepancy between training set and cross-validation test set accuracy? Why might that be?

Pick a dataset in ARFF format from the UCI machine learning datasets, run the experiments described above, and answer the same questions. You can extract the data files from datasets-UCI.jar with the following command.

jar xvf datasets-UCI.jar
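Once extracted, you can point any of the commands above at the file you chose, for example (assuming the archive contains a file named vote.arff; substitute your own dataset):

java -cp weka.jar weka.classifiers.j48.J48 -t vote.arff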