Datasets

The complete training dataset has 234k instances, which may be too large for development purposes. We are providing a subset (40k instances) that has been discretized so that each numerical attribute takes one of five bin values. For this assignment, it is enough to use only this discretized subset. You are encouraged to use the complete data if possible, but doing so requires implementing support for numerical attributes. You can expect better results if you do.

Discretized data:
Training Data Subset (40k instances, 115MB after unzipping)
Test Data, using same bins as above (25k instances, 72MB after unzipping)
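
To illustrate what "discretized into five bins" and "using same bins" mean in practice, here is a minimal sketch. The assignment does not say how the provided bins were computed, so equal-width binning is an assumption made purely for illustration. The function names (fit_bin_edges, to_bin) and the sample values are hypothetical. The key point is that cut points are derived from the training values only and then reused unchanged on the test values.

```python
from bisect import bisect_right

def fit_bin_edges(values, n_bins=5):
    """Return the n_bins - 1 interior cut points for equal-width bins.

    Assumption: equal-width binning; the provided files may use another scheme.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def to_bin(value, edges):
    """Map a value to a bin index in 0..len(edges).

    Values below the first cut point fall into bin 0 and values above the
    last cut point fall into the last bin, so unseen test values still
    receive a valid bin.
    """
    return bisect_right(edges, value)

# Made-up values for a single numerical attribute.
train_values = [18, 25, 33, 47, 52, 68, 90]
test_values = [10, 40, 95]

# Cut points come from the training data only ...
edges = fit_bin_edges(train_values)
train_bins = [to_bin(v, edges) for v in train_values]
# ... and the same cut points are applied to the test data.
test_bins = [to_bin(v, edges) for v in test_values]

print(edges)
print(train_bins)
print(test_bins)
```

If you work with the unfiltered data and choose to discretize it yourself, reusing the training cut points on the test set in this way keeps the bin indices comparable between the two files.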

Unfiltered data, containing both nominal and numerical attributes (not required for the assignment):
Training Data Subset (40k instances, 40MB after unzipping)
Training Data Complete (234k instances, 250MB after unzipping)
Test Data (25k instances, 25MB after unzipping)