Click Prediction


This dataset is from KDD Cup 2012 Track 2, where the task is to predict the click-through rate of ads given a query, the ad (link) information, and user information. Please click here for a detailed description of the data.
The original dataset is divided into 3 parts: training data, testing data, and maps from feature IDs to features. The training set has 150M instances, and the testing set has 20M instances. We subsampled and simplified this dataset by joining the training and testing data with the feature maps.
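As a rough illustration of what joining instances with a feature map involves, the sketch below replaces a feature ID with the feature it refers to. The field names (`query_id`, `clicks`, `impressions`), the sample values, and the helper `join_with_feature_map` are all hypothetical and chosen only for illustration; the actual column layout is given in the data description.

```python
def join_with_feature_map(instances, id_field, feature_map, out_field):
    """Return copies of the instances with the mapped feature attached.

    IDs missing from the map yield None rather than raising an error.
    """
    joined = []
    for row in instances:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[out_field] = feature_map.get(row[id_field])
        joined.append(new_row)
    return joined


# Hypothetical feature map: query ID -> query text.
query_map = {101: "cheap flights", 102: "running shoes"}

# Hypothetical instances, each with click and impression counts.
instances = [
    {"clicks": 1, "impressions": 4, "query_id": 101},
    {"clicks": 0, "impressions": 2, "query_id": 102},
]

joined = join_with_feature_map(instances, "query_id", query_map, "query")

# Click-through rate per instance: clicks divided by impressions.
ctr = [row["clicks"] / row["impressions"] for row in joined]
```

After the join, each instance carries the feature itself alongside its ID, so downstream code does not need the separate map files.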

Wikipedia & BBC

This dataset contains a small subset of documents from Wikipedia and BBC news. The Wikipedia dataset includes both the text and a tf-idf matrix; the BBC dataset comes with word counts only.
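Since the BBC data provides word counts only, tf-idf weights would have to be computed from them. The sketch below uses one common variant (term frequency normalized by document length, multiplied by the log of the inverse document frequency). It is not necessarily the weighting used to build the Wikipedia matrix, and the function name and sample documents are made up for illustration.

```python
import math


def tfidf(counts):
    """Convert per-document word counts into tf-idf weights.

    counts: list of dicts mapping term -> raw count, one dict per document.
    """
    n_docs = len(counts)

    # Document frequency: number of documents containing each term.
    df = {}
    for doc in counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1

    weighted = []
    for doc in counts:
        total = sum(doc.values())
        weighted.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in doc.items()}
        )
    return weighted


# Two toy documents given as word counts.
docs = [{"goal": 2, "match": 2}, {"match": 1, "vote": 3}]
weights = tfidf(docs)
```

A term that appears in every document, such as "match" here, receives weight zero under this variant, while terms specific to one document are weighted up.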

fMRI Brain Imaging Data


This dataset contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 360 trials of a word reading task. Each image contains approximately 21,000 voxels (3D pixels) across a large portion of the brain. Data is available for 1 human subject.



Netflix

This dataset is the training set of the Netflix Challenge, containing 99,072,112 ratings given by users to movies, identified by user ID and movie ID. Each line has the form "userid movieid rating". Files: netflix_mm.gz, MovieInfo.mat
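A minimal sketch of reading the "userid movieid rating" line format and computing a mean rating per movie is shown below. It assumes whitespace-separated fields and no header lines, which should be checked against the actual file; the function names and sample lines are illustrative only.

```python
from collections import defaultdict


def parse_rating(line):
    """Parse one 'userid movieid rating' line into (int, int, float)."""
    user_id, movie_id, rating = line.split()
    return int(user_id), int(movie_id), float(rating)


def mean_rating_per_movie(lines):
    """Stream over rating lines and return {movie_id: mean rating}."""
    totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [rating sum, count]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, movie_id, rating = parse_rating(line)
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}


# Made-up sample lines in the stated format.
sample = ["1 10 4", "2 10 2", "1 20 5"]
means = mean_rating_per_movie(sample)
```

Because the function accepts any iterable of lines, the compressed file could be streamed with `gzip.open("netflix_mm.gz", "rt")` from the standard library instead of being loaded into memory at once.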