Description
This dataset is from 2012 KDD Cup Track 2, where the task is to predict the click-through rate of ads given a query, the ad (link) information, and user information. Please click here for a detailed description of the data.
The original dataset is divided into 3 parts: training, testing, and maps from feature id to features. The training set has 150M instances, and the testing data has 20M instances. We subsampled and simplified this dataset by joining the training and testing data with the feature maps.
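As a rough illustration of what joining instances with a feature map and subsampling can look like, here is a minimal sketch. The field names (`query_id`, `query_tokens`), the function names, and the sampling scheme are hypothetical and are not taken from the actual KDD Cup files or from the procedure used to build this dataset.

```python
import random


def join_with_feature_map(instances, feature_map):
    """Attach the mapped feature to each instance (hypothetical field names)."""
    joined = []
    for inst in instances:
        row = dict(inst)  # copy so the input instances are not modified
        # Look up the feature for this id; fall back to an empty list if missing.
        row["query_tokens"] = feature_map.get(inst["query_id"], [])
        joined.append(row)
    return joined


def subsample(instances, fraction, seed=0):
    """Keep each instance independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [inst for inst in instances if rng.random() < fraction]
```

Joining first and sampling afterwards keeps the sketch simple; on data of the original size it would likely be cheaper to sample first and only then look up the features.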
Wikipedia & BBC
This dataset contains a small subset of documents from Wikipedia and BBC news. The Wikipedia dataset has both the text and a tf-idf matrix; the BBC dataset comes with word counts only.
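Since the BBC data provides word counts only, tf-idf weights would have to be derived from them if needed. The sketch below shows one common variant (term frequency normalized by document length, times the log of the inverse document frequency). The function name and the input layout are assumptions, and the weighting used for the provided Wikipedia matrix is not stated here, so the results may not match it.

```python
import math


def tfidf_from_counts(counts):
    """counts: list of dicts mapping word -> count, one dict per document."""
    n_docs = len(counts)

    # Document frequency: in how many documents each word appears.
    df = {}
    for doc in counts:
        for word in doc:
            df[word] = df.get(word, 0) + 1

    weights = []
    for doc in counts:
        total = sum(doc.values())  # document length in tokens
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in doc.items()
        })
    return weights
```

With this variant, a word that occurs in every document gets weight zero, because its inverse document frequency is log(1).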