Important Dates

- Project proposal due: Wednesday, October 23rd, 9:00am
- Project milestone due: November 11th, 9:00am
- Poster session: December 4th, 3:00-5:00pm
- Project report due: December 9th, 9:00am

Your Course Project

Your class project is an opportunity to explore an interesting machine learning problem in the context of a real-world data set. We will provide some seed project ideas. You can pick one of these ideas and explore the data and algorithms within and beyond what we suggest. You can also use your own data and ideas, but in that case you must make sure the data is available now and that you have a clear roadmap, since a quarter is too short to explore a brand-new concept from scratch.

Projects can be done individually or in teams of two students. You can discuss your ideas and approach with the instructors, but the final responsibility for defining and executing an interesting piece of work is, of course, yours.

The final project is worth 20% of your grade, which will be split amongst three deliverables:

Your project will be evaluated by three criteria:

Technical depth and scope are complementary criteria. For example, if you develop a single elaborate algorithm or model on a small dataset, you may score high on depth but low on scope; if you instead try many simple methods on different datasets, your scope will be higher but your depth lower.


Project Proposal

You must turn in a project proposal by Wednesday, October 23rd at 9:00am through Catalyst.

Read the list of available data sets and potential project ideas below. If you prefer to use a different data set, we will consider your proposal, but you must have access to this data already, and present a clear proposal for what you would do with it.

Project proposal format: Proposals should be one page maximum. Include the following information:


Project Milestone

A project milestone should be submitted by November 11th at 9:00am via Catalyst. Your write-up should be 3 pages maximum in NIPS format, not including references (the templates are for LaTeX; if you want to use other editors/options, please get as close to the same format as you can). Describe the results of your first experiments here. Note that, as with any conference, the page limit is strict! Papers over the limit will not be considered.


Poster Session

We will hold a poster session on December 4th from 3:00-5:00pm in the Atrium of the Paul Allen Center. Each team will be given a stand to present a poster summarizing the project motivation, methodology, and results. The poster session will give you a chance to show off the hard work you put into your project, and to learn about the projects of your peers.

Here are some details on the poster format:


Project Report

Your final submission is a project report, due December 9th at 9:00am via Catalyst. Your write-up should be 8 pages maximum in NIPS format, not including references (the templates are for LaTeX; if you want to use other editors/options, please get as close to the same format as you can). Describe the task you solved, your approach, the algorithms, the results, and the conclusions of your analysis. Note that, as with any conference, the page limit is strict! Papers over the limit will not be considered.


Project Ideas

The course staff has outlined several potential project ideas below. This should give you a sense of the datasets available and an appropriate scope for your project. You can either pick one of these or come up with something of your own to work on.

Netflix Challenge

From 2006 to 2009, Netflix sponsored a competition to improve its movie recommendation system. Explore ideas for improving upon basic matrix factorization, the technique most widely used in the competition. Some possibilities include incorporating meta-information or trying local matrix factorization (see links).
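
The basic technique can be sketched as follows: approximate the ratings matrix as a product of low-rank user and item factors, fit by stochastic gradient descent on the observed entries only. This is a minimal illustration on a made-up 4x4 ratings matrix, not the competition-scale pipeline.

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.05, reg=0.05, epochs=400, seed=0):
    """Fit R ~= U @ V.T on observed entries (mask=True) by SGD on squared error."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    observed = list(zip(*np.nonzero(mask)))
    for _ in range(epochs):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])   # gradient step with L2 shrinkage
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V

# toy data: users 0-1 like items 0-1, users 2-3 like items 2-3
R = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [1, 2, 4, 5]], dtype=float)
mask = np.ones_like(R, dtype=bool)
mask[0, 3] = False            # hold out one rating (true value: 1)

U, V = factorize(R, mask)
pred = U @ V.T                # predictions for every (user, item) pair, including the held-out one
```

The held-out entry is predicted from the learned factors alone, which is exactly the recommendation setting; project extensions would add per-user/per-item bias terms or the side information mentioned above.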

fMRI Brain Imaging

Brain scans were taken of a subject performing a word-reading task. We want to predict which word the participant is reading based on the activation patterns in their brain. To do this, we have 218 semantic features for each word in our dictionary (each feature is a 1-5 rating answering a question such as "Is it an animal?"). We can therefore use the fMRI image to predict the semantic features of the word, and then use our dictionary to find our best guess as to which word it is. In this way, we can predict words without ever having seen them in our training set.

Digit Recognition

Implement handwriting recognition by classifying pictures (stored as pixel data) as the appropriate digit. This project is based on a tutorial ML competition hosted on Kaggle.
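
A sensible first baseline is nearest-neighbor classification on raw pixels. The sketch below uses synthetic "digit" images (a template per class plus noise) rather than the real competition data, just to show the shape of the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for pixel data: each class is a template plus pixel noise
n_classes, n_per, d = 10, 20, 64              # e.g., 8x8 grayscale images flattened
templates = rng.uniform(0, 1, (n_classes, d))
X = np.repeat(templates, n_per, axis=0) + 0.1 * rng.standard_normal((n_classes * n_per, d))
y = np.repeat(np.arange(n_classes), n_per)

# shuffle, split, and classify each test image by its single nearest training image
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

dists = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
pred = y_tr[dists.argmin(axis=1)]
accuracy = float((pred == y_te).mean())
```

On real handwritten digits, a 1-NN baseline like this already performs surprisingly well, which makes it a good yardstick before trying fancier classifiers.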

Job Salary Prediction

This is another task taken from a Kaggle competition. Given an advertisement for a job opening, the goal is to predict the starting salary for the job being posted. Much of the data about the ads is unstructured text (like the ad content itself), but some structured data is given as well. A tree of the geographic relationships between the job locations is also provided. This task is similar to the running example in lecture of predicting starting salary, and has real-world usefulness to the company that posted the problem.
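The simplest way to use the unstructured ad text is a bag-of-words representation fed to a linear model. The six tiny "ads" and salaries below are entirely made up for illustration; the real data has thousands of ads and a much larger vocabulary.

```python
import numpy as np

# hypothetical mini-dataset: predict salary (in thousands) from ad text
ads = ["senior engineer london", "junior engineer leeds",
       "senior manager london", "junior analyst leeds",
       "senior engineer leeds", "junior engineer london"]
salaries = np.array([65.0, 35.0, 70.0, 30.0, 60.0, 40.0])

# bag-of-words: one count feature per vocabulary word
vocab = sorted({word for ad in ads for word in ad.split()})
X = np.array([[ad.split().count(word) for word in vocab] for ad in ads], dtype=float)

# ridge regression on the word counts (ridge keeps the redundant features stable)
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(len(vocab)), X.T @ salaries)

def predict(ad):
    x = np.array([ad.split().count(v) for v in vocab], dtype=float)
    return float(x @ w)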


Face Recognition

The goal of this task is to learn how to recognize faces. We have a set of pictures of 20 people in various orientations and expressions, some wearing sunglasses. One major problem with image data is that our input features are individual pixels, which are high-dimensional but not terribly meaningful in isolation. Using PCA, we can decompose our images into eigenvectors, which are linear combinations of pixels (nicknamed "eigenfaces"). Students can explore different classification tasks, from detecting the presence of sunglasses to identifying individuals.
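
The eigenface computation itself is just PCA via an SVD of the centered image matrix. The sketch below runs on synthetic "images" generated from a few latent factors (standing in for real face photos); the rows of Vt are the eigenfaces, and projecting onto the top few gives a compact representation suitable for the classification tasks above.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in: 40 "images" of 8x8 = 64 pixels driven by 3 latent factors
n, d, k_true = 40, 64, 3
latent = rng.standard_normal((n, k_true))
basis = rng.standard_normal((k_true, d))
images = latent @ basis + 0.05 * rng.standard_normal((n, d))

# PCA: center the data, then take the SVD; rows of Vt are the "eigenfaces"
mean_face = images.mean(axis=0)
U, S, Vt = np.linalg.svd(images - mean_face, full_matrices=False)

k = 3
eigenfaces = Vt[:k]                                  # top-k principal directions
codes = (images - mean_face) @ eigenfaces.T          # 64 pixels -> k numbers per image
recon = mean_face + codes @ eigenfaces               # images rebuilt from k numbers

# fraction of total variance captured by the top-k components
explained = float((S[:k] ** 2).sum() / (S ** 2).sum())
```

A classifier (sunglasses vs. not, or person identity) would then be trained on the low-dimensional `codes` instead of raw pixels.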

Student Performance Prediction

Predict student performance on mathematical problems from logs of student interaction, building a model that deals with the challenges of sparsity, temporality, and selection bias. This task was the subject of the KDD Cup 2010.

Noisy Data

Explore solutions to high dimensionality and irrelevant features. This dataset contains features from a bag-of-words model for Reuters news articles. There are 20,000 features, half of which were added solely to make learning harder. Determine whether each article is about a particular topic.
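
One standard attack on irrelevant features is L1 regularization, which drives useless weights exactly to zero. This is a minimal sketch on synthetic data (50 features, only the first 5 informative, standing in for the 20,000-feature bag-of-words problem), using proximal gradient descent for L1-regularized logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in: 200 "articles", 50 features, only the first 5 matter
n, d, d_rel = 200, 50, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:d_rel] = [3, -3, 2, -2, 2]
y = (X @ w_true + 0.1 * rng.standard_normal(n) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# L1-regularized logistic regression by proximal gradient (ISTA):
# a gradient step on the logistic loss, then soft-thresholding for the L1 term
w = np.zeros(d)
lr, lam = 0.1, 0.05
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w = w - lr * grad
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-threshold

selected = np.nonzero(np.abs(w) > 1e-6)[0]   # surviving (hopefully relevant) features
```

The soft-threshold step is what produces exact zeros, so the model performs feature selection as a side effect of training.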

Coordinate Descent

Examine ways to choose which coordinate to update intelligently, rather than cyclically or at random. One possibility is clustering. This project could also present some theory to motivate the approach.
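
To make the selection question concrete, here is coordinate descent on a least-squares objective with two rules: cyclic, and greedy largest-gradient selection (the Gauss-Southwell rule). The problem data is random; a real project would study when the extra cost of smart selection pays off (recomputing the full gradient each step, as done here for clarity, is what clustering or other tricks would try to avoid).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def coord_descent(X, y, steps, pick):
    """Minimize ||Xw - y||^2, exactly minimizing one coordinate per step."""
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for t in range(steps):
        grad = X.T @ (X @ w - y)        # full gradient, recomputed for simplicity
        j = pick(t, grad)               # the selection rule under study
        w[j] -= grad[j] / col_sq[j]     # exact 1-D minimization along coordinate j
    return w

# cyclic rule vs. greedy Gauss-Southwell rule (largest-magnitude gradient entry)
w_cyc = coord_descent(X, y, 1000, lambda t, g: t % X.shape[1])
w_gs = coord_descent(X, y, 1000, lambda t, g: int(np.argmax(np.abs(g))))
```

Greedy selection often makes more progress per update when gradient magnitudes are very uneven; theory quantifying that gap would be a natural component of this project.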

Ensemble Learning

For the Yelp competition, train multiple models to predict the number of upvotes a review will receive. Combine the models into a single prediction using ensemble methods, and compare your results on the public leaderboard.
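
Two common ways to combine models are simple averaging and stacking (fitting a meta-model on the base models' predictions). The sketch below uses made-up regression data and deliberately weak base models, each seeing only a subset of features; in practice the blend weights should be fit on a split separate from the one used for the final evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up regression task standing in for upvote prediction
n, d = 300, 5
X = rng.standard_normal((n, d))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.5 * rng.standard_normal(n)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

def fit_lstsq(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# three weak base models: each is least squares on a different feature subset
subsets = [[0, 1], [1, 2], [0, 3, 4]]
models = [(s, fit_lstsq(X_tr[:, s], y_tr)) for s in subsets]
preds = np.column_stack([X_va[:, s] @ w for s, w in models])

# ensemble 1: uniform average; ensemble 2: stacking via a least-squares blend
# (note: blending on the evaluation split here only to keep the sketch short)
avg = preds.mean(axis=1)
blend_w, *_ = np.linalg.lstsq(preds, y_va, rcond=None)
stacked = preds @ blend_w

def mse(p):
    return float(np.mean((p - y_va) ** 2))
```

By construction the stacked blend can do no worse on its fitting split than any single base model, which is the basic appeal of ensembling complementary models.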

Cost-sensitive Classification

Develop algorithms for cost-sensitive classification, where misclassifying some classes is more costly than misclassifying others.
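
A useful starting point is the Bayes decision rule for a given cost matrix: instead of predicting the most probable class, predict the class with minimum expected cost. The cost matrix and probabilities below are hypothetical numbers chosen to show the rule changing a decision.

```python
import numpy as np

# hypothetical cost matrix C[true, predicted]: missing class 1 is 10x worse
C = np.array([[0.0, 1.0],
              [10.0, 0.0]])

# predicted class probabilities for three examples (rows sum to 1)
P = np.array([[0.9, 0.1],
              [0.7, 0.3],
              [0.95, 0.05]])

# Bayes rule: predict argmin_j of sum_i P(i) * C[i, j], i.e. minimum expected cost
expected_cost = P @ C                 # shape (n_examples, n_classes)
pred = expected_cost.argmin(axis=1)   # -> [1, 1, 0]: even 10% risk of class 1 flips the call
```

Plain argmax over `P` would predict class 0 for all three examples; the asymmetric costs flip two of them, and a project could go further by training models that optimize such costs directly.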

Class Imbalance

Explore ways of dealing with class imbalance (e.g., optimizing for precision/recall or AUC). This issue showed up in HW2, where the 'solution' given was too naive.
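
One simple baseline to beat is class reweighting: scale each example's loss so the minority class contributes as much as the majority class. The sketch below compares plain and class-weighted logistic regression (fit by gradient descent) on synthetic imbalanced data, 5% positives.

```python
import numpy as np

rng = np.random.default_rng(0)

# imbalanced toy problem: 950 negatives around (0,0), 50 positives around (1.5,1.5)
n_neg, n_pos = 950, 50
X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 2)), rng.normal(1.5, 1.0, (n_pos, 2))])
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logreg(X, y, sample_w, lr=0.3, steps=3000):
    """Logistic regression with per-example loss weights, by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (sample_w * (sigmoid(Xb @ w) - y)) / sample_w.sum()
        w -= lr * grad
    return lambda Z: sigmoid(np.hstack([Z, np.ones((len(Z), 1))]) @ w)

plain = fit_logreg(X, y, np.ones(len(y)))
# balanced weights: each class contributes equally to the total loss
w_bal = np.where(y == 1, len(y) / (2 * n_pos), len(y) / (2 * n_neg))
balanced = fit_logreg(X, y, w_bal)

def recall(model):
    return float(((model(X) > 0.5) & (y == 1)).sum() / n_pos)
```

The unweighted model buys accuracy by ignoring the rare class; reweighting recovers much of the minority recall. A project would compare this baseline against resampling and against directly optimizing precision/recall or AUC.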