- Mon., October 24 at 5pm: Project Proposals
- Mon., November 14 at 5pm: Project Milestone
- Thu., December 8, 9-11:30am: Poster Session
- Thu., December 15, at 10am: Project Report
Your Course Project
Your class project is an opportunity for you to explore an interesting machine learning problem in the context of a real-world data set. You are also free to explore theoretical and algorithmic ideas, though you must have a data component. We will provide some seed project ideas. You can pick one of these ideas and explore the data and algorithms within and beyond what we suggest. You can also use your own data/ideas, but in that case you must make sure the data is available now and present a clear roadmap, since a quarter is too short to explore a brand-new concept.
Projects can be done individually or in teams of two students. You can discuss your ideas and approach with the instructor/TAs, but the final responsibility to define and execute an interesting piece of work is yours.
The final project is worth 20% of your course grade, which will be split among three deliverables:
- A project milestone (10% of the project grade).
- A project poster presentation (20% of the project grade).
- A final report (70% of the project grade).
Your project will be evaluated by the following criteria:
- Technical Depth: How technically challenging was what you did?
- Programming Depth: Did you use a package or write your own code? It is fine if you use a package, though this means other aspects of your project must be more ambitious.
- Data Depth: How challenging was dealing with the data set that you used?
- Scope: How broad was your project? How many aspects, angles, variations did you explore?
- Presentation: How well did you explain what you did, present your results, and interpret the outcomes? Did you use good graphs and visualizations? How clear was the writing?
You must turn in a project proposal through Canvas.
Read the list of available data sets and potential project ideas below. If you prefer to use a different data set, we will consider your proposal, but you must have access to this data already, and present a clear proposal for what you would do with it.
Project proposal format: Proposals should be one page maximum. Include the following information:
- Project title
- Data set
- Project idea. This should be approximately two paragraphs.
- Software you will need to write.
- Papers to read. Include 1-3 relevant papers. If you are doing something different from one of the suggested projects, you will probably want to read at least one of them before submitting your proposal.
- Teammate: will you have a teammate? If so, whom? Maximum team size is two students. One proposal per team.
- Milestone: What will you complete by the milestone? Experimental results of some kind are expected here.
A project milestone should be submitted on Canvas. Your write-up should be 3 pages maximum in NIPS format, not including references (the templates are for LaTeX; if you want to use other editors/options, please try to get close to the same format). You should describe the results of your first experiments here. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
We will hold a poster session in the Atrium of the Paul Allen Center. Each team will be given a stand to present a poster summarizing the project motivation, methodology, and results. The poster session will give you a chance to show off the hard work you put into your project, and to learn about the projects of your peers.
Here are some details on the poster format:
- We will provide poster boards that are 32x40.
- Either one large poster (recommended) or several pinned pages is fine.
Your final submission will be a project report on Canvas. Your write-up should be 8 pages maximum in NIPS format, not including references (the templates are for LaTeX; if you want to use other editors/options, please try to get close to the same format). You should describe the task you solved, your approach, the algorithms, the results, and the conclusions of your analysis. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
The course staff has outlined several potential project ideas below. This should give you a sense of the datasets available and an appropriate scope for your project. You can either pick one of these or come up with something of your own to work on.
Structured Handwritten Digit Recognition
An ongoing project in the department aims to accurately classify structured handwritten digits (written over a box resembling a seven-segment display). The box reduces the variance between representations of the same digit, improving the robustness of classification. The current algorithm achieves ~88% accuracy, and your goal is to improve on this using various machine learning techniques. For more information, see the background below.
- Task: Classify structured handwritten digits.
- Background: Structured Handwritten Digit Recognition Background
As this dataset is heavily used in class, it is recommended that you try this only if you find novel questions.
fMRI Brain Imaging
Brain scans were taken of a subject in the process of a word reading task. We want to be able to predict what word the participant is reading based on the activation patterns in their brain. To do this, we have 218 semantic features for each word in our dictionary (where each feature is a rating from 1-5 answering a question such as "Is it an animal?"). Thus, we can use the fMRI image to predict the semantic features of the word, and then use our dictionary to find our best guess as to which word it is. In this way, we can predict words without ever having seen them in our training set.
- Data: fmri.zip (see 3.3.3 of this homework from last quarter's Big Data class for a description of the dataset)
- Task: Given an image and two candidate words, predict which of those words was being read by the subject.
- Background: CMU Background
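As a sketch of one possible baseline (the linear model, sizes, and synthetic data below are all assumptions, not the actual fMRI pipeline): regress from voxel activations to the semantic features with ridge regression, then pick whichever candidate word's semantic vector lies closer to the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 training images, 50 voxels, 218 semantic features.
n_train, n_voxels, n_sem = 200, 50, 218

# Synthetic stand-in: assume a noisy linear map from voxels to semantics.
W_true = rng.normal(size=(n_voxels, n_sem))
X_train = rng.normal(size=(n_train, n_voxels))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_sem))

# Ridge regression: one linear predictor per semantic feature, solved jointly.
lam = 1.0
W_hat = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n_voxels),
                        X_train.T @ Y_train)

def predict_word(x_image, sem_a, sem_b):
    """Return 0 if candidate word A better matches the image, else 1."""
    y_hat = x_image @ W_hat  # predicted semantic features for this image
    d_a = np.linalg.norm(y_hat - sem_a)
    d_b = np.linalg.norm(y_hat - sem_b)
    return 0 if d_a <= d_b else 1

# An image generated from word A should be matched to A, not to a random B.
x_new = rng.normal(size=n_voxels)
sem_a = x_new @ W_true          # word A's semantic vector (what the image encodes)
sem_b = rng.normal(size=n_sem)  # an unrelated candidate word
choice = predict_word(x_new, sem_a, sem_b)
```

Because the classifier only compares distances in semantic-feature space, it can choose between words that never appeared in training, which is the point of the task.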
This task is taken from a Kaggle competition. Given an advertisement for a job opening, the goal is to predict the starting salary for the job being posted. Much of the data about the ads is unstructured text (like the ad content itself), but some structured data is given as well. A tree of the geographic relationships between the job locations is also provided. This task is similar to the running example in lecture of predicting starting salary, and has real-world usefulness to the company that posted the problem.
- Data: Kaggle Dataset
- Task: Predict a salary from a job posting
- Background: Kaggle Description
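A minimal bag-of-words baseline for text-to-salary regression might look like the following sketch; the tiny ads and salaries are invented stand-ins for the Kaggle data, and real ads would need far richer features.

```python
import numpy as np

# Tiny synthetic stand-in for the job-ad text (real ads are long free text).
ads = [
    "senior software engineer london",
    "junior software engineer london",
    "senior nurse manchester",
    "junior nurse manchester",
]
salaries = np.array([70000.0, 40000.0, 45000.0, 30000.0])

# Bag-of-words: one binary feature per vocabulary word.
vocab = sorted({w for ad in ads for w in ad.split()})
word_idx = {w: i for i, w in enumerate(vocab)}

def featurize(ad):
    x = np.zeros(len(vocab))
    for w in ad.split():
        if w in word_idx:
            x[word_idx[w]] = 1.0
    return x

X = np.array([featurize(ad) for ad in ads])

# Ridge regression keeps the underdetermined system well posed.
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(len(vocab)), X.T @ salaries)

pred = featurize("senior software engineer london") @ w
```

From here a project could add the structured fields, the location tree, or n-gram features, and compare against this baseline.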
The goal of this task is to learn how to recognize faces. We have a set of pictures of 20 people in various directions and expressions, some of which have sunglasses. One major problem with image data is that our input features are individual pixels, which are high-dimensional but not terribly meaningful in isolation. Using PCA, we can decompose our images into eigenvectors, which are linear combinations of pixels (nicknamed "eigenfaces"). Students can explore different classification tasks, from determining the presence of sunglasses to identifying individuals.
- Data: Faces Directory
- Task: Classify images of faces
- Background: .PGM format specification
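The eigenface idea can be sketched in a few lines of NumPy; the low-rank synthetic "images" below are a stand-in for the real faces directory, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for face images: 60 images of 32x30 = 960 pixels,
# generated from 5 latent factors plus noise to mimic low-rank structure.
n_images, n_pixels, n_factors = 60, 960, 5
basis = rng.normal(size=(n_factors, n_pixels))
coeffs = rng.normal(size=(n_images, n_factors))
images = coeffs @ basis + 0.05 * rng.normal(size=(n_images, n_pixels))

# PCA: center the data, then top right singular vectors are the "eigenfaces".
mean_face = images.mean(axis=0)
centered = images - mean_face
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 5
eigenfaces = Vt[:k]                    # each row is one eigenface
projections = centered @ eigenfaces.T  # k-dimensional code per image

# Reconstruction from k components captures most of the variance.
reconstructed = projections @ eigenfaces + mean_face
explained = 1 - np.sum((images - reconstructed) ** 2) / np.sum(centered ** 2)
```

The k-dimensional `projections` (rather than raw pixels) would then be fed to any classifier, e.g. for the sunglasses or identity tasks.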
Predict student performance on mathematical problems from logs of student interaction. Create a model that deals with challenges of sparsity, temporality, and selection bias. This project was the KDD Cup 2010.
- Data: KDD Cup 2010
- Task: Predict student performance
- Background: KDD Cup 2010
Explore solutions to high dimensionality and irrelevant features. This dataset contains features from a bag-of-words model for Reuters news articles. There are 20,000 features, half of which were added solely to make learning harder. Determine whether each article is about a particular topic.
- Data: Dexter dataset
- Task: Classify documents into topics.
- Background: Dexter description
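One standard way to handle the irrelevant features is L1-regularized logistic regression, which drives useless weights exactly to zero. Below is a sketch on synthetic data; the sizes, the proximal-gradient (ISTA) solver, and all constants are illustrative assumptions, not part of the Dexter setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Dexter: 1000 documents, 100 features, only the
# first 3 features carry label signal; the rest are pure noise.
n, d, d_useful = 1000, 100, 3
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:d_useful] = 3.0
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# L1-regularized logistic regression via proximal gradient (ISTA).
lam, step = 0.08, 0.5
w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / n
    w = w - step * grad
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

selected = np.flatnonzero(np.abs(w) > 1e-6)  # surviving features
```

On Dexter itself, one could compare the selected features against the known 50/50 split of real and probe features.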
For coordinate descent methods, examine ways to intelligently choose which coordinates to update. One possibility is clustering. This project could potentially present some theory to motivate an approach.
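A minimal sketch of the idea on least squares, comparing the usual cyclic order against the greedy Gauss-Southwell rule (update the coordinate with the largest gradient magnitude); the data and the two rules here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Least-squares objective f(w) = 0.5 * ||X w - y||^2 on synthetic data.
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

loss = lambda w: 0.5 * np.sum((X @ w - y) ** 2)

def cd(select, iters=2000):
    """Coordinate descent; `select(t, grad)` picks the coordinate to update."""
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)    # per-coordinate curvature
    for t in range(iters):
        grad = X.T @ (X @ w - y)
        j = select(t, grad)
        w[j] -= grad[j] / col_sq[j]  # exact minimization along coordinate j
        # Note: recomputing the full gradient each step is for clarity only;
        # efficient implementations maintain the residual incrementally.
    return w

cyclic = lambda t, grad: t % d                          # round-robin order
greedy = lambda t, grad: int(np.argmax(np.abs(grad)))   # Gauss-Southwell rule

loss_cyclic = loss(cd(cyclic))
loss_greedy = loss(cd(greedy))
```

A project could go further: cluster correlated coordinates and update cluster representatives, and analyze when a smart rule provably beats cyclic or random selection.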
Ensemble Learning
For the Yelp competition, train multiple models to predict the number of upvotes for a review. Combine the models into a single prediction using ensemble methods, and compare your results on the public leaderboard.
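A simple starting point is uniform averaging, which for squared error is guaranteed (by convexity) to do no worse than the average of its members. Below is a sketch with two ridge models on synthetic data standing in for the Yelp features; the models and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression stand-in for predicting review upvotes.
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.5 * rng.normal(size=n)
X_test = rng.normal(size=(n, d))
y_test = X_test @ w_star + 0.5 * rng.normal(size=n)

def ridge_fit(lam):
    """Ridge regression weights for penalty lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Two base models with very different regularization strengths.
preds = [X_test @ ridge_fit(lam) for lam in (0.1, 100.0)]
ensemble = np.mean(preds, axis=0)   # uniform average of predictions

mse = lambda p: np.mean((p - y_test) ** 2)
mse_members = [mse(p) for p in preds]
mse_ensemble = mse(ensemble)
```

A real entry would mix genuinely different model families (trees, linear models, nearest neighbors) and learn the combination weights by stacking on held-out data.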
Cost-sensitive Classification
Develop algorithms for cost-sensitive classification, where the cost of misclassifying some classes is different from that of others.
- Data: Spam Dataset or any dataset where there is a cost-sensitive task. Spam and health datasets are the traditional examples.
- Background: IJCAI'01
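Given class-probability estimates, the basic cost-sensitive decision rule just minimizes expected cost under a cost matrix. A sketch with made-up spam/ham costs (the numbers are illustrative, not from any dataset):

```python
import numpy as np

# Cost matrix C[true, predicted]: missing a spam message is cheap, but
# flagging real mail as spam is expensive -- hypothetical costs.
C = np.array([[0.0, 10.0],   # true ham:  correct, or costly false alarm
              [1.0,  0.0]])  # true spam: cheap miss, or correct

def cost_sensitive_predict(p_spam):
    """Pick the class minimizing expected cost, given P(spam | x)."""
    p = np.array([1.0 - p_spam, p_spam])  # [P(ham), P(spam)]
    expected = p @ C                      # expected cost of each prediction
    return int(np.argmin(expected))       # 0 = ham, 1 = spam

# With these costs the optimal threshold moves from 0.5 up to
# C[0,1] / (C[0,1] + C[1,0]) = 10/11, so only very confident spam is flagged.
```

A project would replace the oracle probability with a learned classifier and study how miscalibration interacts with the cost matrix.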
Explore ways of dealing with class imbalance (e.g., optimizing for precision/recall or AUC). This is something that showed up in HW2; the 'solution' there is too naive.
- Data and background: This link has many datasets where imbalance is present.
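One baseline that is already less naive than a fixed cutoff: tune the decision threshold on validation data to maximize F1. A sketch on an imbalanced synthetic set (the score model and class rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced synthetic validation set: ~5% positives, with classifier
# scores that separate the classes only partially.
n = 2000
y = (rng.random(n) < 0.05).astype(int)
scores = rng.normal(loc=y * 1.5, scale=1.0)  # positives score higher on average

def f1_at(th):
    """F1 score when predicting positive for scores >= th."""
    pred = (scores >= th).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Sweep candidate thresholds instead of defaulting to the score midpoint.
thresholds = np.linspace(-2, 4, 121)
f1s = np.array([f1_at(t) for t in thresholds])
best_th = thresholds[np.argmax(f1s)]
```

More ambitious directions include reweighted losses, resampling schemes, and directly optimizing AUC surrogates, compared against this threshold-tuning baseline.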
Creating good features is very important in machine learning tasks. Random features and Nyström methods are two approaches (the viewpoint is to approximate the kernel matrix).
- Data and background: This paper and code are a good place to start. You are free to explore theory and experimental components here. You can look at this method, Nystrom methods, or cook up other random feature generation methods (like using random logistic functions). You are also free to try deterministic methods or just be creative. Many datasets are applicable here.
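A sketch of the random-features idea in the Rahimi-Recht style for the RBF kernel exp(-gamma * ||x - z||^2); the data, dimension, and parameter choices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, n_features=5000, gamma=0.5):
    """Random Fourier features whose inner products approximate the
    RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    d = X.shape[1]
    # Frequencies sampled from the kernel's Fourier transform (a Gaussian).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(10, 3))
Z = rff(X)

# Compare feature inner products against the exact kernel matrix.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sq_dists)
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
```

Training a linear model on `Z` then approximates kernel ridge regression or kernel SVM at a fraction of the cost; Nyström methods, or your own random feature constructions, slot into the same template.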
There are a few modern optimization algorithms appropriate for sums of convex functions (relevant to many machine learning applications). SVRG and SDCA are two notable examples. Explore how well these work in practice and the theory behind them.
- Data and background: These two papers (on variance reduction SVRG and dual coordinate ascent SDCA) provide very nice algorithms. You are free to explore theory and experimental components here (and can use any of the labeled datasets above or use mnist).
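A sketch of SVRG on a least-squares objective, which is exactly a sum of convex functions; the step size, epoch count, and data are illustrative choices, not values from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(w) = (1/n) sum_i 0.5 * (x_i . w - y_i)^2  -- a finite sum of convex terms.
n, d = 500, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

def full_grad(w):
    return X.T @ (X @ w - y) / n

def svrg(epochs=50, step=0.01):
    w = np.zeros(d)
    for _ in range(epochs):
        w_ref = w.copy()
        mu = full_grad(w_ref)                      # full gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = X[i] * (X[i] @ w - y[i])          # stochastic grad at w
            gi_ref = X[i] * (X[i] @ w_ref - y[i])  # same sample at snapshot
            w = w - step * (gi - gi_ref + mu)      # variance-reduced step
    return w

w_hat = svrg()
final_loss = 0.5 * np.mean((X @ w_hat - y) ** 2)
```

Comparing this against plain SGD (drop `gi_ref` and `mu`) on the same data makes the variance-reduction effect visible, and the labeled datasets above or MNIST give realistic testbeds.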
Explore ways of making vector representations of words. Two good places to start are listed below.
- Data and background: These papers (paper by Stratos, Kim, Collins, and Hsu and paper by Stratos, Collins, and Hsu) and this code (link link link) are interesting places to start. This would be an interesting project for anyone with NLP interests or unsupervised learning interests.
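One classical spectral construction, loosely in the same family as these papers: build a windowed co-occurrence matrix, apply a positive-PMI transform, and take a truncated SVD. A toy-corpus sketch follows; the corpus, window size, and dimension are arbitrary.

```python
import numpy as np

# A tiny corpus; a real project would use millions of tokens.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat and a dog played . the dog chased the cat .").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Symmetric co-occurrence counts within a +-2 word window.
window = 2
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1.0

# Positive PMI transform, then a truncated SVD gives low-dim word vectors.
total = counts.sum()
p_w = counts.sum(axis=1) / total
pmi = np.log(np.maximum(counts / total, 1e-12) / np.outer(p_w, p_w))
ppmi = np.maximum(pmi, 0.0)
U, S, Vt = np.linalg.svd(ppmi)
k = 5
vectors = U[:, :k] * S[:k]   # one k-dimensional vector per vocabulary word

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

sim_cat_dog = cosine(vectors[idx["cat"]], vectors[idx["dog"]])
```

On real data, similarities and analogies computed from such vectors can be compared against the spectral methods in the linked papers.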
Explore ways of doing structured prediction, i.e., prediction of sequences.
- Data and background: This link to CRF code can be helpful for exploring structured prediction and named entity recognition. You can explore the importance of features or other ideas/models.
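The core inference step in such sequence models is Viterbi decoding of the best tag sequence. A sketch with hand-made log-potentials (a CRF would learn these scores from features; the tags and numbers here are invented):

```python
import numpy as np

states = ["O", "NAME"]   # toy tags: outside vs. part-of-a-name
# Log-potentials: emission[t, s] scores tag s at position t, and
# transition[s, s2] scores moving from tag s to tag s2.
emission = np.array([[2.0, 0.0],    # token 0 looks like "O"
                     [0.0, 2.0],    # token 1 looks like "NAME"
                     [0.0, 1.5],    # token 2 leans "NAME"
                     [2.0, 0.0]])   # token 3 looks like "O"
transition = np.array([[0.5, 0.0],
                       [0.0, 0.5]]) # staying in the same tag is favored

def viterbi(emission, transition):
    """Highest-scoring tag sequence under the given log-potentials."""
    T, S = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition        # scores of all prev->next pairs
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + emission[t]
    path = [int(np.argmax(score))]                # best final state
    for t in range(T - 1, 0, -1):                 # follow back-pointers
        path.append(back[t, path[-1]])
    return path[::-1]

tags = [states[s] for s in viterbi(emission, transition)]
```

In a CRF project the emission scores come from feature weights, and this same dynamic program does both decoding and (with sums instead of maxes) the marginals needed for training.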
Providing valid confidence intervals and avoiding overfitting are increasingly important questions. In most applications, algorithms do in fact adapt their behavior after tests on the holdout set; even in competitions, people respond to the "leader board" adaptively.
- Data and background: These papers (paper and paper) and code are good places to start. You are free to be creative in your questions.
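One of the ideas in this line of work is the "reusable holdout" (Thresholdout): answer a query from the training set when it agrees with the holdout, and otherwise answer from the holdout with added noise. A simplified single-query sketch, with arbitrary threshold and noise settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01):
    """Reusable-holdout style answer to one query.
    train_vals / holdout_vals: per-example values of the query statistic
    (e.g., 0/1 correctness of a candidate model)."""
    t_mean = np.mean(train_vals)
    h_mean = np.mean(holdout_vals)
    if abs(t_mean - h_mean) < threshold + rng.normal(0, sigma):
        return t_mean                      # training answer is trustworthy
    return h_mean + rng.normal(0, sigma)   # otherwise: noisy holdout answer

# Example: train and holdout agree, so roughly the training mean comes back.
train = np.full(100, 0.70)
hold = np.full(100, 0.69)
ans = thresholdout(train, hold)
```

A project could simulate an adaptive analyst issuing many queries and measure how much longer the holdout stays honest under this mechanism than under naive reuse.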
Recurrent neural nets and LSTMs are interesting. It is not at all obvious what is going on here, so there is a host of open-ended questions. Even an exploratory project comparing a few ideas/methods could make an interesting project.
- Some interesting blog posts: Recurrent neural nets and n-gram models