CSE 546, Autumn 2017

Project Ideas

fMRI Brain Imaging

Brain scans were taken of a subject in the process of a word reading task. We want to be able to predict what word the participant is reading based off of the activation patterns in their brain. To do this, we have 218 semantic features for each word in our dictionary (where each feature is a rating from 1-5 answering a question such as "Is it an animal?"). Thus, we can use the fMRI image to predict the semantic features of the word, and then use our dictionary to find our best guess as to which word it is. In this way, we can predict words without ever having seen them in our training set.

Data: fmri.zip (See 3.3.3 of this homework from last quarter's Big Data class for a description of the dataset
Task: Given an image and two candidate words, predict which of those words was being read by the subject.
Background: CMU Background

Job Salary Prediction

This is another task taken from a Kaggle competition. Given an advertisement for a job opening, the goal is to predict the starting salary for the job being posted. Much of the data about the ads is unstructured text (like the ad content itself), but some structured data is given as well. A tree of the geographic relationships between the job locations is also provided. This task is similar to the running example in lecture of predicting starting salary, and has real-world usefulness to the company that posted the problem.

Data: Kaggle Dataset
Task: Predict a salary from a job posting
Background: Kaggle Description

Eigenfaces

The goal of this task is to learn how to recognize faces. We have a set of pictures of 20 people in various directions and expressions, some of which have sunglasses. One major problem with image data is that our input features are individual pixels, which are high-dimensional but not terribly meaningful in isolation. Using PCA, we can decompose our images into eigenvectors, which are linear combinations of pixels (nicknamed "eigenfaces"). Students can explore different classification tasks, from determining the presence of sunglasses to identifying individuals.

Data: Faces Directory
Task: Classify images of faces
Background: .PGM format specification

Student Performance Prediction

Predict student performance on mathematical problems from logs of student interaction. Create a model that deals with challenges of sparsity, temporality, and selection bias. This project was the KDD Cup 2010.

Data: Kdd cup 2010
Task: Predict student performance
Background: Kdd cup 2010

Noisy data

Explore solutions to high dimensionality and irrelevant features. This dataset contains features from a bag-of-words model for Reuters news articles. There are 20,000 features, half of which were added solely to make learning harder. Determine whether each article is about a particular topic.

Data: Dexter dataset
Task: Classify documents into topics.
Background: Dexter description

Coordinate Descent

Examine ways to smartly choose coordinates to update. One possibility is clustering. This project could potentially present some theory to motivate an approach.

Data: You'd have to get a dataset.
Task: Speed up coordinate descent.
Background: JMLR NIPS

Ensemble Learning

For Yelp competition, train multiple models to predict the number of upvotes for a review. Combine the models into a single prediction using ensemble methods, and compare your results on the public leaderboard.

Cost-sensitive Classification

Develop algorithms for Cost-sensitive classification, where the cost of misclassifying certain classes is different from other classes.

Data: Spam Dataset or any dataset where there is a cost-sensitive task. Spam and health datasets are the traditional examples.
Background: IJCAI'01

Class Imbalance

Exploring ways of dealing with class imbalance (optimizing for precision/recall or auc). This is something that showed up in hw2, the 'solution' there is too naive.

Data and background: This link has many datasets where imbalance is present.

Random Features and Feature Generation

It is very important to create features in machine learning tasks. Random features and Nystrom methods are one method (the viewpoint is to approximate the kernel matrix).

Data and background: This paper and code are a good place to start. You are free to explore theory and experimental components here. You can look at this method, Nystrom methods, or cook up other random feature generation methods (like using random logistic functions). You are also free to try deterministic methods or just be creative. Many datasets are applicable here.

Optimization

There are a few modern optimization algorithms appropriate for sums of convex functions (relevant for many machine learning applications). SVRG and SDCA are two notable examples. Explore how well these work and the theory.

Data and background: These two papers (on variance reduction SVRG and dual coordinate ascent SDCA) provide very nice algorithms. You are free to explore theory and experimental components here (and can use any of the labeled datasets above or use mnist).

Word Embeddings

Explore ways of making vector representations of words. Two good places to start are:.

Data and background: These papers (paper by Stratos, Kim, Collins, and Hsu and paper by Stratos, Collins, and Hsu) and this code (link link link) are interesting places to start. This would be an interesting project for anyone with NLP interests or unsupervised learning interests.

Structured Prediction

Exploring ways of doing structured prediction, i.e. prediction of sequences.

Data and background: This link to CRF code can be helpful for exploring structured prediction and named entity recognition. You can explore the importance of features or other ideas/models.

Adaptive Data Analysis and Leaderboards

Providing valid confidence intervals and avoiding overfitting are increasingly important questions. In most applications, algorithms do in fact adapt their behavior after tests on the holdout set. Even in competitions, please respond to the "leader board" adaptively.

Data and background: These papers (paper and paper) and code are good places to start. You are free to be creative in your questions.

Character level language models and sequence modeling

Recurrent neural nets and LTSMs are interesting. It is not at all obvious what is going on here, so there are host of open ended questions. Even an exploratory project comparing a few ideas/methods could make an interesting project.

Some interesting blog posts: Recurrent neural nets and n-gram models

Privacy

As data-driven prediction becomes increasingly pervasive, your personal data used to train models is becoming more and more exposed and in some cases can even be inferred, even after attempts at anonymization. Explore the theory and algorithms behind this interesting and societally important topic.

A survey to get you started: Differential Privacy and Machine Learning: a Survey and Review

Fairness

Machine learning is playing an unprecedented role in who gets loans, who gets admitted into what school, and who gets to keep their job. Many times we are concerned with not treating protected groups (e.g., race, gender, etc.) differently. Explore the different ways machine learning algorithms can increase fairness.

Blog post to get your started: post
You may also find Kathy O'Neil's Weapon's of Math Destruction book for a general audience interesting

Fake News

There is growing evidence that fake news on sites like Facebook had an impact on the outcome of the 2016 presidential election, and it is not a new problem. One group tried to see if they could use machine learning to automatically identify fake news. Explore why this problem is uniquely hard, and how methods have attempted to target it.

Fake new challenge: website