Your class project is an opportunity for you to explore an interesting machine learning / statistics problem of your choice in the context of a real-world big data set. We will provide some project ideas, but the best idea would be to combine the topics of this course with problems in your own research area. Your class project must be about new things you have done this semester; you can't use results you have developed in previous semesters.
Projects can be done individually or in teams of two students. Each project will also be assigned a project consultant/mentor, who will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 35% of your final class grade, and will have two final deliverables:
a writeup in the format of a NIPS paper (8 pages maximum in NIPS format, including references; this page limit is strict), due March 19 (emailed to the instructors list, NO LATE SUBMISSION ACCEPTED since we need to get your grades in), worth 60% of the project grade.
a poster presenting your work for Big Data class poster session on Friday March 15, 2-4pm in the Atrium of the Allen building, worth 20% of the project grade.
In addition, you must turn in a midway progress report (3 pages maximum in NIPS format, including references) describing the results of your first experiments by February 26, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
You must also turn in a brief project proposal (1-page maximum) by January 29 (Tuesday).
Read the list of available data sets and potential project ideas below. You are encouraged to use the data sets from the case studies. If you prefer to use a different data set, we will consider your proposal, but you must already have access to this data and present a clear proposal for what you would do with it.
Project proposal format: Proposals should be one page maximum. Include the following information:
Project idea. This should be approximately two paragraphs.
Software you will need to write.
Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
Teammate: will you have a teammate? If so, whom? Maximum team size is two students.
Milestone: What will you complete by the milestone (TBA)? Experimental results of some kind are expected here.
Here are some details on the poster format.
- We will provide poster boards that are 32x40. The two most common ways to make your poster are:
- You can create a huge slide (in powerpoint, using beamer, etc) and print it out at one of the poster printers available.
- You can create a bunch of "normal" presentation slides, print each one on a piece of (letter-sized) paper, and put them all together on a poster board. This is somewhat messier (and not as pretty, but we don't mind), and you don't need to wait for the special poster printer.
Network Datasets and collections:
- A collection of moderately-sized to larger datasets
- Twitter 2010 graph with 1.4B edges
- A collection of moderately sized datasets
- Trec datasets
- KDD cup datasets
- Click-through rate dataset from HW1
- Lemur project
- The Graph 500 Benchmark
- UW is undertaking one of the largest Hubble Space Telescope (HST) surveys to date, leading to data with 100 gigapixels. More details below.
- GraphLab: http://graphlab.org
- GraphChi: http://graphchi.org
- Hadoop: http://hadoop.apache.org/
- Spark: http://spark-project.org/
Large-scale clustering and social network analysis:
- Implement and evaluate distributed clustering and community detection algorithms. Use GraphLab or GraphChi to scale up algorithms.
- Implement a stochastic block model to analyze communities using Twitter data or a web graph (see papers related to scalable stochastic block models in the Readings)
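As a warm-up for either idea, it helps to have a simple sequential community detection baseline before scaling up with GraphLab or GraphChi. Below is a minimal sketch of label propagation (a standard baseline, not the stochastic block model itself); the toy graph is hypothetical.

```python
import random

def label_propagation(adj, n_iter=20, seed=0):
    """Community detection by label propagation: each node repeatedly
    adopts the most common label among its neighbors."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}       # start with one label per node
    nodes = list(adj)
    for _ in range(n_iter):
        rng.shuffle(nodes)             # randomize the update order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts, key=counts.get)
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:                # converged: every node is happy
            break
    return labels

# Toy graph: two triangles joined by a single bridge edge
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = label_propagation(adj)
```

The per-node update only reads neighbor labels, which is exactly the vertex-program structure GraphLab and GraphChi expect, so this baseline maps directly onto those frameworks.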
Large-scale matrix factorization:
- Implement matrix factorization approaches for collaborative filtering using a distributed framework, e.g., GraphLab, GraphChi, Spark, etc. Evaluate on a movie recommendation dataset or for topic modeling with Wikipedia data.
- Use Wikipedia information on movies plus Netflix data to create a predictor addressing the cold-start problem (see the paper by Menon and Elkan in the Readings)
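The core computation in both ideas is low-rank factorization of a sparse ratings matrix. A minimal single-machine SGD sketch (the ratings triples here are hypothetical; a real project would use Netflix- or MovieLens-scale data and a distributed framework):

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=8, lr=0.02, reg=0.05,
              epochs=200, seed=0):
    """Plain SGD matrix factorization: fit R ~= U V^T on observed
    entries only. `ratings` is a list of (user, item, value) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]           # prediction error on one entry
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Tiny hypothetical ratings matrix: 3 users x 3 items
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V = factorize(ratings, n_users=3, n_items=3)
```

Each update touches only one user row and one item row, which is why the computation shards naturally across machines in GraphLab or Spark.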
Large-scale hierarchical text classification:
- Using DMOZ/Wikipedia data, build a large-scale document classifier. Use hierarchy of topics for multi-task learning. Exploit and evaluate hashing and sketching techniques for scaling up, online learning and a range of classifiers. Use parallelism to scale up algorithms.
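The hashing trick mentioned above keeps the feature space (and hence the model) a fixed size regardless of vocabulary growth, which is what makes online learning over DMOZ/Wikipedia-scale text feasible. A minimal sketch (the bucket count and example document are illustrative choices):

```python
import hashlib

def hash_features(tokens, n_buckets=2 ** 18):
    """The hashing trick: map tokens to a fixed-size sparse vector
    without storing a vocabulary."""
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % n_buckets                      # bucket index
        sign = 1 if (h >> 1) % 2 == 0 else -1    # signed hashing reduces collision bias
        vec[idx] = vec.get(idx, 0) + sign
    return vec

doc = "large scale hierarchical text classification with hashing".split()
x = hash_features(doc)
```

A project would plug these sparse vectors into an online learner (e.g., logistic regression with SGD) and measure how accuracy degrades as `n_buckets` shrinks.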
Parallel algorithm implementation and analysis:
- Parallelize Gibbs sampling in LDA (see distributed LDA paper) or mixture of Gaussian application using GraphLab. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, Alex Smola (2012). "Scalable Inference in Latent Variable Models." Conference on Web Search and Data Mining (WSDM)
- Compare Shotgun, Hogwild! and distributed averaging on a range of datasets (see linked Shotgun paper in the Readings)
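For the LDA idea, the natural starting point is the sequential collapsed Gibbs sampler, which the project would then parallelize. A minimal single-machine sketch (the toy corpus and hyperparameters here are hypothetical):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of word-id documents."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # topic of each word
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                 # remove word's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional over topics for this word
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Toy corpus over a 4-word vocabulary with two obvious themes
docs = [[0, 1, 0, 1], [0, 1, 1, 0], [2, 3, 2, 3], [3, 2, 3, 2]]
ndk, nkw = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

The serialization bottleneck is the shared count tables `nkw` and `nk`; the distributed-LDA and Hogwild!-style approaches cited above differ mainly in how stale they allow those shared counts to become.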
Large-scale image classification or retrieval:
- Computing SIFT or other feature extraction for huge image datasets (data-parallel, ideal for Map-Reduce)
- Vector quantization and clustering (not data-parallel, may use GraphLab)
- Bag-of-words histogram representation + SVMs/logistic regression for classification or fast-approximate nearest neighbor methods for retrieval.
- Perform stochastic variational inference for an LDA model using a large document database such as the Wikipedia data from class (see stochastic variational inference papers in Readings)
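The vector-quantization and bag-of-words steps above fit together as: cluster local descriptors into a codebook, then represent each image as a histogram over code words. A minimal single-machine sketch (the 2-D "descriptors" are hypothetical stand-ins for SIFT vectors):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm for vector quantization: assign each descriptor
    to its nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, assign

def bow_histogram(descriptors, centers):
    """Quantize descriptors against the codebook; normalized counts."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Hypothetical descriptors: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(8, 0.3, (10, 2))])
centers, assign = kmeans(X, k=2)
hist = bow_histogram(X, centers)
```

The assignment step is embarrassingly data-parallel (Map-Reduce friendly), while the centroid update requires a global reduction, which is where a framework like GraphLab earns its keep.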
Spectral clustering of large image datasets. Related papers:
- Fast approximate spectral clustering, Donghui Yan, Ling Huang, and Michael I. Jordan, KDD09
- Parallel spectral clustering in distributed systems, Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and E. Y.Chang, Pattern Analysis and Machine Intelligence 2011
- Spectral grouping using the nystrom method, Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik, Pattern Anal. Mach. Intell. 2004
- Fast spectral clustering with random projection and sampling, Tomoya Sakai and Atsushi Imiya, MLDM09
- A Fast Incremental Spectral Clustering for Large Data Sets, Tengteng Kong, Ye Tian, Hong Shen, PDCAT11
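The common idea in several of these papers (especially Fowlkes et al.) is the Nystrom method: approximate the eigenvectors of the full n x n affinity matrix from a small set of landmark points. A minimal sketch, with degree normalization (the graph Laplacian step) omitted for brevity and a hypothetical two-blob dataset:

```python
import numpy as np

def nystrom_embedding(X, m=8, k=2, gamma=0.5):
    """Nystrom approximation of the top-k eigenvectors of an RBF
    affinity matrix, using m landmark points."""
    def rbf(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    # Stride-sample landmarks for determinism in this sketch
    # (the papers above study random and k-means-based sampling).
    idx = np.arange(len(X))[:: max(1, len(X) // m)][:m]
    L = X[idx]
    W_mm = rbf(L, L)                      # m x m landmark affinities
    W_nm = rbf(X, L)                      # n x m cross affinities
    vals, vecs = np.linalg.eigh(W_mm)
    top = np.argsort(vals)[-k:]           # k largest eigenpairs
    U = W_nm @ vecs[:, top] / vals[top]   # Nystrom extension to all n points
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# Hypothetical data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
emb = nystrom_embedding(X)
```

Running k-means on the rows of the embedding completes the clustering; the whole pipeline costs O(n m) kernel evaluations instead of O(n^2), which is the point for large image datasets.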
UW is undertaking one of the largest Hubble Space Telescope (HST) surveys to date, producing an image of the northern half of the large spiral galaxy M31. The final mosaic covers 414 HST pointings taken in 6 different filters ranging from the ultraviolet to the infrared portion of the electromagnetic spectrum. The spatial resolution of the imaging is 0.05 arcseconds over the 6 million square arcseconds of the survey. This exquisite resolution results in 120-megapixel images in each of the 6 bands. Since each band is imaged multiple times and there are large overlaps between fields for optimizing calibration, our full data set contains nearly 100 gigapixels.
There are a number of possible projects with this dataset, including ones around learning distance metrics and density estimation in high dimensions. Contact: Professor Magdalena Balazinska.