CSE 546: Mini-project Guidelines
Instead of a final exam, you should complete a mini-project. It can be on any ML-related topic, including those we have not covered in class. Examples are listed below.
Given the time constraints, the goal is to put in roughly as much time as it would take to do 1.5-2 homeworks. Of course, this is more challenging since you also have to define the project yourself. But we hope it is also much more fun!
Collaboration: You can work alone or in groups of two. Groups are expected to do twice the work.
Proposal Date: Send Luke a short email describing your proposed project as soon as you have an idea, but definitely before the end of day on Monday, Feb. 26. Please come to office hours, or contact us if you need help deciding on a topic.
Due Date: Friday Mar. 16th, 5pm.
Submit: A final project report and a single compressed file containing source code, with instructions describing how it should be run. The project report should be in PDF format with no more than 4 pages of primary content. You are allowed unlimited space for citations and appendices, starting on page 5, but your story should be complete and understandable without reading this extra material. Group projects can have 6 pages of primary content. You should upload the files to the CSE 546 DropBox.
Project Ideas: A strong project will demonstrate understanding of topics in machine learning that are beyond the scope of what we covered in class. This can be done by, for example:
- Implementing an algorithm that we didn't have time to cover, from an ML book or research paper (see list below). As an intermediate step, be sure to demonstrate interesting learning behavior on toy or simulated data. Here, you might explore issues such as overfitting, model selection, etc. A further goal would be to replicate the results from a paper, but this can be surprisingly difficult to achieve in practice.
- Applying an existing algorithm to a new problem (see list of software below). In this case, you are welcome to use data from your own research. A strong project would carefully describe the new problem, explain why the application is appropriate, report the results achieved, and include a summary of what was learned from the exercise. Negative results can be interesting if you describe why you originally thought the approach would work.
- Teaching yourself new ML topics and completing existing homework assignments in this area. For example, you might study reinforcement learning and complete the Pac-man RL homework from last year's CSE 573 course. Any topic is fine, and you could also design your own assignment, as long as it demonstrates that you learned new ML topics. In this case, you could submit a relatively short write-up describing what you did, along with the homework and your solutions.
- Other ideas of similar size and complexity are welcome. Feel free to pitch them to Luke if you are unsure.
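To give a sense of the kind of toy experiment the first option describes, here is a minimal sketch (purely illustrative, not a required approach or a template): fitting polynomials of increasing degree to noisy simulated data and comparing training and held-out error, which is one simple way to demonstrate overfitting and motivate model selection. All names and parameters here are our own choices for illustration.

```python
# Toy overfitting / model-selection demo on simulated data.
# Data: y = sin(2*pi*x) + Gaussian noise; models: polynomials of varying degree.
import numpy as np

rng = np.random.default_rng(0)

# Simulate 30 noisy samples of a sine curve.
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out the last 10 points as a test set for model selection.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def poly_fit_mse(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Training error keeps falling as the degree grows, while test error
# eventually rises: the classic overfitting picture.
for d in (1, 3, 9, 15):
    tr, te = poly_fit_mse(d)
    print(f"degree {d:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

A write-up built around an experiment like this would plot both error curves against model complexity and discuss where, and why, the gap between them opens up.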
Research Papers (in no particular order; feel free to suggest others)
Supervised Classification
- Ryan Rifkin and Aldebaro Klautau, In Defense of One-vs-All Classification. Journal of Machine Learning Research, Volume 5 (Jan): 101-141, 2004.
- Andrew Y. Ng and Michael I. Jordan, On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Advances in Neural Information Processing Systems, 2001.
- Yoav Freund and Robert E. Schapire, Large Margin Classification Using the Perceptron Algorithm. Machine Learning, Volume 37, Issue 3, 1999.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, Volume 26, Issue 5, 1998.
- Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning, 1998.
Semi-supervised Learning
Unsupervised Learning
- Lawrence K. Saul and Sam T. Roweis, Think Globally, Fit Locally: Unsupervised Learning of Low-Dimensional Manifolds, Journal of Machine Learning Research, Volume 4, 119-155, 2003.
- Michael E. Tipping and Christopher M. Bishop, Probabilistic Principal Component Analysis, Journal of the Royal Statistical Society, Series B, Volume 61, Part 3, 611-622, 1999.
Machine Learning and Vision
Structured Prediction Models for Tagging in NLP
E-mail Spam Filtering
Software Packages (feel free to suggest others)