One of the main goals of CSEP 590A is to prepare you to apply state-of-the-art data mining tools and algorithms to an application. If you are interested in research, CSEP 590A will also leave you well-qualified to do data mining research. The class's final project will offer you an opportunity to do exactly this.
Students can (and are strongly encouraged to) work in teams of up to three people. If you have a project of such large scope and ambition that it cannot be done by a team of three, you may propose doing the project in a team of four. Please sign up with your group information using the Google form.
Your first task is to pick a project topic. If you are looking for project ideas, please come to Prof. Althoff's or the TAs' office hours, and we'd be happy to brainstorm and suggest some ideas.
In the meantime, there can be three kinds of course projects:
Ideally, projects will be a mix of the three types of projects outlined above. As with the reaction paper, the project should contain at least some amount of mathematical analysis, and some experimentation on real or synthetic data.
Many fantastic class projects come from students picking either an application/dataset that they're interested in, or picking some sub-field of data mining that they want to explore more, and working on that as their project. If you haven't worked on a research project before but would like to, you can also use this as an opportunity to try your hand at it. (Just be sure to ask us for help if you're uncertain how to best get started.)
Alternatively, if you're already working on a research project that data mining might be applicable to, then working out how to apply data mining techniques to it will often make a very good project topic. Similarly, if you currently work in industry and have an application on which data mining might help, that could also make a great project.
A very good CSEP 590A project will comprise a publishable or nearly-publishable piece of work. Each year, some number of students continue working on their projects after completing CSEP 590A, and submit their work to a conference or journal.
Projects will be evaluated based on:
Lastly, a few words of advice: Many of the best class projects come from students working on topics that they're excited about. So, pick something that you can get excited and passionate about! Be brave rather than timid, and do feel free to propose ambitious things that you're excited about. Finally, if you're not sure what would or would not make a good project, please also feel strongly encouraged to either email us or come to office hours to talk about project ideas.
There are four deliverables:
You can work in groups of up to 3 people on the project.
This assignment consists of two parts. The first part is a ~2-page reaction paper to several published research papers, and the second part is a ~2-page proposal for the project you want to pursue for this class. The two parts should be related: your reactions to the published papers should inform the project you will work on.
Proposals will be evaluated subjectively based on the following criteria (same criteria as listed in the above section for final project evaluation):
Please use the NeurIPS 2019 template as given here or here. If you are not using LaTeX, you may follow the formatting instructions given in Section 2 of the links.
Part 1: Reaction paper component (~2 pages). The course is based on material from the last few years. This means that most of it is in the form of research papers, which raise a lot of interesting issues that have yet to be explored. The goal of the reaction paper section is for students to familiarize themselves in more depth with the material covered in class and to read beyond what was covered in class.
Students will pick at least three papers that are clearly related to course topics (e.g., mentioned in class or office hours; if in doubt, check with course staff about the papers you aim to read). These papers should go beyond material that was covered in detail in a lecture (i.e., don't only discuss required readings or the textbook). Students should carefully read the papers and write a short (approximately 2 pages) reaction paper about the content of the chosen papers. You should think beyond what you read, and not just take other people's work for granted. The reaction part of the paper should address the following questions:
Reaction papers should not just be summaries of the papers you read. The last two bullets should form the most substantial part of the document. Answering these questions can be a very good way to explore a potential project topic. The reaction paper should conclude with a section describing some promising further research directions and questions, and how they could be pursued.
In prior versions of the course, the reaction paper has been a very good way to explore a potential project topic.
Part 2: Project proposal (~2 pages). The project proposal component should build on the reaction paper component. The purpose of the reaction paper is to survey the related work, identify the strengths and weaknesses of the papers, and consider how the weaknesses may be addressed. The proposal should then focus on promising further research directions and questions: How precisely do you plan to pursue them? What methods and data do you plan to use? You should try to provide a concrete proposal for a model or algorithm that potentially extends or improves the topics discussed in the papers you've read.
When writing the proposal, you should try to answer the following questions:
Some other points to note:
We strongly encourage you to work in groups of 3 people. It is hard for us to balance the grading based on the group size. This means that projects will be graded about the same regardless of how many people are in the group -- working in groups is strongly encouraged!
The final project report should be a 5-10 page paper with an introduction, related work, approach, results, and conclusion. We will not accept reports longer than 10 pages (the page count includes figures, but excludes references). At the end of the report, you should also highlight the contributions of individual team members to the project (in the format outlined below). The project report should contain at least some amount of mathematical analysis, and some experimentation on real or synthetic data.
Course staff will use the following guidelines when grading your final project write-ups. Keep in mind, however, that if there is a good reason why your project doesn't match the rubric below, we will take that into consideration when grading your report. For example, we recognize that purely theoretical or pure data analysis projects may not fit the rubric below perfectly, and that depending on your project you may want to swap the ordering of certain sections. But hopefully all projects can be roughly mapped to the criteria below:
Unlike the project proposal and milestone, we plan to assign individual scores to team members for the final project report. We observed that there is a skewed distribution of work in some of the teams and would like to take that into account when we are grading. Your score for the final report will now be a function of two aspects:
In order to be able to assign such individual scores, we want you to write a brief summary of the
individual contributions of each team member, in the format outlined below, at the end of each report:
Example:
----------------
Team Member 1: Plotting graphs during data analysis, crawling the data, preliminary data analysis
Team Member 2: Problem formulation, writing up the report, coming up with the algorithm
Team Member 3: Coding up the algorithm, running tests, tabulating final results
---------------
If you fail to outline individual contributions at the end of your report, we will assign an equal score to all
team members.
The goal of the project presentation session is to give you a chance to see what your classmates have been working on. Instead of an in-person poster session, we will be holding a virtual project presentation session on Zoom. We will divide the project groups into two parallel sessions, each of which will be moderated by different TAs. In each session, we will go through the individual presentations, which consist of a 7-minute prerecorded video, along with a 5-minute Q&A session.
Many of the students in this course are actively conducting research in data mining and machine learning, so we welcome and even encourage projects that align with research goals beyond this course. However, it is critical that projects define what they will specifically accomplish in the scope of the course. The course project must stand on its own, not merely be a snapshot of an outside research process.
Here are example projects from the CSE 547 version of the course in Spring 2019, 2020, and 2021. For this course, your project can be, but does not have to be, network-related, and we especially focus on projects with datasets of non-trivial size. You should not be able to trivially solve your project problem quickly on your laptop. Your project should involve at least one highly non-trivial component related to this course.
Project Title | Student(s) | Resources |
---|---|---|
Transformer-based models for protein structure generation | Abhijit Bhatnagar, David Juergens, Prashant Rangarajan | Report |
Fake News Detection on Social Media | Anton Lykov, Sandeep Tiwari, Venkata Sai Muktevi | Report |
Clustering Covid-19 Viral Genomes and Spike Protein Sequences | Aishwarya Mandyam, Cheng Ni, Jack Khuu | Report |
Collaborative Filtering using a Trust Social Network with Deep Learning for Initialization | Weijia Zhang, Reed Zhang, Zhitao Yu | Report |
Evaluating Identity and Arguments Online | Amanda Baughan, Elena Khasanova, Erik Tomasic | Report |
Gene Expression Data Mining of ARCHS4 Project Proposal | Aleksander Braksator, Qinai Xu, Haobo Zhang | Report |
Champion Recommender System For League of Legends | Young Seok Kim, Pradeep Prabhakar, Aniruddha Dutta | Report |
Modeling Wellness using Smartphone and Activity Data | Galen Weld, Orson Xu, Ather Sharif | Report |
Overlapping Community Detection via Edge-space Representation | Ruojin He, Fengjie Chen, Yikun Zhang | Report |
Air Pollution Mapping and Prediction | Mingyu Wang, Su Ye, Gaurav Mahamun | Report |
BITSCOPE: Scaling Bitcoin Address De-anonymization using Multi-resolution Chaining | Tianyi Zhou, Zhitong (Mia) Xie, Zhen Zhang | Report |
Transformer Based Video Matting | Jackson Stokes | Report |