Project 2: Final Project

Objectives

The objectives for this project, in decreasing order of importance, are:

Gain experience designing and implementing code for MapReduce
Play with a medium-sized cluster
Do something that you find neat

Description

For this project, you will work in groups of 2-3 to define and implement a MapReduce-style algorithm in Hadoop. At a high level, there are three major parts to this project:

Project proposal. Together with feedback from the course staff, you will write a proposal detailing the project you want to implement. If you don't have an idea for your project, you might want to look at the sample offline query processing project.
Implementation. Your project should take you approximately 2-3 weeks to complete. Extensions are purely optional.
Presentation and review. You will have an opportunity to present your work to Google engineers and other students, and to reflect on the project as a whole.

Project Proposal

The proposal should consist of only a few paragraphs, and contain the following information:

Names of Group Members
Project Summary
Completion Criteria
Possible Extensions
Data Set and Resources
Concerns and Comments

Names of Group Members

Make sure the names of all your group members are somewhere prominent at the beginning of the proposal.

Project Summary

This should be a few paragraphs of text describing the background and goals of the project. The summary should give enough background for the course staff to understand your problem space, describe specifically the algorithm or problem you are trying to implement or solve, and discuss any benefit you expect to gain from using MapReduce.

Try to be as specific as you can. Even at this stage of the project, the more specific you are, the less likely you are to be blindsided by an unexpected issue later.

Completion Criteria

This should be a bulleted list of concrete and measurable deliverables that will define a successfully completed project. Keep in mind that you must complete these items within 2-3 weeks and that this section will form the basis for how the course staff evaluates your project, so be sure to set your criteria appropriately!

We recommend defining a set of incremental deliverables instead of one big deliverable. For example, the completion criteria for a system to build an inverted index might be:

When the project is finished, the following will be true:

Map nodes can read and process input data
Reduce nodes will emit a word->(DocID, FilePosition) dictionary
Stop words are identified and removed from the remainder of the indexing pipeline, but not removed from the word->(DocID, FilePosition) dictionary
The entire system will be able to process up to M docs within N minutes
...

Don't state something like "when we are done, the project should be able to take a bunch of docs and output a good index." That's not specific enough.
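To make the first two inverted-index criteria above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_document, reduce_postings, build_index) are hypothetical, and it assumes that FilePosition means the character offset where a word starts. Stop-word handling and the M-docs-in-N-minutes criterion are deliberately left out.

```python
import re
from collections import defaultdict

def map_document(doc_id, text):
    """Map step: emit one (word, (doc_id, position)) pair per word occurrence.

    Assumption: position is the character offset of the word within the text.
    """
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group().lower(), (doc_id, match.start())

def reduce_postings(word, postings):
    """Reduce step: gather every posting for one word into a sorted list."""
    return word, sorted(postings)

def build_index(docs):
    """Simulate map, shuffle (group by word), and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_document(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_postings(word, postings) for word, postings in grouped.items())

# Tiny made-up corpus, for illustration only.
docs = {"d1": "the cat sat", "d2": "the dog sat down"}
index = build_index(docs)
print(index["sat"])  # [('d1', 8), ('d2', 8)]
```

A check such as "index['sat'] lists every document containing 'sat' with the correct offset" is the kind of concrete, testable statement a completion criterion should reduce to.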

Possible Extensions

The previous completion criteria section defined the minimal functionality for your project; in this section, let your imagination run wild. List at least 3 extensions to your project that would be neat to implement, and for each one give a description, several sentences long, of how you might implement it.

Data Set and Resources

Discuss your required input data and where you expect to get the data from (e.g., "I need a webcrawl, and Alden's already got it on the cluster"). If you have any additional required resources -- especially resources which are not being provided by the class! -- please also describe them in this section (e.g., "I need a second cluster for XYZ processing").

Concerns and Comments

This section can be blank. If you have any concerns or thoughts about the project that don't fit into a previous section, write them here.

Dates

Draft Proposal - Due Fri Jan 19, 2007 @ 12 noon

By 6pm Thursday, e-mail the instructors (awong at cs and hannahtang at goog) the initial draft of your proposal. The initial draft only needs to be a few paragraphs long. Only one draft is required per group.

Final Proposal - Due Wed Jan 24, 2007 @ 6pm

The final proposal, which may include tweaks recommended by the instructors, is due by e-mail (aldenk at cs). Only one draft is required per group. You do not need to wait until this proposal has been submitted to begin working on your project; feel free to start anytime after you have received feedback from the instructors regarding your initial proposal.

Project Completion - Mid-February

This date is still TBD.

Project Presentation - Mid-February

This date is still TBD.
