Project 2: Final Project

Objectives

The objectives for this project, in decreasing order of importance, are:
  1. Gain experience designing and implementing code for MapReduce
  2. Play with a medium-sized cluster
  3. Do something that you find neat

Description

For this project, you will work in groups of 2-3 to define and implement a MapReduce-style algorithm in Hadoop. At a high level, there are three major parts to this project:

Project Proposal

The proposal should be only a few paragraphs long and contain the following information:

  1. Names of Group Members
  2. Project Summary
  3. Completion Criteria
  4. Possible Extensions
  5. Data Set and Resources
  6. Concerns and Comments

Names of Group Members

Make sure the names of all your group members are somewhere prominent at the beginning of the proposal.

Project Summary

This should be a few paragraphs of text describing the background and goals of the project. The background should be detailed enough for the course staff to understand your problem space. Describe specifically the algorithm you are trying to implement or the problem you are trying to solve, and discuss any benefit you expect to gain from using MapReduce.

Try to be as specific as you can. Even at this stage of the project, the more specific you are, the less likely you are to be blindsided by an unexpected issue later.

Completion Criteria

This should be a bulleted list of concrete and measurable deliverables that will define a successfully completed project. Keep in mind that you must complete these items within 2-3 weeks and that this section will form the basis for how the course staff evaluates your project, so be sure to set your criteria appropriately!

We recommend defining a set of incremental deliverables instead of one big deliverable. For example, the completion criteria for a system to build an inverted index might be:

When the project is finished, the following will be true:

Don't state something like "when we are done, the project should be able to take a bunch of docs and output a good index." That's not specific enough.
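
For anyone unfamiliar with the inverted-index example used above, here is a minimal sketch of what such a job computes. It is an in-memory illustration in plain Python, not Hadoop code, and the function names (map_document, shuffle, reduce_postings, build_inverted_index) and the sample documents are hypothetical, made up for this sketch.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    # Assumption: words are whitespace-separated; strip simple punctuation and lowercase.
    words = {w.strip(".,!?;:\"'()").lower() for w in text.split()}
    return [(w, doc_id) for w in sorted(words) if w]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(word, doc_ids):
    """Reduce step: collapse the document ids for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_document(doc_id, text)]
    return dict(reduce_postings(w, ids) for w, ids in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {
        "d1": "The quick brown fox",
        "d2": "The lazy dog",
        "d3": "Quick, quick dog!",
    }
    index = build_inverted_index(docs)
    print(index["quick"])  # ['d1', 'd3']
    print(index["dog"])    # ['d2', 'd3']
```

A specific criterion can be phrased against concrete output of this kind, for instance by naming the input, the output format, and a check that can be run on both.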

Possible Extensions

The previous completion criteria section defined the minimal functionality for your project; in this section, let your imagination run wild. List at least 3 extensions to your project that would be neat to implement, and for each one give a few sentences describing how you might implement it.

Data Set and Resources

Discuss your required input data and where you expect to get the data from (e.g., "I need a webcrawl, and Alden's already got it on the cluster"). If you have any additional required resources -- especially resources which are not being provided by the class! -- please also describe them in this section (e.g., "I need a second cluster for XYZ processing").

Concerns and Comments

This section can be blank. If you have any concerns or thoughts about the project that don't fit into a previous section, write them here.

Dates

Draft Proposal - Due Fri Jan 19, 2007 @ 12 noon

By 6pm Thursday, e-mail the instructors (awong at cs and hannahtang at goog) the initial draft of your proposal. The initial draft only needs to be a few paragraphs long. Only one draft is required per group.

Final Proposal - Due Wed Jan 24, 2007 @ 6pm

The final proposal, which may include tweaks recommended by the instructors, is due by e-mail (aldenk at cs). Only one draft is required per group. You do not need to wait until this proposal has been submitted to begin working on your project; feel free to start anytime after you have received feedback from the instructors regarding your initial proposal.

Project Completion - Mid-February

This date is still TBD.

Project Presentation - Mid-February

This date is still TBD.

 

 

