For this project, you will work in groups of 2-3 to define and implement a MapReduce-style algorithm in Hadoop. At a high level, there are three major parts to this project:
The proposal should consist of only a few paragraphs, and contain the following information:
Make sure the names of all your group members are somewhere prominent at the beginning of the proposal.
This should be a few paragraphs of text describing background and goals of the project. The background should be detailed enough for the course staff to understand your problem space, specifically describe the algorithm/problem you are trying to implement/solve, and discuss any benefit you expect to gain from using MapReduce.
Try to be as specific as you can. Even at this stage of the project, the more specific you are, the less likely you are to be blindsided by an unexpected issue later.
This should be a bulleted list of concrete and measurable deliverables that will define a successfully completed project. Keep in mind that you must complete these items within 2-3 weeks and that this section will form the basis for how the course staff evaluates your project. So be sure to set your criteria appropriately!
We recommend defining a set of incremental deliverables instead of one big deliverable. For example, the completion criteria for a system to build an inverted index might be:
When the project is finished, the following will be true:
- Map nodes can read and process input data
- Reduce nodes will emit a word->(DocID, FilePosition) dictionary
- Stop words are identified and removed from the remainder of the indexing pipeline, but not removed from the word->(DocID, FilePosition) dictionary
- The entire system will be able to process up to M docs within N minutes
- ...
Don't state something like "when we are done, the project should be able to take a bunch of docs and output a good index." That's not specific enough.
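To make the example criteria more concrete, here is a minimal single-process sketch of the inverted-index pipeline in plain Python. It is not Hadoop code, only an illustration of what each deliverable could mean in practice. The function names, the stop-word list, and the choice of character offset as the FilePosition are all assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of"}


def map_document(doc_id, text):
    """Map step: emit (word, (doc_id, position)) for each word in one document.

    Assumption: FilePosition is the character offset where the word starts.
    """
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group().lower(), (doc_id, match.start())


def reduce_postings(pairs):
    """Shuffle and reduce steps: group postings by word and sort each group."""
    grouped = defaultdict(list)
    for word, posting in pairs:
        grouped[word].append(posting)
    return {word: sorted(postings) for word, postings in grouped.items()}


def build_index(docs):
    """Run the map step over every document, then reduce into one dictionary."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_document(doc_id, text)
    )
    return reduce_postings(pairs)


def without_stop_words(index):
    """Build the view passed to later pipeline stages.

    Stop words are dropped here, but the full dictionary is left untouched.
    """
    return {word: posts for word, posts in index.items() if word not in STOP_WORDS}


docs = {1: "the cat sat", 2: "a cat of the house"}
index = build_index(docs)
downstream = without_stop_words(index)

print(index["cat"])        # postings for "cat" across both documents
print(sorted(downstream))  # words that continue through the pipeline
```

Each function lines up with one criterion, so each criterion can be checked on its own: the map step reads input, the reduce step emits the dictionary, and stop-word handling is a separate, testable stage.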
The previous completion criteria section defined minimal functionality for your project; in this section, let your imagination run wild. List at least 3 extensions to your project which would be neat to implement, and for each one give a description of several sentences explaining how you might implement it.
Discuss your required input data and where you expect to get the data from (e.g., "I need a webcrawl, and Alden's already got it on the cluster"). If you have any additional required resources -- especially resources which are not being provided by the class! -- please also describe them in this section (e.g., "I need a second cluster for XYZ processing").
This section can be blank. If you have any concerns or thoughts about the project that don't fit into a previous section, write them here.