In any reasonably complex system, it's rare that there is a single so-called
"correct" answer; more often, the final solution is a balance of various
tradeoffs. Oftentimes, a choice must be made among system complexity,
reliability guarantees (where "reliability" may mean several things,
such as data consistency or system uptime), and runtime (where "runtime"
may mean several things, such as latency or bandwidth).
With that in mind, please answer the following questions when reading the
MapReduce paper. If you're not used to diving directly into academic papers,
this reader might provide a gentler introduction to MapReduce's underlying
concepts.
- Name two possible uses for MapReduce besides WordCount. Do not list
  problems that calculate a specific attribute of the words in a
  document corpus, e.g., DistributedGrep or WhitespaceCount.
- What kind of constraints does MapReduce place on its problem domain?
  Said another way, what applications would not work in MapReduce?
  E.g., could you use MapReduce for Amazon's shopping cart? How about
  SETI@Home? An MMORPG?
- Besides MapReduce, what is another possible strategy for parallelizing
computation over a large document corpus? What are the strengths and
weaknesses of that system relative to MapReduce?
- What are the scarce resources in a MapReduce system (e.g., disk space,
  personnel, CPU)? What is the absolute limit for adding those resources
  into a MapReduce system (e.g., if skilled personnel were the limiting
  resource, then the hard limit would be the number of skilled labourers)?
  Assuming that limit were reached, how could MapReduce be tweaked to
  reduce dependency on that resource?
- * Bonus question: In an earlier lecture, we gave an example
  of shuffling the sentences in a document set by assigning each sentence
  a randomly-generated key. While this solution works in the theoretical
  MapReduce model, it is not guaranteed to work in practice. Why?
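To make the bonus question concrete, here is a minimal single-process sketch (in Python) of the random-key shuffling scheme it describes. The function names and the in-memory shuffle are illustrative assumptions for this sketch, not part of the MapReduce paper; in a real deployment the shuffle happens across machines, which is exactly where you should look for the catch.

```python
import random
from collections import defaultdict

def map_phase(sentences):
    # Map: emit each sentence under a randomly generated key.
    return [(random.random(), s) for s in sentences]

def shuffle_phase(pairs):
    # Shuffle: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: flatten the grouped sentences back into a single list.
    out = []
    for _, values in groups:
        out.extend(values)
    return out

sentences = ["the cat sat", "on the mat", "and purred", "all day long"]
shuffled = reduce_phase(shuffle_phase(map_phase(sentences)))
# shuffled is a permutation of sentences, ordered by the random keys.
```

The output is always some permutation of the input, so the scheme is correct in the abstract model. When answering the question, consider what assumptions this sketch quietly makes that a distributed, fault-tolerant execution does not honor.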