================================================================
CSE 344 -- Spring 2011
Lecture 23: Map-Reduce

READING ASSIGNMENT: Chapter 2 of Mining of Massive Datasets,
by Rajaraman and Ullman
(go to the course website --> Calendar/LectureNotes --> Readings)
================================================================

Google: paper published 2004
Free variant: Hadoop

Map-reduce = high-level programming model and implementation
for large-scale parallel data processing

================================================================

Data model: Files!

A file = a bag of (key, value) pairs

A map-reduce program:
   Input:  a bag of (input key, value) pairs
   [actually: the input is a collection of "elements" (records),
    and the application may split each one into key+value, but it
    is often convenient to think of it as (key, value) pairs]
   Output: a bag of (output key, value) pairs

Step 1: the MAP phase

   User provides the MAP function:
      Input:  one (input key, value) pair
      Output: bag of (intermediate key, value) pairs
   System applies the map function in parallel to all
   (input key, value) pairs in the input file

Step 2: the REDUCE phase

   User provides the REDUCE function:
      Input:  (intermediate key, bag of values)
      Output: bag of output values
   System groups all pairs with the same intermediate key,
   and passes the bag of values to the REDUCE function

Example: counting the number of occurrences of each word
in a large collection of documents

   map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
         EmitIntermediate(w, "1");

   reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
         result += ParseInt(v);
      Emit(AsString(result));

---------------

Map:    input: element (k,v)
        intermediate: bag of (i,w)
Reduce: input: (i, set of w's)
        output: (i, aggregate(w))

        MAP                              REDUCE

 -----------       -----------      -----------       -----------
 | (k1,v1) | ----> | (i1,w1) |      | (i1,sw1)| ----> | (i1,r1) |
 -----------   \   -----------      -----------       -----------
 | (k2,v2) | ----> | (i2,w2) |      | (i2,sw2)| ----> | (i2,r2) |
 -----------   \   -----------      -----------       -----------
 | (k3,v3) |    \  | (i3,w3) |      | (i3,sw3)| ----> | (i3,r3) |
 -----------     \ -----------      -----------       -----------
 |  ....   | ----> |  ....   |      |  ....   |       |  ....   |
 -----------       -----------      -----------       -----------

-------------

Document(did, word)          Output(word, cnt)

*** In class: how do we express Output = f(Document) in SQL?

================================================================

*** Example in class: given Document(did, word), compute an
"inverted file" Output(word, did*)

"Inverted file" = a file that associates to each word the list
of documents that contain it; used as an index
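A possible solution, sketched in the same pseudocode style as the
word-count example above (this is only a sketch: it treats each
(did, word) pair as one input element, and AsList is a made-up
helper that materializes the iterator into a list):

   map(String did, String word):
      // input: one (did, word) pair from Document
      EmitIntermediate(word, did);

   reduce(String key, Iterator values):
      // key: a word
      // values: the dids of all documents containing that word
      Emit(AsList(values));

Note that reduce performs no real aggregation here: the "aggregate"
for each word is simply the list of document ids that contain it.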
================================================================

The Combine function:
-- same type as Reduce
-- but applied right after the Map phase
-- goal: do a local reduction of the size of the data,
   before re-partitioning

================================================================

DFS  = Distributed File System
GFS  = Google File System (proprietary)
HDFS = Hadoop Distributed File System (free)

-- very large files (terabytes)
-- a file is divided into "chunks" (typically 64 MB)
-- each chunk is replicated at different compute nodes
   (typically 3 replicas)

================================================================

Map/Reduce Implementations

Google = the original one
Hadoop = freely available from Apache

Implementation of Map/Reduce

There is one master node
-- Master creates workers (= servers)
      Map workers
      Reduce workers
-- Master creates "map tasks" and "reduce tasks"
   Typically: one or more chunks per map task;
   fewer reduce tasks, R (may be user-specified)
-- Key-value pairs generated by each Map task are collected
   by a "master controller" and sorted by key
-- The set of key-value pairs is written to local disk,
   partitioned into R regions
-- Master assigns the regions to the R reduce tasks
-- Reduce workers read the regions from the map workers'
   local disks

Details
-- Worker failure: the master pings workers periodically;
   if one is down, it reassigns its work to the other
   workers --> good load balance
-- Choice of M (number of map tasks) and R (number of reduce
   tasks): larger is better for load balancing
   Limitation: the master needs O(M x R) memory

Backup tasks:
"Straggler" = a machine that takes an unusually long time to
complete one of the last tasks. E.g.:
-- a bad disk forces frequent correctable errors
   (30 MB/s --> 1 MB/s)
-- the cluster scheduler has scheduled other tasks
   on that machine
Stragglers are a main reason for slowdown
Solution: pre-emptive backup execution of the last few
remaining in-progress tasks

================================================================

Summary

Hides scheduling and parallelization details

However, very limited queries:
-- difficult to write more complex tasks
-- need multiple map-reduce operations

Solution: Pig Latin
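As a preview (a sketch only; the file names and field names below
are made up for illustration), the word-count task from the start
of the lecture becomes a few lines of Pig Latin, which the system
compiles into one or more map-reduce jobs:

   docs   = LOAD 'documents' AS (did, word);
   groups = GROUP docs BY word;
   counts = FOREACH groups GENERATE group AS word, COUNT(docs) AS cnt;
   STORE counts INTO 'output';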