================================================================
CSE 344 -- Spring 2011
Lecture 23: Map-Reduce

READING ASSIGNMENT: Chapter 2 of Mining of Massive Datasets,
by Rajaraman and Ullman
(go to the course website --> Calendar/LectureNotes --> Readings)
================================================================

Google: paper published 2004
Free variant: Hadoop

Map-reduce = high-level programming model and implementation
for large-scale parallel data processing

================================================================

Data model: Files!

A file = a bag of (key, value) pairs

A map-reduce program:
   Input:  a bag of (input key, value) pairs
   [actually: the input is a collection of "elements" (records),
    and the application may split each one into key+value, but it
    is often convenient to think of it as (key, value) pairs]
   Output: a bag of (output key, value) pairs

Step 1: the MAP phase

   User provides the MAP function:
      Input:  one (input key, value) pair
      Output: bag of (intermediate key, value) pairs
   System applies the map function in parallel to all
   (input key, value) pairs in the input file

Step 2: the REDUCE phase

   User provides the REDUCE function:
      Input:  (intermediate key, bag of values)
      Output: bag of output values
   System groups all pairs with the same intermediate key,
   and passes the bag of values to the REDUCE function

Example: counting the number of occurrences of each word
in a large collection of documents

   map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
         EmitIntermediate(w, "1");

   reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
         result += ParseInt(v);
      Emit(AsString(result));

---------------

Map:    input: element (k,v)
        intermediate: bag of (i,w)
Reduce: input: (i, set of w's)
        output: (i, aggregate(w))

        MAP                              REDUCE

 -----------       -----------      -----------       -----------
 | (k1,v1) | ----> | (i1,w1) |      | (i1,sw1)| ----> | (i1,r1) |
 -----------   \   -----------      -----------       -----------
 | (k2,v2) | ----> | (i2,w2) |      | (i2,sw2)| ----> | (i2,r2) |
 -----------   \   -----------      -----------       -----------
 | (k3,v3) |    \  | (i3,w3) |      | (i3,sw3)| ----> | (i3,r3) |
 -----------     \ -----------      -----------       -----------
 |  ....   | ----> |  ....   |      |  ....   |       |  ....   |
 -----------       -----------      -----------       -----------

-------------

Document(did, word)          Output(word, cnt)

*** In class: how do we express Output = f(Document) in SQL?

================================================================

*** Example in class: given Document(did, word), compute an
"inverted file" Output(word, did*)

"Inverted file" = a file that associates to each word the list
of documents that contain it; used as an index
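A possible solution, sketched in the same pseudocode style as the
word-count example above (this is only a sketch: it treats each
(did, word) pair as one input element, and AsList is a made-up
helper that materializes the iterator into a list):

   map(String did, String word):
      // input: one (did, word) pair from Document
      EmitIntermediate(word, did);

   reduce(String key, Iterator values):
      // key: a word
      // values: the dids of all documents containing that word
      Emit(AsList(values));

Note that reduce performs no real aggregation here: the "aggregate"
for each word is simply the list of document ids that contain it.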
================================================================

The Combine function:
-- same type as Reduce
-- but applied right after the Map phase
-- goal: do a local reduction of the size of the data,
   before re-partitioning

================================================================

DFS  = Distributed File System
GFS  = Google File System (proprietary)
HDFS = Hadoop Distributed File System (free)

-- very large files (terabytes)
-- a file is divided into "chunks" (typically 64 MB)
-- each chunk is replicated at different compute nodes
   (typically 3 replicas)

================================================================

Map/Reduce Implementations

Google = the original one
Hadoop = freely available from Apache

Implementation of Map/Reduce

There is one master node
-- Master creates workers (= servers)
      Map workers
      Reduce workers
-- Master creates "map tasks" and "reduce tasks"
   Typically: one or more chunks per map task;
   fewer reduce tasks, R (may be user-specified)
-- Key-value pairs generated by each Map task are collected
   by a "master controller" and sorted by key
-- The set of key-value pairs is written to local disk,
   partitioned into R regions
-- Master assigns the regions to the R reduce tasks
-- Reduce workers read the regions from the map workers'
   local disks

Details
-- Worker failure: the master pings workers periodically;
   if one is down, it reassigns its work to the other
   workers --> good load balance
-- Choice of M (number of map tasks) and R (number of reduce
   tasks): larger is better for load balancing
   Limitation: the master needs O(M x R) memory

Backup tasks:
"Straggler" = a machine that takes an unusually long time to
complete one of the last tasks. E.g.:
-- a bad disk forces frequent correctable errors
   (30 MB/s --> 1 MB/s)
-- the cluster scheduler has scheduled other tasks
   on that machine
Stragglers are a main reason for slowdown
Solution: pre-emptive backup execution of the last few
remaining in-progress tasks

================================================================

Summary

Hides scheduling and parallelization details

However, very limited queries:
-- difficult to write more complex tasks
-- need multiple map-reduce operations

Solution: Pig Latin
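As a preview (a sketch only; the file names and field names below
are made up for illustration), the word-count task from the start
of the lecture becomes a few lines of Pig Latin, which the system
compiles into one or more map-reduce jobs:

   docs   = LOAD 'documents' AS (did, word);
   groups = GROUP docs BY word;
   counts = FOREACH groups GENERATE group AS word, COUNT(docs) AS cnt;
   STORE counts INTO 'output';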