--------- Section 10: Map-Reduce Algorithms -- From chapter http://i.stanford.edu/~ullman/mmds/ch2.pdf -- sections 2.3.3. -- 2.3. 9. We study basic algorithms for the Map-Reduce framework. The basic operations: Map : read the input, and produce key-value pairs: (key,value) Reduce : take a key-list pair: (key,[t1, ..., tk]) and produces another key-list pair: (key,[t1', ..., tl']) ------------------------------------------------- 1. Matrix-Vector Multiplication ------------------------------------------------- How do we multiply a (n x n) matrix M with a vector v ? Map : each map task takes entire vector v (why is this possible?) and a chunk of the matrix M. Then, it produces the key-value pair (i, m_ij x v_j ) for each entry m_ij Reduce : every reduce task gets an entire row (i-th row). It then sums all the elements to get the value of the result vector at coordinate i. Result (i, x_i) Q1: does it matter how the matrix is initially divided between the servers? Q2: what happens if the vector v is too big to fit into the memory? Q3: how is the computation possible if n = 10^10 (that is the size of the WWW) ------------------------------------------------- 2. Relational Algebra Operators ------------------------------------------------- --> Selections: (easy, try to think how it is done. Do we even need the reduce step?) --> Projections: Map : for each tuple t, project to get t' and output: (t',t') Reduce : keep just one from each list (t', [t',..., t']) Q4: what is a combiner? How can we use it in this case? --> Union / Intersection Map : for any tuple t, output the key-value pair: (t,t) Reduce : [UNION] how many values will each key have? What do we need to do? [INTERSECTION] again, how many values? --> Difference (R - S) A tuple is in (R - S) if it is in R and not in S. Hence, we must "label" eaach tuple in the mapper Map : If t belongs in R (t,R) else in S (t,S) Reduce : output a tuple t only if the list is [R]. What are the other possible outcomes for the list? --> Join: R(A,B) with S(B,C) Map : for tuple R(a,b) produce: (b,(R,a)) and for S(b,c) produce: (b,(S,c)) Reduce : the list will have R and S-tuples. Compute the two sublists of the initial list and then create all the possible pairs. Q5: what can we do if we want to save some space on the join result? Q6: what happens when there is only one joining value? Q7: how would we compute the join R(A,B) S(A,C) T(A,D) ? Q8: can we do something more intersting if one R >> S ? --> Grouping and Aggregation Map : use as key the attributes that are used for the group by Reduce : apply the aggregation operator to the list Q9: what happens when we need to count everything, e.g. the size of the relation or sum a column ?