lec 10: MapReduce

administrivia

hw2 out (step 1 readdir)

midterm: next Friday; review next Thursday

next reading: The Night Watch

today’s plan

inheritance & virtual functions

MapReduce

inheritance

review CSE 143

“is-a” relationship between classes

code reuse & extensibility

example

class A {
public:
    virtual void foo() {
        std::cout << "A" << std::endl;
    }
};

class B : public A {
public:
    virtual void foo() override {
        std::cout << "B" << std::endl;
    }
};

virtual

override controls: override & final

dynamic & static dispatch

...
int main() {
    A a;
    B b;
    A *p = &b;
    A &r = b;
    a.foo();
    b.foo();
    p->foo();
    r.foo();
}

q: what’s the output?

q: what if we remove all the virtuals (and override)?

MapReduce

large-scale distributed data processing at Google

inspired by functional programming

OSDI 2004, by Jeff Dean and Sanjay Ghemawat

other Google systems: GFS, BigTable, Spanner, …

influence: Hadoop, Dryad, Spark, …

(Feb 26, CSE distinguished lecture series, Jeff Dean)

programming model

split input -> (k1, v1)’s

map(k1, v1) → list (k2, v2)

shuffle map output to reduce

reduce (k2, list (v2)) → list (v3)

collect reduce output

word count

map(name, doc):
    for word in doc:
        emitIntermediate(word, "1")

reduce(word, counts):
    result = 0
    for c in counts:
        result += int(c)
    emit(word, str(result))

programmers extend Mapper and Reducer classes

implement the two virtual functions

the MapReduce library invoke the two functions

word count example

input document

You wake up, segfaults

Post up, segfaults

Ride round in it, segfaults

Flossin on that, segfaults

M = 4 mappers, R = 2 reducers

partitioning function: strlen(w) mod 2

workflow

user program splits input into M pieces

the master picks M mappers and R reducers

each mapper partitions output to R pieces & notifies the master

the master notifies reducers to read data from mappers

each reducer notifies the master when done

the master notifies user program

performance

number of jobs: M + R

input/output to GFS

disk I/O

network: M mappers to R reduces

stragglers: replicate - run backup tasks (Figure 3)

see also: electricity & cooling; Facebook data center in Sweden

video1, video2, Bloomberg

failure

commodity machines

assumption: workers are fail-stop (may reboot)

no malicious machines

what happens if mapper/reducer/master/network fails?

earthquake (all machines in one datacenter down)?

deal with failures

rerun: TCP, map-reduce

replicate: GFS

why okay to rerun: functional - the same output!

improvements

language (other than C++/Java): Sawzall, Pig, SCOPE, …

model: what computations cannot be expressed?

further readings