hw2 out (step 1 readdir)
midterm: next Friday; review next Thursday
next reading: The Night Watch
inheritance & virtual functions
MapReduce
review CSE 143
“is-a” relationship between classes
code reuse & extensibility
class A {
public:
virtual void foo() {
std::cout << "A" << std::endl;
}
};
class B : public A {
public:
virtual void foo() override {
std::cout << "B" << std::endl;
}
};
virtual
override controls: override
& final
...
int main() {
A a;
B b;
A *p = &b;
A &r = b;
a.foo();
b.foo();
p->foo();
r.foo();
}
q: what’s the output?
q: what if we remove all the virtual
s (and override
)?
large-scale distributed data processing at Google
inspired by functional programming
OSDI 2004, by Jeff Dean and Sanjay Ghemawat
other Google systems: GFS, BigTable, Spanner, …
influence: Hadoop, Dryad, Spark, …
(Feb 26, CSE distinguished lecture series, Jeff Dean)
split input -> (k1, v1)’s
map(k1, v1) → list (k2, v2)
shuffle map output to reduce
reduce (k2, list (v2)) → list (v3)
collect reduce output
map(name, doc):
for word in doc:
emitIntermediate(word, "1")
reduce(word, counts):
result = 0
for c in counts:
result += int(c)
emit(word, str(result))
programmers extend Mapper and Reducer classes
implement the two virtual functions
the MapReduce library invoke the two functions
You wake up, segfaults
Post up, segfaults
Ride round in it, segfaults
Flossin on that, segfaults
M = 4 mappers, R = 2 reducers
partitioning function: strlen(w) mod 2
user program splits input into M pieces
the master picks M mappers and R reducers
each mapper partitions output to R pieces & notifies the master
the master notifies reducers to read data from mappers
each reducer notifies the master when done
the master notifies user program
number of jobs: M + R
input/output to GFS
disk I/O
network: M mappers to R reduces
stragglers: replicate - run backup tasks (Figure 3)
see also: electricity & cooling; Facebook data center in Sweden
commodity machines
assumption: workers are fail-stop (may reboot)
no malicious machines
what happens if mapper/reducer/master/network fails?
earthquake (all machines in one datacenter down)?
rerun: TCP, map-reduce
replicate: GFS
why okay to rerun: functional - the same output!
language (other than C++/Java): Sawzall, Pig, SCOPE, …
model: what computations cannot be expressed?