Lecture 2: MapReduce

1. Workload
   The workload question came up. This is among the most difficult
   material in CS -- the core algorithm we're going to have you implement
   is probably the hardest-to-understand algorithm in widespread use. As
   with other courses, there's a difference between using distributed
   systems (easy) and building distributed systems (hard).
   I hope the workload will be 10-15% less than fall quarter 451, which
   was 10-15% less than last spring's 451. For the masters students it's
   hard to tell -- easier since there's less coordination, harder because
   there's less of a safety net.
   The key to surviving: we're going to provide a lot of help. But you
   need to think and plan very carefully before writing code. And there
   are outs: skip an assignment if you have to. Even skip two -- it will
   have only a minimal effect on your grade. No exams; there are a few
   problem sets, but they will be short.

2. Common knowledge vs. knowledge
   Distributed systems are conceptually hard. An example: a group of
   people who can all see each other. Two have green dots. If you are one
   of them, can you tell whether you have a green dot? (No mirrors.)
   Obviously not! Now add knowledge everyone already has: someone has a
   green dot. Yet with that, one of the people with a green dot can now
   declare: yes. How?

3. Failures are endemic
   Initially, planes were designed with one engine -- which kind of made
   the whole endeavor risky! So: two engines. But on-time arrivals got
   much worse with two engines than with one. Why? (Twice as many engines
   means twice the chance that some engine fails and grounds the flight.)
   A typical data center has several hundred thousand servers, millions
   of disks, and tens of thousands of network switches. Many will be
   broken, or working badly, at any given point in time. We need to build
   systems that don't just avoid doing the wrong thing under a failure,
   but continue to operate even while partially broken.
   FB data center in Oregon: ~$1B, the size of a football stadium. No AC,
   just evaporative water cooling (bad because they are in a desert --
   water is scarce! -- but good because they are in a desert -- dry air
   makes evaporation effective!).
   Inside, it's kind of like being in a wind tunnel. In winter the air
   circulates, so no heating is needed. In summer it exits. Draw a
   picture of the airflow; hot and cold aisles.

4. Labs
   focus: fault tolerance and consistency -- central to distributed
   systems
   lab 1: MapReduce
   labs 2/3/4: storage servers
     progressively more sophisticated (tolerate more kinds of faults)
     progressively harder too!
     patterned after real systems, e.g. NoSQL storage systems
     lab 4 has the core of a real-world design for 1000s of servers
   what you'll learn from the labs
     it's easy to listen to a lecture or read a paper and think you
     understand it; building forces you to really understand
       "I hear and I forget, I see and I remember, I do and I understand"
       (attributed to Confucius)
     you'll have to do some design yourself
       we supply a skeleton, requirements, and tests
       but we leave you substantial scope to solve problems your own way
     you'll get experience debugging distributed systems
       test cases simulate failure scenarios
       tricky to debug: concurrency and failures
         many clients and servers operating in parallel
         test cases make servers fail at the "most" inopportune times
       [NOTE: not the same thing as an exhaustive test suite! What would
       that be? Mention Verdi -- an ongoing project to prove the
       correctness of core distributed-systems software.]
     think first before starting to code!
       otherwise your solution will be a mess, and/or it will take you a
       lot of time
     code review
       learn from others; judge other solutions
   we've tried to ensure that the hard problems have to do w/ distributed
   systems, not e.g. fighting the language, libraries, &c
     thus Go (type-safe, garbage-collected, slick RPC library)
     thus fairly simple services (MapReduce, key/value store)

5. Lab 1: MapReduce
   A programming model that lets most people using the data center avoid
   thinking about failures and distribution.
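To make the model concrete, here is a hedged Go sketch of the two functions a user writes for word count, plus the hash-partition step. The `KeyValue` type, the `ihash` helper, and the single-split driver in `main` are illustrative assumptions modeled on common MapReduce implementations, not the lab's exact API.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
	"unicode"
)

// KeyValue is an illustrative intermediate-pair type, not the lab's exact one.
type KeyValue struct {
	Key, Value string
}

// Map for word count: (document name, contents) -> list of (word, "1") pairs.
func Map(document, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := []KeyValue{}
	for _, w := range words {
		kvs = append(kvs, KeyValue{w, "1"})
	}
	return kvs
}

// Reduce for word count: (word, list of "1"s) -> count, as a string.
func Reduce(key string, values []string) string {
	return fmt.Sprintf("%d", len(values))
}

// ihash picks the reduce partition for a key (the "group by hash(key)" step).
func ihash(key string, R int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(R))
}

func main() {
	const R = 4
	// One "split"; a real run maps over M splits in parallel on many workers.
	kvs := Map("doc1", "the quick fox and the lazy dog and the end")
	// Group by key within each of the R reduce partitions.
	partitions := make([]map[string][]string, R)
	for i := range partitions {
		partitions[i] = map[string][]string{}
	}
	for _, kv := range kvs {
		p := partitions[ihash(kv.Key, R)]
		p[kv.Key] = append(p[kv.Key], kv.Value)
	}
	// Each reducer runs Reduce on every key in its partition.
	counts := map[string]string{}
	for _, p := range partitions {
		for k, vs := range p {
			counts[k] = Reduce(k, vs)
		}
	}
	fmt.Println(counts["the"]) // prints 3
}
```

For grep instead of word count, Map would emit (line number, line) only for matching lines and Reduce would be the identity.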
   why MapReduce for lab 1?
     helps you get up to speed on Go and distributed programming
     first exposure to some fault tolerance
       motivation for better fault tolerance in later labs
     motivating app for many papers
     popular distributed programming framework, with many descendant
     frameworks

   Computational model (aimed at document processing):
     map    (k1, v1)       -> list(k2, v2)
     reduce (k2, list(v2)) -> list(v2)
     split the document into a set of (k1, v1) pairs
     run Map(k1, v1) on each element of each split -> set of (k2, v2)
     pairs
     coalesce the results from all splits into one list per key
     run Reduce(k2, list(v2)) -> list(v2) for each key
     merge the results
     the user writes a Map function and a Reduce function; the framework
     takes care of parallelism, distribution, and fault tolerance
     what computations are not targeted? anything that updates a document

   Draw picture:
     input: partitioned into M (nearly equal, but not necessarily equal)
     chunks
     1. run the mapper on each record in each chunk
     2. this produces an intermediate list of key/value pairs
     3. group by hash(key) (on each mapper) -> R chunks on each mapper
     4. each of the R reducers reads the files intended for it
     5. output a list of values

   Example: grep (find which lines contain a pattern of text)
     master splits the file into M almost-equal chunks at line boundaries
     and calls Map on each partition
     map phase
       for each partition, call Map on each line of text
         search the line for the word
         output (line number, line of text) if the word shows up, nil if
         not
       partition results among the R reducers
         Map writes each output record into a file, hashed on key
     reduce phase
       each Reduce job collects 1/R of the output of each Map job
         (all Map jobs must have completed!)
       reduce(k1, v1) -> v2
         here the identity function: v1 in, v1 out
     merge phase: master merges the R outputs

   Performance
     number of tasks: M map + R reduce; typically M, R > N (the number of
     servers)
     how many intermediate files? M x R, since each map task writes one
     file per reducer
       example: M = 200K, R = 5K, N = 2K -> 10^9 intermediate files

   Questions:
     What exactly would happen if one block of one hard drive got erased
     during a map/reduce computation? What parts of the system would fix
     the error (if any), and what parts would be oblivious (if any)?
     How do "stragglers" impact performance?
     How much speedup do we get on N machines?
       ideal: Nx
       bottlenecks:
         stragglers
         network traffic to collect a Reduce partition
         network traffic to interact with the file system
         disk I/O
     Why a roughly fixed size (64MB) for the initial split files?

   Fault tolerance model
     the master is not fault tolerant
       assumption: this single machine won't fail while running a
       MapReduce app
     but there are many workers, so we have to handle their failures
       assumption: workers are fail-stop
         they fail and stop (e.g., they don't send garbled packets after
         a failure)
         they may reboot

   What kinds of faults might we want to tolerate?
     network:
       lost packets
       duplicated packets
       temporary network failure
       server disconnected
       network partitioned
     server:
       server crash+restart (master versus worker?)
       server fails permanently (master versus worker?)
       all servers fail simultaneously -- power/earthquake
       bad case: crash mid-way through a complex operation
         what happens on a failure in the middle of Map or Reduce?
     bugs -- but not in this course
       what happens when there's a bug in Map or Reduce?
       the same bug in Map over and over? management software kills the
       app
     malice -- but not in this course

   Tools for dealing with faults?
     retry -- e.g. if a packet is lost, or a server crashes and restarts
       packets (TCP) and MR jobs
       may execute an MR job twice!
     replicate -- e.g. if one server or part of the net has failed
       the next labs
     replace -- for long-term health
       e.g., replace a worker

   Retrying jobs
     network failure: oops, we execute the job twice
       ok for MapReduce, because Map/Reduce produce the same output
         Map and Reduce are "functional" or "deterministic"
       how about intermediate files? atomic rename
     worker failure: it may have executed the job or not
       so we may execute a job more than once!
       but ok for MapReduce, as long as the Map and Reduce functions are
       deterministic
       what would make Map or Reduce non-deterministic?
     is executing a request twice ok in general? no. in fact, often not.
       an unhappy customer if you execute one credit card transaction
       several times
     adding servers
       easy in MapReduce -- just tell the master
       hard in general
         a server may have lost state (and need to fetch new state)
         a server may have rebooted quickly; others may need to recognize
         that in order to bring it up to date
         a server may have a new role after reboot (e.g., no longer the
         primary)
       these harder issues are what you would have to deal with to make
       the master fault tolerant -- the topic of later labs

   lab 1 simplifications:
     no key in Map
     assume a global file system
     no partial failures (files are either completely written or not
     created; if we restart some failed operation, it's ok to write to
     the same filename)

   the lab 1 app (see main/wc.go)
     stubs for Map and Reduce
     you fill them out to implement word count (wc)
     how would you write grep?

   the lab 1 sequential implementation (see mapreduce/mapreduce.go)
     demo: run wc.go
     code walkthrough: start with RunSingle()

   the lab 1 worker (see mapreduce/worker.go)
     the remote procedure calls (RPCs)
       arguments and replies (see mapreduce/common.go)
     server side of RPC
       RPC handlers have a particular signature
         DoJob
         Shutdown
       RunWorker
         rpcs.Register: register named handlers -- so Call() can find
         them
         Listen: create a socket on which to listen for RPC requests
           for a distributed implementation, replace "unix" w. "tcp" and
           replace "me" with a host:port name
         ServeConn: runs in a separate thread (why? to serve RPCs
         concurrently -- an RPC may block)
     client side of RPC
       Register()
       call() (see common.go) makes an RPC
         the lab code dials for each request
           typical code uses one network connection for several
           requests, but real code must be prepared to redial anyway:
           a network connection failure doesn't imply a server failure!
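Both sides of the RPC machinery above can be sketched with Go's standard net/rpc package. `DoJobArgs`, `DoJobReply`, and the `Worker` type stand in for the lab's real structs in mapreduce/common.go; this sketch dials TCP on localhost rather than the lab's unix sockets.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Illustrative stand-ins for the lab's RPC argument and reply structs.
type DoJobArgs struct {
	File   string
	JobNum int
}

type DoJobReply struct {
	OK bool
}

type Worker struct{}

// RPC handlers must have this shape: an exported method on a registered
// type, taking pointers to args and reply, and returning error.
func (w *Worker) DoJob(args *DoJobArgs, reply *DoJobReply) error {
	reply.OK = true
	return nil
}

func main() {
	// Server side: register named handlers, create a listening socket,
	// and serve each connection in its own goroutine so one blocked RPC
	// can't stall the others.
	rpcs := rpc.NewServer()
	rpcs.Register(new(Worker))
	l, err := net.Listen("tcp", "127.0.0.1:0") // the lab listens on "unix" sockets
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			conn, err := l.Accept()
			if err != nil {
				return
			}
			go rpcs.ServeConn(conn)
		}
	}()

	// Client side: dial for this request, make the call, check the error.
	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()
	var reply DoJobReply
	err = client.Call("Worker.DoJob", &DoJobArgs{File: "split-0", JobNum: 0}, &reply)
	if err != nil {
		// NB: a connection error does not prove the server has failed!
		panic(err)
	}
	fmt.Println(reply.OK) // prints true
}
```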
         we also dial per request to introduce failure scenarios easily
           intermittent network failures
           losing just the reply, but not the request

   the lab 1 master (see mapreduce/master.go)
     you write it
     you will have to deal with distributing jobs
     you will have to deal with worker failures

---
DeWitt's criticisms of MapReduce:
  "A giant step backward in the programming paradigm for large-scale
  data-intensive applications
  A sub-optimal implementation, in that it uses brute force instead of
  indexing
  Not novel at all -- it represents a specific implementation of
  well-known techniques developed nearly 25 years ago
  Missing most of the features that are routinely included in current
  DBMS
  Incompatible with all of the tools DBMS users have come to depend on"