Today: RPC semantics
Wednesday: how do we reason about the order of events in a distributed system?
Friday: how do we coordinate two servers so that they have better availability
  than a single server? (also needed for lab 2)

---

1. Remote Procedure Call (RPC)

  a. To recap: a key piece of distributed-systems machinery; we use it throughout the labs
     goal: easy-to-program network communication
       hides most details of client/server communication
       a client call is much like an ordinary procedure call
       server handlers are much like ordinary procedures
     RPC is widely used!

  b. RPC ideally makes network communication look just like a function call:
     Client:
       z = fn(x, y)
     Server:
       fn(x, y) {
         compute
         return z
       }
     RPC aims for this level of transparency

  c. RPC message diagram:
       Client             Server
         request--->
            <---response

  d. Software structure
       client app        handlers
         stubs          dispatcher
        RPC lib          RPC lib
          net ------------ net

  e. Examples from lab 1:
     DoJob
     Shutdown
     Register

  f. A few details:
     Marshalling: format data into packets
       tricky for arrays, pointers, objects, &c
       some things you cannot pass: e.g., channels
       Go's RPC library is pretty powerful!
     Binding: how does the client know who to talk to?
       typically, a registry -- e.g. DNS
     Threads:
       the client often has many threads, so > 1 call may be outstanding; match up replies
       handlers may be slow, so the server often runs each call in its own thread

  g. RPC vs. procedure call: what's equivalent?
     parameters        -- request message
     result            -- reply message
     name of procedure -- request message, usually a code
     return address    -- request/reply carry a message ID so the reply can be
                          matched to the waiting caller thread

2. RPC problem: what to do about failures?
     e.g. lost packets, broken network, crashed servers
   What does a failure look like to the RPC library?
     it never sees a response from the server
     it does *not* know whether the server saw the request!
       maybe the server/net failed just before sending the reply
   Three options, in rough order of complexity:
     at least once: the RPC library keeps retrying until it succeeds
     at most once:  the RPC library ensures the RPC is performed zero or one times
     exactly once:  the RPC library ensures the RPC is performed exactly one time

  a. Simplest scheme: "at least once" behavior
     the RPC library waits for a response for a while
     if none arrives, re-send the request
     do this a few times
     still no response -- return an error to the application

     Q: is "at least once" easy for applications to cope with?
     Simple problem (non-replicated key/value server):
       client sends Put(a)
       server gets the request, but the network drops the reply
       client sends Put(a) again
       should the server respond "yes"? or "no"?
       what if it isn't a Put but "deduct $10 from bank account"?
     Harder problem:
       Put(a) -- but the network delays the packet
       Put(a) -- re-sent; the response arrives
       now the network delivers the delayed Put(a) !!!
     Is at-least-once ever OK?
       yes: if there are no side effects -- read-only operations (or idempotent ops)
       yes: if the application has its own plan for detecting duplicates

  b. "at most once"
     client includes a unique ID (UID) with each request
       uses the same UID for re-sends
     server RPC code detects duplicate requests
       returns the previous reply instead of re-running the handler
     server:
       if seen[uid]:
         r = old[uid]
       else:
         r = handler()
         old[uid] = r
         seen[uid] = true
     some at-most-once complexities:
       how to ensure the UID is unique?
         big random number?
         combine a unique client ID (IP address?) with a sequence #?
       the server must eventually discard info about old RPCs
         when is discard safe?
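     (Aside: here's a minimal Go sketch of the seen/old tables above; the names
     and types are made up for illustration, not any particular RPC library's
     API. The discard question is picked up again just below.)

       package main

       import (
           "fmt"
           "sync"
       )

       // handler stands in for the real RPC handler (hypothetical).
       func handler(req string) string { return "reply for " + req }

       // DedupServer caches the reply for each UID; "seen" and "old" are
       // folded into one map via the comma-ok lookup.
       type DedupServer struct {
           mu  sync.Mutex
           old map[int64]string // uid -> previously computed reply
       }

       func (s *DedupServer) Handle(uid int64, req string) string {
           s.mu.Lock()
           defer s.mu.Unlock()
           if r, seen := s.old[uid]; seen {
               return r // duplicate: return the old reply, don't re-run the handler
           }
           // Holding the lock across the handler also (crudely) blocks a
           // duplicate that arrives while the original is still executing.
           r := handler(req)
           s.old[uid] = r
           return r
       }

       func main() {
           s := &DedupServer{old: map[int64]string{}}
           fmt.Println(s.Handle(1, "Put(a)")) // runs the handler
           fmt.Println(s.Handle(1, "Put(a)")) // duplicate UID: returns cached reply
       }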
         idea: unique client IDs
           per-client RPC sequence numbers
           client includes "seen all replies <= X" with every RPC
           much like TCP sequence #s and acks
         or only allow a client one outstanding RPC at a time
           arrival of seq+1 lets the server discard all <= seq
         or client agrees to keep retrying for < 5 minutes
           server discards after 5+ minutes
       how to handle a duplicate request while the original is still executing?
         server doesn't know the reply yet; doesn't want to run the handler twice
         idea: "pending" flag per executing RPC; wait, or ignore the duplicate

     What if an at-most-once server crashes?
       if the at-most-once duplicate info is only in memory, the server will
         forget it and accept duplicate requests after a reboot
       maybe it should write the duplicate info to disk?
       maybe a replica server should also replicate the duplicate info?

     What about "exactly once"?
       at-most-once plus unbounded retries
       exactly once despite client/server failures is equivalent to two-phase commit

     Go RPC is "at-most-once" (and "usually once")
       open a TCP connection
       write the request to the TCP connection
       TCP may retransmit, but the server's TCP will filter out duplicates
       no retry in the Go RPC code (i.e. it will NOT create a 2nd TCP connection)
       Go RPC code returns an error if it doesn't get a reply
         perhaps after a timeout (from TCP)
         perhaps the server didn't see the request
         perhaps the server processed the request but the server/net failed
           before the reply came back

     Go RPC's at-most-once isn't enough
       what if the RPC is sent over TCP, but the reply never arrives and the socket fails?
         do we know whether the RPC was done or not done?
         did the server crash before, after, or during the RPC? etc.
       if a worker doesn't respond, the master re-sends the job to another worker
         but the original worker may not have failed, and may still be working on it too
       Go RPC can't detect this kind of duplicate
       in lab 2 you will have to protect against these kinds of duplicates

     RPC concurrency
       threads are a fundamental server structuring tool
         you'll use them a lot in the labs
         they go well together with RPC
       Go RPC allows multiple clients to call RPCs concurrently
       need to add synchronization (locks, channels) to prevent concurrent
         access to shared data
       lock granularity:
         coarse-grained -> little concurrency/parallelism
         fine-grained   -> lots of concurrency, but races and deadlocks

3. Three RPC calls in the MapReduce Go code
     Register: from worker -> master, to say the worker is ready for work
     DoJob:    from master -> worker, to say here is work (either Map or Reduce)
     Shutdown: from master -> worker, to say we're done
   By convention in all of the labs, RPCs have two arguments,
     arg *FuncArgs, reply *FuncReply
   and return error.

   When it starts up, the master initializes itself and then creates the workers;
   but how does it know when the workers are ready to receive work?
   Within a single Go program, it could just create a channel and put work into it.
   But since this is a distributed system, the master needs to open a socket to
   the worker -- which only works once the worker is listening on that socket.
   So: the workers initialize, then register with the master, saying it's OK to
   send them work.

   // called from worker.go -- ready to accept work
   // defined in mapreduce.go -- tell the master I'm ready to accept work
   // args are defined in common.go
   func (mr *MapReduce) Register(args *RegisterArgs, res *RegisterReply) error

   // also need an RPC to do work
   // called from the master
   // implemented in the worker
   func (wk *Worker) DoJob(arg *DoJobArgs, res *DoJobReply) error
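   A minimal sketch of how the master's Register handler might hand worker
   addresses to the scheduling code over a channel (the registerChannel field
   and the struct fields are assumptions for illustration; check the lab's
   actual mapreduce.go and common.go):

   type RegisterArgs struct{ Worker string } // worker's network address
   type RegisterReply struct{ OK bool }

   // Hypothetical master state; the real MapReduce struct may differ.
   type MapReduce struct {
       registerChannel chan string // idle worker addresses, read by the scheduler
   }

   // Register: worker -> master RPC, "I'm listening and ready for work".
   func (mr *MapReduce) Register(args *RegisterArgs, res *RegisterReply) error {
       mr.registerChannel <- args.Worker // scheduler picks the worker up from here
       res.OK = true
       return nil
   }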
   RPCs can fail -- e.g., communication failure, worker failure, etc.
   If so, the caller will get back an error.

   // What are the semantics of the return values from RPC?
   ok := call(worker, "Worker.DoJob", args, &reply)
   // if ok, the call was performed
   // if !ok, was the call performed?
   // For lab 1, failures are fail-stop: if ok == false, the worker did not and
   // will never touch the intermediate files that it was asked to create.
   // We'll remove this assumption in lab 2.

   // There's also an RPC to tell a worker to shut down
   func (wk *Worker) Shutdown(args *ShutdownArgs, res *ShutdownReply) error

   Finally, RPCs are blocking, so for the master to do more than one MapReduce
   task at the same time, it needs multiple RPCs to be outstanding. Hence you'll
   need to create a thread per worker or a thread per RPC, and you'll need to
   keep track of which tasks are still to be done and which are complete.
   Only start Reduce tasks once all Map tasks are done.

  c. part iii is to handle worker failures
     I ended up implementing part ii and part iii at the same time.
     If you hand a Map task to a worker and the RPC fails, then you need to hand
     it to a different worker -- e.g., put it back on the task list.

  d. RPC concurrency
     This will be more relevant in lab 2, but it's worth mentioning here.
     At the server, RPCs from different clients will run concurrently; RPCs from
     the same client run sequentially. Here we have one master, so it's not an
     issue. But an implication is that if the implementation of an RPC accesses
     shared state, you'll need locks or channels to provide mutual exclusion.

4. MapReduce: we had gotten most of the way through the discussion, but I did
   want to say a few more words.

   Example: grep (look for which lines contain a pattern of text)
     master splits the file into M roughly equal chunks at line boundaries
     calls Map on each partition
   map phase
     for each partition, call map on each line of text
       search the line for the word
       output (line number, line of text) if the word shows up, nil if not
     partition results among R reducers
       map writes each output record into a file, partitioned on line #
   reduce phase
     each Reduce job collects 1/R of the output from each Map job
       (all map jobs have completed by then!)
     reduce(k1, v1) -> v2
       here the identity function: v1 in, v1 out
   merge phase: the master merges the R outputs

   Questions:
   0. Number of RPCs: M+R jobs
   1. How many intermediate files? Example: M = 200K, R = 5K, N = 2K machines
   2. How do "stragglers" impact performance?
      Assume M, R > N servers.
      Even so, the reduce phase can't start until the last map task finishes;
      MapReduce as a whole can't finish until the last reduce job finishes.
      Solution in the paper? Retry any task that doesn't complete quickly.
   3. Performance: how much speedup do we get on N machines?
      ideal: N
      why less than ideal?
        stragglers
        network traffic to collect a Reduce partition
        network traffic to interact with the file system
        disk I/O
   4. Fault tolerance model
      the master is not fault tolerant
        assumption: this single machine won't fail while running a MapReduce app
      but there are many workers, so we have to handle their failures
        assumption: workers are fail-stop
          they fail and stop (e.g., they don't send garbled packets after a failure)
          they may reboot (in lab 1, they fail and don't rejoin)
      What kinds of faults might we want to tolerate?
        network:
          lost packets
          duplicated packets
          temporary network failure
          server disconnected
          network partitioned
        server:
          server crash+restart (master versus worker?)
          server fails permanently (master versus worker?)
          all servers fail simultaneously -- power/earthquake
          bad case: crash mid-way through a complex operation
            what happens if we fail in the middle of a Map or Reduce?
        bugs -- (mostly) not in this course
          what happens when there's a bug in Map or Reduce?
          the same bug shows up in Map over and over?
        management software kills the app
        malice -- (mostly) not in this course
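      (Aside, before the tools list below: a minimal sketch of the "put it back
      on the task list" idea from part iii above -- the master retries a map
      task when the DoJob RPC fails. DoJobArgs, DoJobReply, and call are from
      the lab code quoted earlier; everything else, including the field names,
      is made up for illustration, and the snippet assumes import "sync".)

        // Hand out nMap map tasks; a failed RPC means the task is retried on a
        // different worker. Not the lab's required structure.
        func runMapTasks(nMap int, idleWorkers chan string) {
            var wg sync.WaitGroup
            for i := 0; i < nMap; i++ {
                wg.Add(1)
                go func(task int) {
                    defer wg.Done()
                    for {
                        worker := <-idleWorkers // wait for an idle worker
                        args := &DoJobArgs{Operation: "Map", JobNumber: task} // fields assumed
                        var reply DoJobReply
                        if call(worker, "Worker.DoJob", args, &reply) {
                            // hand the worker back asynchronously so this
                            // goroutine can finish even if nothing is currently
                            // waiting for an idle worker
                            go func() { idleWorkers <- worker }()
                            return
                        }
                        // RPC failed: drop this worker and retry the task
                    }
                }(i)
            }
            wg.Wait() // only start Reduce tasks once all Map tasks are done
        }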
      Tools for dealing with faults?
        retry -- e.g. if a packet is lost, or after a server crash+restart
          used for packets (TCP) and for MR jobs
          may execute an MR job twice!
        replicate -- e.g. if one server or part of the net has failed
          lab 2
        replace -- for long-term health
          e.g., replace a failed worker

      Retrying jobs
        network failure: oops, we execute the job twice
          ok for MapReduce, because Map/Reduce produce the same output each time
            Map and Reduce are "functional" or "deterministic"
          how about the intermediate files? atomic rename
        worker failure: it may have executed the job or not
          so we may execute the job more than once!
          but that's ok for MapReduce, as long as the Map and Reduce functions
            are deterministic
            what would make Map or Reduce not deterministic?
          is executing a request twice ok in general? no; in fact, often not
            an unhappy customer if you execute one credit card transaction several times

      Adding servers
        easy in MapReduce -- just tell the master
        hard in general:
          the server may have lost state (needs to fetch fresh state)
          the server may have rebooted quickly; others may need to recognize
            that, to bring the server up to date
          the server may have a new role after the reboot (e.g., no longer the primary)
        these are the harder issues you would have to deal with to make the
          master fault tolerant
          topic of later labs
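      (A minimal sketch of the "atomic rename" trick for intermediate files: the
      worker writes a Map task's output to a temporary file and renames it into
      place only when it is complete, so a re-executed or half-finished task
      never leaves a partial file visible under the real name. Names here are
      illustrative, not the lab's code.)

        import (
            "os"
            "path/filepath"
        )

        // writeFileAtomically writes data under name without ever exposing a
        // partial file: readers see either the old contents or the new ones.
        func writeFileAtomically(name string, data []byte) error {
            tmp, err := os.CreateTemp(filepath.Dir(name), "tmp-")
            if err != nil {
                return err
            }
            if _, err := tmp.Write(data); err != nil {
                tmp.Close()
                os.Remove(tmp.Name())
                return err
            }
            if err := tmp.Close(); err != nil {
                os.Remove(tmp.Name())
                return err
            }
            // rename within one file system is atomic: the complete file
            // appears under its final name all at once
            return os.Rename(tmp.Name(), name)
        }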