Today: RPC semantics
Wednesday: how do we reason about the order of events in a distributed system?
Friday: how do we coordinate two servers so that they have better availability
  than a single server? (also needed for lab 2)

---

1. Remote Procedure Call (RPC)

  a. To recap: a key piece of distributed-systems machinery; we use it throughout the labs
     goal: easy-to-program network communication
       hides most details of client/server communication
       a client call is much like an ordinary procedure call
       server handlers are much like ordinary procedures
     RPC is widely used!

  b. RPC ideally makes network communication look just like a function call:
     Client:
       z = fn(x, y)
     Server:
       fn(x, y) {
         compute
         return z
       }
     RPC aims for this level of transparency

  c. RPC message diagram:
       Client             Server
         request--->
            <---response

  d. Software structure
       client app        handlers
         stubs          dispatcher
        RPC lib          RPC lib
          net ------------ net

  e. Examples from lab 1:
     DoJob
     Shutdown
     Register

  f. A few details:
     Marshalling: format data into packets
       tricky for arrays, pointers, objects, &c
       some things you cannot pass: e.g., channels
       Go's RPC library is pretty powerful!
     Binding: how does the client know who to talk to?
       typically, a registry -- e.g. DNS
     Threads:
       the client often has many threads, so > 1 call may be outstanding; match up replies
       handlers may be slow, so the server often runs each call in its own thread

  g. RPC vs. procedure call: what's equivalent?
     parameters        -- request message
     result            -- reply message
     name of procedure -- request message, usually a code
     return address    -- request/reply carry a message ID so the reply can be
                          matched to the waiting caller thread

2. RPC problem: what to do about failures?
     e.g. lost packets, broken network, crashed servers
   What does a failure look like to the RPC library?
     it never sees a response from the server
     it does *not* know whether the server saw the request!
       maybe the server/net failed just before sending the reply
   Three options, in rough order of complexity:
     at least once: the RPC library keeps retrying until it succeeds
     at most once:  the RPC library ensures the RPC is performed zero or one times
     exactly once:  the RPC library ensures the RPC is performed exactly one time

  a. Simplest scheme: "at least once" behavior
     the RPC library waits for a response for a while
     if none arrives, re-send the request
     do this a few times
     still no response -- return an error to the application

     Q: is "at least once" easy for applications to cope with?
     Simple problem (non-replicated key/value server):
       client sends Put(a)
       server gets the request, but the network drops the reply
       client sends Put(a) again
       should the server respond "yes"? or "no"?
       what if it isn't a Put but "deduct $10 from bank account"?
     Harder problem:
       Put(a) -- but the network delays the packet
       Put(a) -- re-sent; the response arrives
       now the network delivers the delayed Put(a) !!!
     Is at-least-once ever OK?
       yes: if there are no side effects -- read-only operations (or idempotent ops)
       yes: if the application has its own plan for detecting duplicates

  b. "at most once"
     client includes a unique ID (UID) with each request
       uses the same UID for re-sends
     server RPC code detects duplicate requests
       returns the previous reply instead of re-running the handler
     server:
       if seen[uid]:
         r = old[uid]
       else:
         r = handler()
         old[uid] = r
         seen[uid] = true
     some at-most-once complexities:
       how to ensure the UID is unique?
         big random number?
         combine a unique client ID (IP address?) with a sequence #?
       the server must eventually discard info about old RPCs
         when is discard safe?
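     (Aside: here's a minimal Go sketch of the seen/old tables above; the names
     and types are made up for illustration, not any particular RPC library's
     API. The discard question is picked up again just below.)

       package main

       import (
           "fmt"
           "sync"
       )

       // handler stands in for the real RPC handler (hypothetical).
       func handler(req string) string { return "reply for " + req }

       // DedupServer caches the reply for each UID; "seen" and "old" are
       // folded into one map via the comma-ok lookup.
       type DedupServer struct {
           mu  sync.Mutex
           old map[int64]string // uid -> previously computed reply
       }

       func (s *DedupServer) Handle(uid int64, req string) string {
           s.mu.Lock()
           defer s.mu.Unlock()
           if r, seen := s.old[uid]; seen {
               return r // duplicate: return the old reply, don't re-run the handler
           }
           // Holding the lock across the handler also (crudely) blocks a
           // duplicate that arrives while the original is still executing.
           r := handler(req)
           s.old[uid] = r
           return r
       }

       func main() {
           s := &DedupServer{old: map[int64]string{}}
           fmt.Println(s.Handle(1, "Put(a)")) // runs the handler
           fmt.Println(s.Handle(1, "Put(a)")) // duplicate UID: returns cached reply
       }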
         idea: unique client IDs
           per-client RPC sequence numbers
           client includes "seen all replies <= X" with every RPC
           much like TCP sequence #s and acks
         or only allow a client one outstanding RPC at a time
           arrival of seq+1 lets the server discard all <= seq
         or client agrees to keep retrying for < 5 minutes
           server discards after 5+ minutes
       how to handle a duplicate request while the original is still executing?
         server doesn't know the reply yet; doesn't want to run the handler twice
         idea: "pending" flag per executing RPC; wait, or ignore the duplicate

     What if an at-most-once server crashes?
       if the at-most-once duplicate info is only in memory, the server will
         forget it and accept duplicate requests after a reboot
       maybe it should write the duplicate info to disk?
       maybe a replica server should also replicate the duplicate info?

     What about "exactly once"?
       at-most-once plus unbounded retries
       exactly once despite client/server failures is equivalent to two-phase commit

     Go RPC is "at-most-once" (and "usually once")
       open a TCP connection
       write the request to the TCP connection
       TCP may retransmit, but the server's TCP will filter out duplicates
       no retry in the Go RPC code (i.e. it will NOT create a 2nd TCP connection)
       Go RPC code returns an error if it doesn't get a reply
         perhaps after a timeout (from TCP)
         perhaps the server didn't see the request
         perhaps the server processed the request but the server/net failed
           before the reply came back

     Go RPC's at-most-once isn't enough
       what if the RPC is sent over TCP, but the reply never arrives and the socket fails?
         do we know whether the RPC was done or not done?
         did the server crash before, after, or during the RPC? etc.
       if a worker doesn't respond, the master re-sends the job to another worker
         but the original worker may not have failed, and may still be working on it too
       Go RPC can't detect this kind of duplicate
       in lab 2 you will have to protect against these kinds of duplicates

     RPC concurrency
       threads are a fundamental server structuring tool
         you'll use them a lot in the labs
         they go well together with RPC
       Go RPC allows multiple clients to call RPCs concurrently
       need to add synchronization (locks, channels) to prevent concurrent
         access to shared data
       lock granularity:
         coarse-grained -> little concurrency/parallelism
         fine-grained   -> lots of concurrency, but races and deadlocks

3. Three RPC calls in the MapReduce Go code
     Register: from worker -> master, to say the worker is ready for work
     DoJob:    from master -> worker, to say here is work (either Map or Reduce)
     Shutdown: from master -> worker, to say we're done
   By convention in all of the labs, RPCs have two arguments,
     arg *FuncArgs, reply *FuncReply
   and return error.

   When it starts up, the master initializes itself and then creates the workers;
   but how does it know when the workers are ready to receive work?
   Within a single Go program, it could just create a channel and put work into it.
   But since this is a distributed system, the master needs to open a socket to
   the worker -- which only works once the worker is listening on that socket.
   So: the workers initialize, then register with the master, saying it's OK to
   send them work.

   // called from worker.go -- ready to accept work
   // defined in mapreduce.go -- tell the master I'm ready to accept work
   // args are defined in common.go
   func (mr *MapReduce) Register(args *RegisterArgs, res *RegisterReply) error

   // also need an RPC to do work
   // called from the master
   // implemented in the worker
   func (wk *Worker) DoJob(arg *DoJobArgs, res *DoJobReply) error
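   A minimal sketch of how the master's Register handler might hand worker
   addresses to the scheduling code over a channel (the registerChannel field
   and the struct fields are assumptions for illustration; check the lab's
   actual mapreduce.go and common.go):

   type RegisterArgs struct{ Worker string } // worker's network address
   type RegisterReply struct{ OK bool }

   // Hypothetical master state; the real MapReduce struct may differ.
   type MapReduce struct {
       registerChannel chan string // idle worker addresses, read by the scheduler
   }

   // Register: worker -> master RPC, "I'm listening and ready for work".
   func (mr *MapReduce) Register(args *RegisterArgs, res *RegisterReply) error {
       mr.registerChannel <- args.Worker // scheduler picks the worker up from here
       res.OK = true
       return nil
   }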
   RPCs can fail -- e.g., communication failure, worker failure, etc.
   If so, the caller will get back an error.

   // What are the semantics of the return values from RPC?
   ok := call(worker, "Worker.DoJob", args, &reply)
   // if ok, the call was performed
   // if !ok, was the call performed?
   // For lab 1, failures are fail-stop: if ok == false, the worker did not and
   // will never touch the intermediate files that it was asked to create.
   // We'll remove this assumption in lab 2.

   // There's also an RPC to tell a worker to shut down
   func (wk *Worker) Shutdown(args *ShutdownArgs, res *ShutdownReply) error

   Finally, RPCs are blocking, so for the master to do more than one MapReduce
   task at the same time, it needs multiple RPCs to be outstanding. Hence you'll
   need to create a thread per worker or a thread per RPC, and you'll need to
   keep track of which tasks are still to be done and which are complete.
   Only start Reduce tasks once all Map tasks are done.

  c. part iii is to handle worker failures
     I ended up implementing part ii and part iii at the same time.
     If you hand a Map task to a worker and the RPC fails, then you need to hand
     it to a different worker -- e.g., put it back on the task list.

  d. RPC concurrency
     This will be more relevant in lab 2, but it's worth mentioning here.
     At the server, RPCs from different clients will run concurrently; RPCs from
     the same client run sequentially. Here we have one master, so it's not an
     issue. But an implication is that if the implementation of an RPC accesses
     shared state, you'll need locks or channels to provide mutual exclusion.

4. MapReduce: we had gotten most of the way through the discussion, but I did
   want to say a few more words.

   Example: grep (look for which lines contain a pattern of text)
     master splits the file into M roughly equal chunks at line boundaries
     calls Map on each partition
   map phase
     for each partition, call map on each line of text
       search the line for the word
       output (line number, line of text) if the word shows up, nil if not
     partition results among R reducers
       map writes each output record into a file, partitioned on line #
   reduce phase
     each Reduce job collects 1/R of the output from each Map job
       (all map jobs have completed by then!)
     reduce(k1, v1) -> v2
       here the identity function: v1 in, v1 out
   merge phase: the master merges the R outputs

   Questions:
   0. Number of RPCs: M+R jobs
   1. How many intermediate files? Example: M = 200K, R = 5K, N = 2K machines
   2. How do "stragglers" impact performance?
      Assume M, R > N servers.
      Even so, the reduce phase can't start until the last map task finishes;
      MapReduce as a whole can't finish until the last reduce job finishes.
      Solution in the paper? Retry any task that doesn't complete quickly.
   3. Performance: how much speedup do we get on N machines?
      ideal: N
      why less than ideal?
        stragglers
        network traffic to collect a Reduce partition
        network traffic to interact with the file system
        disk I/O
   4. Fault tolerance model
      the master is not fault tolerant
        assumption: this single machine won't fail while running a MapReduce app
      but there are many workers, so we have to handle their failures
        assumption: workers are fail-stop
          they fail and stop (e.g., they don't send garbled packets after a failure)
          they may reboot (in lab 1, they fail and don't rejoin)
      What kinds of faults might we want to tolerate?
        network:
          lost packets
          duplicated packets
          temporary network failure
          server disconnected
          network partitioned
        server:
          server crash+restart (master versus worker?)
          server fails permanently (master versus worker?)
          all servers fail simultaneously -- power/earthquake
          bad case: crash mid-way through a complex operation
            what happens if we fail in the middle of a Map or Reduce?
        bugs -- (mostly) not in this course
          what happens when there's a bug in Map or Reduce?
          the same bug shows up in Map over and over?
        management software kills the app
        malice -- (mostly) not in this course
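      (Aside, before the tools list below: a minimal sketch of the "put it back
      on the task list" idea from part iii above -- the master retries a map
      task when the DoJob RPC fails. DoJobArgs, DoJobReply, and call are from
      the lab code quoted earlier; everything else, including the field names,
      is made up for illustration, and the snippet assumes import "sync".)

        // Hand out nMap map tasks; a failed RPC means the task is retried on a
        // different worker. Not the lab's required structure.
        func runMapTasks(nMap int, idleWorkers chan string) {
            var wg sync.WaitGroup
            for i := 0; i < nMap; i++ {
                wg.Add(1)
                go func(task int) {
                    defer wg.Done()
                    for {
                        worker := <-idleWorkers // wait for an idle worker
                        args := &DoJobArgs{Operation: "Map", JobNumber: task} // fields assumed
                        var reply DoJobReply
                        if call(worker, "Worker.DoJob", args, &reply) {
                            // hand the worker back asynchronously so this
                            // goroutine can finish even if nothing is currently
                            // waiting for an idle worker
                            go func() { idleWorkers <- worker }()
                            return
                        }
                        // RPC failed: drop this worker and retry the task
                    }
                }(i)
            }
            wg.Wait() // only start Reduce tasks once all Map tasks are done
        }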
      Tools for dealing with faults?
        retry -- e.g. if a packet is lost, or after a server crash+restart
          used for packets (TCP) and for MR jobs
          may execute an MR job twice!
        replicate -- e.g. if one server or part of the net has failed
          lab 2
        replace -- for long-term health
          e.g., replace a failed worker

      Retrying jobs
        network failure: oops, we execute the job twice
          ok for MapReduce, because Map/Reduce produce the same output each time
            Map and Reduce are "functional" or "deterministic"
          how about the intermediate files? atomic rename
        worker failure: it may have executed the job or not
          so we may execute the job more than once!
          but that's ok for MapReduce, as long as the Map and Reduce functions
            are deterministic
            what would make Map or Reduce not deterministic?
          is executing a request twice ok in general? no; in fact, often not
            an unhappy customer if you execute one credit card transaction several times

      Adding servers
        easy in MapReduce -- just tell the master
        hard in general:
          the server may have lost state (needs to fetch fresh state)
          the server may have rebooted quickly; others may need to recognize
            that, to bring the server up to date
          the server may have a new role after the reboot (e.g., no longer the primary)
        these are the harder issues you would have to deal with to make the
          master fault tolerant
          topic of later labs
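      (A minimal sketch of the "atomic rename" trick for intermediate files: the
      worker writes a Map task's output to a temporary file and renames it into
      place only when it is complete, so a re-executed or half-finished task
      never leaves a partial file visible under the real name. Names here are
      illustrative, not the lab's code.)

        import (
            "os"
            "path/filepath"
        )

        // writeFileAtomically writes data under name without ever exposing a
        // partial file: readers see either the old contents or the new ones.
        func writeFileAtomically(name string, data []byte) error {
            tmp, err := os.CreateTemp(filepath.Dir(name), "tmp-")
            if err != nil {
                return err
            }
            if _, err := tmp.Write(data); err != nil {
                tmp.Close()
                os.Remove(tmp.Name())
                return err
            }
            if err := tmp.Close(); err != nil {
                os.Remove(tmp.Name())
                return err
            }
            // rename within one file system is atomic: the complete file
            // appears under its final name all at once
            return os.Rename(tmp.Name(), name)
        }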