Today: two primary-backup systems
  wrap up Lab 2 discussion: high availability; deals with network partition
  hypervisor-based fault tolerance: deals with non-determinism, but doesn't tolerate network partitions

Lab 2 goals:
  availability and correctness despite server and network failures
  tolerate network problems, including partition:
    either keep going, correctly,
    or suspend operations until the network is repaired
  replacement of failed servers

Lab 2 overview:
  agreement: a "view server" decides who primary (p) and backup (b) are
    clients and servers ask the view server; they don't make independent decisions
    only one VS, which avoids multiple machines independently deciding who is primary
  repair: the view server can co-opt an "idle" server as backup after the old backup becomes primary
    the primary initializes the new backup's state

The view server maintains a sequence of "views"
  it monitors server liveness: each server periodically sends a Ping RPC
    a server is "dead" if it missed N Pings in a row, "live" again after a single Ping
  if the primary is dead: new view with the previous backup as primary
  if the backup is dead, or there is no backup: new view with a previously idle server as backup
  OK to have a view with just a primary and no backup
  the VS ignores server failures if the backup is not up to date:
    the VS does not change views unless the primary has ack'ed the current view,
    and the primary only acks once the backup is up to date

Q: what if the primary fails and there is no backup?
  wait for it to recover

Q: under what circumstances is the system unavailable?
  if the view server fails
  if the network fails
  if the network partitions so that a client can't speak to the primary selected by the VS
  if there is only one server (the primary), and it fails
  if there are two servers, but the primary fails before ack'ing the view change
    [the backup could in fact be up to date, but the VS hasn't been told yet]

Q: can more than one server think it is primary?
  view 1: S1, S2
    net broken, so the view server thinks S1 is dead, but it's actually alive
  view 2: S2, --
  now S1 is alive and not aware of view #2, so S1 still thinks it is primary,
  AND S2 is alive and thinks it is primary
  => split brain, no good
  [In particular, a client C1 that didn't see the new view could contact S1 while another
   client C2 contacts S2; if they updated the same key, that would result in an inconsistency.]

How to ensure only one server acts as primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    view 1: S1, S2
    view 2: S2, --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary, so S2 can reject S1's forwarded ops

The rules (a code sketch follows the example below):
  1. the primary in view i must have been the primary or backup in view i-1
  2. the primary must wait for the backup to accept each request (incl. Get)
  3. the primary must not respond to the client until the backup has acked that it executed the request
  4. a non-backup must reject forwarded requests
  5. a non-primary must reject direct client requests
  6. every operation must be before or after state transfer (see below)

Example:
  view 1: S1, S2
    the view server stops hearing Pings from S1
  view 2: S2, --
    it may be a while before S2 hears about view #2
  before S2 hears about view #2:
    S1 can process ops from clients, and S2 will accept forwarded requests
    S2 will reject ops from clients who have heard about view #2
      [and the rejection triggers an RPC to fetch the new view]
  after S2 hears about view #2:
    if S1 receives a client request, it will forward it, and S2 will reject it,
      so S1 can no longer act as primary
    S1 will send an error to the client; the client will ask the VS for the new view
      and re-send to S2
  the true moment of switch-over occurs when S2 hears about view #2
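
Roughly what rules 2-5 look like in code -- a minimal Go sketch in the spirit of the lab's
key/value server, but every name here (PBServer, PutArgs, ErrWrongServer, call, Forward) is
an invented placeholder, not the lab's actual skeleton:

  package pbservice

  import "sync"

  // Illustrative types only -- names are assumptions, not the lab's interface.
  type View struct {
      Viewnum         uint
      Primary, Backup string
  }
  type PutArgs struct{ Key, Value string }
  type PutReply struct{ Err string }
  type ForwardArgs struct{ Op PutArgs }
  type ForwardReply struct{ Err string }

  type PBServer struct {
      mu    sync.Mutex
      me    string            // my own network name
      view  View              // most recent view learned from the viewservice
      store map[string]string // the key/value data
  }

  // call stands in for an RPC helper: send an RPC to srv, report success.
  func call(srv, rpcname string, args interface{}, reply interface{}) bool {
      // ... a real implementation would dial srv and invoke rpcname ...
      return false
  }

  // Put: rules 2, 3, and 5.
  func (pb *PBServer) Put(args *PutArgs, reply *PutReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      if pb.view.Primary != pb.me { // rule 5: non-primary rejects direct client requests
          reply.Err = "ErrWrongServer"
          return nil
      }
      if pb.view.Backup != "" { // rules 2+3: forward and wait for the backup's ack first
          var fr ForwardReply
          if !call(pb.view.Backup, "PBServer.Forward", &ForwardArgs{Op: *args}, &fr) || fr.Err != "" {
              // backup rejected or unreachable: don't act as primary
              reply.Err = "ErrWrongServer"
              return nil
          }
      }
      pb.store[args.Key] = args.Value
      reply.Err = "OK"
      return nil
  }

  // Forward: rule 4 -- a server that doesn't believe it is the backup rejects
  // forwarded requests; this is what stops a stale primary from acting.
  func (pb *PBServer) Forward(args *ForwardArgs, reply *ForwardReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      if pb.view.Backup != pb.me {
          reply.Err = "ErrWrongServer"
          return nil
      }
      pb.store[args.Op.Key] = args.Op.Value
      return nil
  }

The point of the sketch: Forward is the only path a stale primary has for getting an op
executed, and a server that currently thinks it is primary refuses it (rule 4), which is what
forces S1 in the example above to stop acting as primary.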
How can the new backup get its state? e.g. the key/value state
  if S2 is backup in view i, but was not in view i-1,
    S2 should ask the primary to transfer the complete state
  rule for state transfer:
    every operation (Put/Get) must be either before or after the state transfer
    if before, the transferred state must reflect the operation
    if after, the primary must forward the operation after the transfer finishes

Q: does the primary need to forward Get()s to the backup?
  after all, Get() doesn't change anything, so why does the backup need to know?
  and the extra RPC costs time
A: Remember Get/Put/PutHash must behave as if they occur in some sequential order on a server
   that never fails -- that is, exactly-once semantics.
   Assume the primary doesn't forward a Get. In the example above, with C1 and C2 in contact
   with the old P and the old B/new P, where the old P and C1 haven't seen the new view, reads
   would be returned from the old P -- nothing in a read would force the old P to discover the
   view change. Suppose that concurrently, C2 updates the key at the new P. C1 will read the
   key but not see the new result.
   [For those who took 444, this means the system is not linearizable -- a Get that starts
    after the Put returns should always return the result of the Put. A Get that starts before
    the Put returns is allowed to return the data either before or after the Put.]

Q: Is it ever OK to read only at the primary?
  YES! If the view has no backup, then there can be no new P without notifying the old primary
  first (by rule 1 plus the VS's ack requirement, another server can become primary only via
  later views that the old primary has acked), so there's no problem.
  Puts can also be done at the primary alone if there is no backup -- in fact they must be:
  for the system to be available, you need to be able to do operations with only a primary
  and no backup.

Q: At MIT, this is the example they used to show that you can't read only at the primary.
   But it is actually not a problem! Why?
  C1 sends Get to P, but it is not forwarded to B
  C1 doesn't receive a response
  C2 sends Put, which is processed at P and B
  P fails; B takes over
  C1 resends the Get, and observes the Put, even though it shouldn't

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
  Yes. Many.
    the view service fails
    the primary fails before it acknowledges the view
    the backup fails before it gets the new state
    ....
  We will start fixing those in lab 3

An implementation hint: two ways you can think of implementing the state transfer for Lab 2b.
  1. Maintain the current map; transfer the current map (sketched in code after this list).
     Between operations! All RPCs appear to occur either before or after the transfer. But
     recall that "at most once" means you need to save the response that you previously gave
     to each RPC: if the backup takes over, it needs to act as if it had been the primary.
     There is only one RPC outstanding per client at a time, so the state = the map + the
     result of the last RPC from each client.
  2. All of that state can be derived from the operation log: the sequence of RPCs processed
     at the primary. So a potentially simpler, less efficient solution is just to record the
     operation log and transfer that. The current value of a key is just the last value
     written in the log for that key.
  3. In practice, most systems would do a hybrid: take a checkpoint of the operation log so
     that it doesn't grow without bound, and then transfer the checkpoint plus all operations
     that occurred since the checkpoint.
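
One way to picture hint 1: the state a new backup needs is just the key/value map plus the
last reply sent to each client, and the transfer is a single RPC that installs it between
operations. A minimal Go sketch; the names (LastReply, TransferArgs, PBServer.Transfer) are
assumptions for illustration, not the lab's interface:

  package pbservice

  import "sync"

  // At-most-once bookkeeping: remember the reply last sent to each client, so a
  // backup that takes over can answer a retransmitted RPC the same way.
  type LastReply struct {
      Seq   int64  // sequence number of that client's most recent RPC
      Value string // the reply we gave; returned again if the RPC is retransmitted
  }

  type TransferArgs struct {
      Store   map[string]string    // the current key/value map
      Replies map[string]LastReply // last reply, keyed by client id
  }
  type TransferReply struct{ Err string }

  type PBServer struct {
      mu      sync.Mutex
      store   map[string]string
      replies map[string]LastReply
  }

  // Transfer installs the complete state on a new backup. Per rule 6, the primary
  // sends this between operations: every Put/Get is either already reflected in the
  // transferred state, or forwarded after the transfer completes.
  func (pb *PBServer) Transfer(args *TransferArgs, reply *TransferReply) error {
      pb.mu.Lock()
      defer pb.mu.Unlock()
      pb.store = args.Store
      pb.replies = args.Replies
      reply.Err = "OK"
      return nil
  }

Hint 2 would replace Store/Replies with a slice of logged operations and have the new backup
replay them; the visible state is the same either way.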
Another implementation hint: for Lab 2a, Irene sketched the state transition diagram for the
VS. What is the state transition diagram for Lab 2b? You can describe the state transition
diagram for the servers and clients, but is that enough to convince you that the system as a
whole behaves the way you want it to?

Instead, the behavior of the distributed system is the cross-product of the states of the
individual nodes -- the state at the clients, the state at the servers, the state at the VS.
Transitions are message sends/receives and local events -- e.g., if the state is that the
primary has the old view and the VS has a new view, an event would be the timer expiration
leading (eventually) to a new state where the primary has the new view. Of course many paths
are allowable, depending on which messages are delivered in which order and which nodes take
which actions. We can analyze the graph to determine whether the program has the properties
we want -- e.g., it can always perform a client request; it always responds with the latest
value of a key; etc.

This is where the snapshots paper comes in: a consistent state of a distributed system is one
that is reachable through a set of legal transitions. In any state there are likely many
possible transitions: maybe one client sends a message, maybe another, maybe a server fails.
So if we want to know whether the system is deadlocked, we need to take a consistent snapshot
of the system -- recording only local states, such that if a local state depends on (could
have been caused by) some event on a different processor, the recorded state on that
processor must include that event.
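
To make the cross-product idea concrete, here is a deliberately tiny toy model in Go: the
global state is just (view number at the VS, view number the primary knows about), the
transitions are a timer expiry at the VS or a Ping reply delivering the current view, and we
breadth-first search the reachable graph while checking one property. Everything here is
invented for illustration and is far smaller than the lab's real state space:

  package main

  import "fmt"

  // Global state as a cross-product of per-node states (toy version).
  type State struct {
      VSView      int // view number at the viewservice
      PrimaryView int // view number the primary has heard about
  }

  // next returns every global state reachable in one transition.
  func next(s State, maxView int) []State {
      var out []State
      if s.VSView < maxView {
          // local event: the VS's timer expires and it moves to a new view
          out = append(out, State{s.VSView + 1, s.PrimaryView})
      }
      if s.PrimaryView < s.VSView {
          // message delivery: a Ping reply brings the primary up to date
          out = append(out, State{s.VSView, s.VSView})
      }
      return out
  }

  func main() {
      start := State{1, 1}
      seen := map[State]bool{start: true}
      frontier := []State{start}
      for len(frontier) > 0 {
          s := frontier[0]
          frontier = frontier[1:]
          // a property checked over every reachable state:
          // the primary can lag the VS, but never gets ahead of it
          if s.PrimaryView > s.VSView {
              fmt.Println("property violated in", s)
          }
          for _, t := range next(s, 3) {
              if !seen[t] {
                  seen[t] = true
                  frontier = append(frontier, t)
              }
          }
      }
      fmt.Println("explored", len(seen), "states")
  }

The same shape of analysis -- a much richer State struct and many more transitions -- is what
it takes to argue about the system as a whole rather than about each node separately.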
CASE STUDY: HYPERVISOR-BASED FAULT-TOLERANCE
Bressoud and Schneider (SOSP 1995)
  a detailed description of a primary/backup system
  very general -- not application-specific like the labs

Motivation
  Goal: fault tolerance / availability for any existing s/w,
    by running the same instructions &c on two computers
  Transparent: any app, any O/S
  Would be magic if it worked well!
  Failure model: computers fail due to independent h/w faults

Plan 1: [simple diagram]
  two machines
  identical start state: registers, program, memory contents
  maybe they will stay identical
  if one fails, the other just keeps going; no time/data/work is lost

What will go wrong with Plan 1?
  external outputs will be duplicated
  external inputs must be sent to both
    inputs must arrive at the same time at both
  interrupts must arrive at the same time at both
  CPUs are unlikely to run at exactly the same speeds

The basic plan:
  ignore the backup's I/O instructions
  hide interrupts from the backup
  the primary forwards I/O results and interrupts to the backup
  inject those interrupts into the backup
Q: do we have to be exact about the timing of interrupts?
   can the backup just fake an interrupt when a msg arrives from the primary?

How are they able to control I/O and interrupt timing?
  they slip a virtual machine / hypervisor under the O/S
What is a hypervisor?
  a piece of software that emulates a real machine precisely,
    so an O/S &c can run on the hypervisor as a "guest"
  as if the hypervisor simulated each instruction and kept simulated registers, memory, &c
  BUT as much as possible it runs the guest on the real h/w!
    can be much faster than interpreting instructions
  many fascinating details; see the VMware papers for the x86 story
The hypervisor gives us control
  suppress output from the backup, detect interrupts/input at the primary, &c

Still need a scheme for delivering I/O inputs and interrupts
  from the primary to the backup, at the right times
  but: it is hard to know exactly when an interrupt happened at the primary,
    so that it can be repeated at the same point on the backup
Answer: the HP PA-RISC recovery counter
  it forces an interrupt every N instructions
  this allows the primary and backup hypervisors to get control at exactly the same point
    in the code

Epochs
  the primary alternates between "epoch" and "end of epoch"; the backup does too
  every epoch is the same number of instructions, e.g. 4000 instructions

Primary during epoch E:
  deliver the interrupts buffered during E-1
    interrupts deliver input and acknowledge output
  then continue executing
  I/O instructions start input/output (but it isn't done until the interrupt)
  the hypervisor hides new interrupts; it just buffers them, along with the I/O input

Primary at end of epoch E:
  send the buffered interrupt info, including I/O input, to the backup
  wait for the backup to ACK
  send "done with E" to the backup

Backup during epoch E:
  deliver the interrupts, and I/O input, the primary told us about during E-1
  then continue executing
  I/O instructions do nothing
  ignore interrupts from the backup's own devices

Backup at end of epoch E:
  receive and buffer interrupt/input msgs from the primary
  wait for "done with E+1" from the primary
  start epoch E+1
Note the backup lags the primary by one epoch:
  the backup doesn't start epoch 7 until the primary is done with epoch 7
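
A toy Go simulation of the end-of-epoch handshake described above -- the primary ships the
interrupts it buffered, waits for an ACK, then says "done"; the backup consumes epochs one
behind. All names and the message shape are invented for illustration; the real mechanism
lives inside the hypervisor:

  package main

  import "fmt"

  // epochMsg is the per-epoch bundle the primary ships to the backup.
  type epochMsg struct {
      epoch      int
      interrupts []string // buffered interrupts + I/O input for this epoch
  }

  func main() {
      toBackup := make(chan epochMsg)
      ack := make(chan int)
      done := make(chan int)
      finished := make(chan bool)

      // Backup: replay the primary's epochs, always one behind.
      go func() {
          for msg := range toBackup {
              ack <- msg.epoch // tell the primary we buffered its interrupt info
              <-done           // wait for "done with E" before delivering E's interrupts
              fmt.Printf("backup: epoch %d delivers %v\n", msg.epoch, msg.interrupts)
          }
          finished <- true
      }()

      // Primary: run three epochs, buffering "interrupts" during each.
      for e := 1; e <= 3; e++ {
          buffered := []string{fmt.Sprintf("disk-intr@%d", e)}
          fmt.Printf("primary: finished epoch %d\n", e)
          toBackup <- epochMsg{e, buffered} // send buffered interrupt info + input
          <-ack                             // wait for the backup to ACK
          done <- e                         // "done with E"
      }
      close(toBackup)
      <-finished
  }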
What if the primary fails during epoch E?
  the backup finishes E-1
  the backup times out waiting for "done with E"
  Can the backup just switch to being primary for epoch E?
    No: the primary might have crashed in the middle of writing device registers,
    and it may not be correct for the backup to repeat I/O register writes
    Example: maybe for a disk read the driver is required to write the disk controller
      registers in a specific order, e.g. track #, sector #, length, DMA address
      the primary did *some* of those writes before crashing
      the backup might then write the track # when the h/w is expecting the sector #
  Good news: many h/w devices can be reset to a known state
    and I/O operations restarted
    existing drivers already know how to handle "please reset me" interrupts
    at any rate, that's true for the HP-UX disk driver

Failover strategy:
  the backup times out waiting for "done with E"
  the backup executes epoch E as backup (no output; I/O suppressed)
  it switches to primary for E+1
  the hypervisor generates "uncertain interrupts" for I/O started <= E,
    including the I/O suppressed during failover epoch E
  during E+1: drivers respond to uncertain interrupts with a reset and a repeat of the
    waiting I/O ops
  (all this is based on footnote 5 on page 4, and IO1 and P8 on page 5)

Potential problem: repeated output
  maybe the primary crashed just as it finished E,
  and it completed some output, e.g. a disk write or a network pkt send
  the hypervisor will generate uncertain interrupts for those I/O ops
  the backup will repeat them in E+1
  Will the outside world tolerate repeated output?
    in general, maybe not -- e.g. controlling an instrument in a scientific experiment
    network protocols: no problem; they must handle duplicated packets anyway
    shared disk: no problem; it's a repeated write of the same data

Do the O/S drivers on the primary talk directly to the h/w?
  I suspect the O/S talks to devices simulated/mediated by the hypervisor
  the hypervisor needs to be able to collect input read from devices and send it to the backup
  so the hypervisor must intervene in (and understand) access to the device h/w
  the hypervisor must also initialize the device h/w when the backup takes over

Figure 1 shows a shared disk
  rather than a separate disk per machine with identical content
  only the primary writes the disk
Why does the shared disk make the design easier?
  disk writes are non-deterministic:
    one disk may have a bad sector and its write fails,
    so (unlike RAM) the backup's disk won't naturally stay identical to the primary's
  it simplifies recovery of a failed machine: no need to copy the whole disk from the survivor
  Won't disk failures ruin the fault-tolerance story?
    that can be fixed separately, with RAID
    and my memory of 1990 is that disks failed much less often than computers crashed

Today's question: what if the system had private disks instead of one dual-ported disk?
  Would that make the system more fault-tolerant?
  should the backup suppress disk I/O instructions?
  should the backup suppress disk interrupts?
  should the backup directly read its own disk? or should the primary forward read data?

What if the ethernet cable breaks?
  the primary is still running
  the backup will try to promote itself
  that's part of what the fail-stop assumption assumes away...
  fail-stop is reasonable for a tightly coupled system:
    they could probably make a 10-foot ethernet pretty reliable,
    or have two separate ethernets

What if we want to re-start a failed+repaired primary?
What if the computers have multiple cores + threads?
  Would this scheme work w/o modification?

What can we expect for performance?
  When will performance be bad?
    frequent interrupts -- they may be delayed by the epoch scheme
    lots of input -- it must be sent over the ethernet
    many privileged instructions -- many hypervisor traps
  Should epochs be short or long?
    short means many end-of-epoch pauses + chatter
    long means less overhead, but I/O interrupts are delayed longer

What performance do they see?
  CPU-bound: Figure 2
  disk-bound (really seek-bound): Figure 3
  Why do writes do better than reads?
  Why is the relative performance much better for disk-bound than for CPU-bound?
  Why does the write performance in Figure 3 level off?
    Why isn't it heading down towards 1.0, as in Figure 2?
  What is the limiting factor in performance?
    the 442-microsecond epoch overhead
    442 us is 22100 instructions (at 50 MHz)
    so 22000-instruction epochs -> 2x slowdown for CPU-bound work
    plus the time to emulate privileged instructions
  Is the 442 microseconds CPU time, or network time?
    the Figure 4 ATM experiment suggests CPU time, but it's not clear

Does anyone use these ideas today?
  VMware has a fault-tolerant VM system
  same basic idea, but more complete and sophisticated:
    no epochs
      the primary has no restrictions on where interrupts occur;
      the backup can cause them to occur at the same place
    the primary holds each output until the backup ACKs,
      to ensure the backup will produce the same output if the primary fails,
      but the primary can continue executing while it waits
    fault-tolerant network disks
    copes with partition by a test-and-set on the disk:
      at most one of the primary and backup will win
      no progress if the network disk is not reachable
    automatic creation of a new backup after a failure,
      on some spare VM host; no need to repair the same hardware
    much faster: only a 10% slowdown, not the paper's 2x

PRIMARY-BACKUP REPLICATION IN LAB 2
What wider classes of failure would we like to handle?
  temporary or permanent loss of connectivity
  network partitions == we can't know whether a server is crashed or just not reachable
What is the risk of a network failure between primary and backup?
  clients, primary, and backup might not agree about who is primary,
    or about whether the backup is alive (and thus whether it is up to date)
  result: two primaries
    e.g. some clients switch to the backup, others don't,
    or one client switches back and forth
  they don't see each other's updates!
  "split brain" does *not* achieve the goal of acting like a single server
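
The VMware FT design above resolves exactly this split-brain risk by arbitrating through one
atomic test-and-set (theirs lives on the shared network disk). A toy Go sketch of the idea,
with invented names and a shared integer standing in for the disk:

  package main

  import (
      "fmt"
      "sync/atomic"
  )

  // primaryClaim stands in for the test-and-set location on the shared disk.
  var primaryClaim int32

  // tryBecomePrimary succeeds for at most one caller: whoever flips the flag
  // from 0 to 1 wins; the loser must stay quiet instead of serving clients
  // with a possibly divergent copy of the data.
  func tryBecomePrimary() bool {
      return atomic.CompareAndSwapInt32(&primaryClaim, 0, 1)
  }

  func main() {
      fmt.Println("S1 wins?", tryBecomePrimary()) // true
      fmt.Println("S2 wins?", tryBecomePrimary()) // false: S2 must not act as primary
  }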