Today: wrap up Lamport clocks; outline Lab 2; primary-backup replication

1. Recall: a Lamport clock is a weird way to keep time, but it has the
   property that if a causes (happens before) b -- by happening earlier on
   the same node, or via a message sent from a's node to b's -- then
   C(a) < C(b).

2. Clock update rules:
   IR1: increment the clock on every event within a process
   IR2: put the current timestamp Tm in each message; on receipt at node j,
        Cj = max(Cj, Tm + 1)
   Tiebreaking rule: the clock alone gives only a partial order. To get a
   total order, break ties by node ID: if events at nodes j and j+1 carry
   equal timestamps, order node j's event first.
   (A Go sketch of these rules appears after section 6.)

3. Using logical clocks for mutual exclusion.
   Correctness: one holder at a time; everyone gets the lock in order of
     request.
   Liveness: in the absence of failures, and if every holder releases the
     lock, each requester eventually gets the lock.
   Implementation: every message is timestamped with the value of the
   sender's logical clock.
     1) to request the lock, multicast a request with timestamp Tm to
        everyone
     2) on receiving a lock request, put Tm on the local queue and ack
        (once it is ack'ed by everyone, the request is in every queue)
     3) to release the lock, remove the request from the local queue and
        multicast a release message
     4) on receiving a release message, remove that request from the local
        queue
     5) process i gets the lock if its request is before (=>) everyone
        else's, and no earlier request can still arrive (i.e., i has seen a
        later timestamp from everyone)
   This is an example of a state machine: everyone operates on the same
   information, so everyone stays in sync.
   Example: start with a set of processes, each with a request queue.
     P0 starts with the lock, at T0; this is in everyone's request queue.
     P1 asks for the lock, at T1.
     P2 asks for the lock, at T2.
     P0 releases the lock, at T2.
   Any problems? What if there are failures?
   Lab 2 shows how to generalize this to handle failures.

4. Causality.
   Suppose I only want to avoid causality anomalies: no message that
   depends on a past event may be delivered before that event.
   Example: if a sensor detects a fire, which triggers a sprinkler, no one
   should see it as "sprinkler, then fire". Or the lunch example.
   With Lamport clocks, you need to know whether anyone might still send
   you an earlier message before you can deliver a message.
   A different implementation: vector clocks.
   Vector clock: what is the latest event I have seen from each process?
     VC[i] is the latest event seen from process i
   Pass the vector clock in each message, and update the local clock on
   message receipt -- as before, but now element by element:
     for all i: VC[i] = max(VC[i], msgVC[i]),
     then increment your own entry to count the receive event.
   (A Go sketch appears after section 6.)
   VC defines a partial order on events.
   If every node multicasts each event (so that everyone eventually sees
   every event), then vector clocks can ensure you don't process an event
   until you've seen all prior events from each process.
   This helps with constructing consistent snapshots, a topic I'll discuss
   next week. A snapshot is consistent if taken at the frontier of the
   vector clock: it contains no event that depends on one that is missing.

5. Primary-backup goals:
   availability: still usable despite [some class of] failures
   correctness: act just like a single server to clients
   also: handle network failures and repair

6. State machine replication:
   replicate state (e.g., a key/value store)
   apply operations in the same order
   view each replica as a state machine:
     each replica starts in the same state
     and applies all operations in the same order
     -> so they end up in the same state
   assumption: operations are deterministic
     what if an operation needs the time? if two replicas call gettime(),
     they may get different answers
   many variants of state machine replication;
   today: the primary/backup approach
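To make IR1/IR2 and the tiebreaking rule from section 2 concrete, here is a
minimal Lamport-clock sketch in Go; the package and names are illustrative,
not from the lab code:

    package clocks

    // LamportClock is one process's logical clock (illustrative only).
    type LamportClock struct {
        ID int // process ID, used only to break timestamp ties
        T  int // current logical time
    }

    // Tick implements IR1: increment the clock on every local event.
    func (c *LamportClock) Tick() int {
        c.T++
        return c.T
    }

    // Send ticks for the send event and returns the timestamp to carry
    // in the outgoing message.
    func (c *LamportClock) Send() int {
        return c.Tick()
    }

    // Recv implements IR2 as stated in these notes: on receiving a
    // message stamped tm, set C = max(C, tm+1), so the receive event is
    // ordered after the send.
    func (c *LamportClock) Recv(tm int) int {
        if tm+1 > c.T {
            c.T = tm + 1
        }
        return c.T
    }

    // Before is the tiebreaking total order: compare timestamps first,
    // then break ties by process ID.
    func Before(t1, id1, t2, id2 int) bool {
        return t1 < t2 || (t1 == t2 && id1 < id2)
    }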
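And a companion vector-clock sketch of the section 4 rules, in the same
illustrative package:

    // VC is a vector clock: VC[i] is the latest event count seen from
    // process i. Again illustrative, not lab code.
    type VC []int

    // Tick records a local event at process me.
    func (v VC) Tick(me int) {
        v[me]++
    }

    // Recv merges an incoming message's clock element by element, then
    // counts the receive itself as a local event.
    func (v VC) Recv(me int, msg VC) {
        for i := range v {
            if msg[i] > v[i] {
                v[i] = msg[i]
            }
        }
        v.Tick(me)
    }

    // HappensBefore reports a -> b: a <= b element-wise and a != b.
    // If neither HappensBefore(a, b) nor HappensBefore(b, a) holds, the
    // events are concurrent -- this is where the order is only partial.
    func HappensBefore(a, b VC) bool {
        strict := false
        for i := range a {
            if a[i] > b[i] {
                return false
            }
            if a[i] < b[i] {
                strict = true
            }
        }
        return strict
    }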
7. Basic primary/backup:
   multiple clients, sending requests concurrently
   two server roles: primary and backup
     which server is primary may change due to failures
   clients send operations to the primary (e.g. Put, PutHash, Get)
   the primary chooses the order
   the primary forwards operations (or results) to the backup
   the primary replies to the client
   many challenges:
     non-deterministic operations
     network partition
     state transfer between primary and backup
     new backups after the backup becomes primary
   two primary/backup systems today:
     hypervisor-based fault tolerance: deals with non-determinism, but
       doesn't tolerate network partitions
     Lab 2: deals with network partition

Lab 2 goals: availability and correctness despite server and network
failures. Like the hypervisor paper, but also:
  tolerate network problems, including partition:
    either keep going, correctly,
    or suspend operations until the network is repaired
  replacement of failed servers

Lab 2 overview:
  agreement: a "view server" decides who primary (P) and backup (B) are
    clients and servers ask the view server; they don't make independent
      decisions
    only one view server, which avoids multiple machines independently
      deciding who is primary
  repair: the view server can co-opt an "idle" server as backup after the
    old backup becomes primary; the primary initializes the new backup's
    state
  the tricky part:
    1. only one primary!
    2. the primary must have the state!
    we will work out some rules to ensure these

The view server maintains a sequence of "views":
  view #: primary, backup
       1: S1, S2
       2: S2, --
       3: S2, S3
It monitors server liveness:
  each server periodically sends a Ping RPC
  "dead" if it missed N Pings in a row
  "live" again after a single Ping
More than two servers can Ping the view server;
if more than two, the extras are "idle" servers.
If the primary is dead: new view with the previous backup as primary.
If the backup is dead, or there is no backup: new view with a previously
idle server as backup.
It is OK to have a view with just a primary and no backup,
but -- if an idle server is available, make it the backup.

How to ensure the new primary has an up-to-date replica of the state?
  Only promote the previous backup, i.e. don't make an idle server the
  primary.
But what if the backup hasn't had time to acquire the state?
How to avoid promoting a state-less backup? Example:
  1: S1, S2    S1 stops pinging the view server
  2: S2, S3    S2 *immediately* stops pinging
  3: S3, --
Potential mistake: maybe S3 never got the state from S2.
Better to stay with view 2 (S2, S3); maybe S2 will revive.

How can the view server know it's OK to change views?
Lab 2 answer: the primary in each view must acknowledge that view to the
view server, and the view server must stay with the current view until it
has been acknowledged, even if the primary seems to have failed.
There is no point in proceeding, since "not acked" means the backup may not
be initialized. (A sketch of this logic follows.)
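To make the view-change logic concrete, here is a sketch in Go of how the
view server might decide when to move to a new view. The View/ViewServer
shapes and the helper names are assumptions for illustration, not the lab's
exact code:

    package pbservice

    // View names the primary and backup for one view number.
    type View struct {
        Viewnum uint
        Primary string
        Backup  string
    }

    type ViewServer struct {
        current View
        acked   bool            // has current.Primary acked current.Viewnum?
        live    map[string]bool // derived from recent Pings
        idle    []string        // live servers that are neither P nor B
    }

    // takeIdle removes and returns an idle server to use as backup, or "".
    func (vs *ViewServer) takeIdle() string {
        if len(vs.idle) == 0 {
            return ""
        }
        s := vs.idle[0]
        vs.idle = vs.idle[1:]
        return s
    }

    // maybeChangeView applies the two key rules: never move past an
    // unacknowledged view, and never promote anyone but the backup.
    func (vs *ViewServer) maybeChangeView() {
        if !vs.acked {
            return // stay with the current view until the primary acks it
        }
        v := vs.current
        switch {
        case !vs.live[v.Primary]:
            if v.Backup == "" {
                return // nobody with state to promote; wait for a revival
            }
            v.Primary = v.Backup // only the backup may become primary
            v.Backup = vs.takeIdle()
        case v.Backup != "" && !vs.live[v.Backup]:
            v.Backup = vs.takeIdle() // replace a dead backup, maybe with ""
        case v.Backup == "" && len(vs.idle) > 0:
            v.Backup = vs.takeIdle() // fill an empty backup slot
        default:
            return // nothing to change
        }
        v.Viewnum++
        vs.current, vs.acked = v, false
    }

The !vs.acked guard is what prevents the mistaken jump to view 3 with a
possibly state-less S3 in the example above.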
Q: can more than one server think it is primary?
  1: S1, S2    the network breaks, so the view server thinks S1 is dead,
               but it's actually alive
  2: S2, --
  Now S1 is alive and unaware of view #2, so S1 still thinks it is primary,
  AND S2 is alive and thinks it is primary
  => split brain, no good.
  [In particular, a client C1 that didn't see the new view could contact S1
  while another client C2 could contact S2; if they updated the same key,
  that would result in an inconsistency.]

How to ensure only one server *acts as* primary, even though more than one
may *think* it is primary? ("acts as" == executes and responds to client
requests)
The basic idea:
  1: S1, S2
  2: S2, --
  S1 still thinks it is primary, but S1 must forward ops to S2;
  S2 thinks S2 is primary, so S2 can reject S1's forwarded ops.
The rules (a sketch of these in Go appears at the end of these notes):
  1. the primary in view i must have been the primary or backup in view i-1
  2. the primary must wait for the backup to accept each request
     (including Gets)
  3. the primary must wait to respond to the client until the backup has
     acknowledged executing the request
  4. a non-backup must reject forwarded requests
  5. a non-primary must reject direct client requests
  6. every operation must be before or after state transfer (see below)
Example:
  1: S1, S2    the view server stops hearing Pings from S1
  2: S2, --    it may be a while before S2 hears about view #2
  Before S2 hears about view #2:
    S1 can process ops from clients; S2 will accept forwarded requests.
    S2 will reject ops from clients who have heard about view #2
    [and trigger a thread to fetch the new view].
  After S2 hears:
    if S1 receives a client request, it will forward it, and S2 will reject
    it, so S1 can no longer act as primary.
    S1 will send an error to the client; the client will ask the view
    server for the new view, and re-send to S2.
  The true moment of switch-over occurs when S2 hears about view #2.

How can a new backup get the state, e.g. the key/value state?
  If S2 is backup in view i but was not in view i-1, S2 should ask the
  primary to transfer the complete state.
Rule for state transfer:
  every operation (Put/Get) must be either before or after the state
  transfer:
    if before, the transferred state must reflect the operation
    if after, the primary must forward the operation once the transfer
    finishes

Q: does the primary need to forward Get()s to the backup? After all, Get()
doesn't change anything, so why does the backup need to know? And the extra
RPC costs time.
A: Remember that Get/Put/PutHash must provide exactly-once semantics.
Assume the primary doesn't forward a Get. In the example above, with C1 and
C2 in contact with the old P and the old B (now the new P), where the old P
and C1 haven't seen the new view, reads would be answered by the old P --
nothing in a read would force the old P to discover the view change.
Suppose that, concurrently, C2 updates the key at the new P. C1 will read
the key but not see the new result.

Q: is it ever OK to read only at the primary?
A: YES! If the view has no backup, then no new primary can take over
without the old primary first acknowledging a view change, so there's no
problem. Also, Puts can be done at the primary alone if there is no backup.
In fact they must be: if the system is to be available, you need to be able
to do operations with only a primary and no backup.

Q: at MIT, the example used to show that you can't read only at the primary
was this one. But it is actually not a problem! Why?
  C1 sends a Get to P, but not to B.
  C1 doesn't receive a response.
  C2 sends a Put, which is processed by P and B.
  P fails; B takes over.
  C1 re-sends the Get, and observes the Put, even though it supposedly
  shouldn't.

Q: how could we make primary-only Get()s work?

Q: are there cases when the Lab 2 protocol cannot make forward progress?
A: Yes. Many:
  the view service fails
  the primary fails before it acknowledges a view
  the backup fails before it gets the new state
  ...
We will start fixing those in Lab 3.
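Finally, a minimal sketch of rules 2-5 in action, assuming a hypothetical
PBServer type, call() helper, and RPC names (the lab's real API differs);
it shows how a primary handles a Get and how a non-backup rejects forwards:

    package pbservice // continuing the sketch above; reuses View

    import "net/rpc"

    type GetArgs struct{ Key string }

    type GetReply struct {
        Err   string
        Value string
    }

    type PBServer struct {
        me   string
        view View              // latest view fetched from the view server
        data map[string]string // the replicated key/value state
    }

    // Get handles a direct client request.
    func (pb *PBServer) Get(args *GetArgs, reply *GetReply) error {
        if pb.view.Primary != pb.me {
            reply.Err = "ErrWrongServer" // rule 5: non-primary rejects
            return nil
        }
        if pb.view.Backup != "" {
            // rules 2-3: forward every op (even a Get) and wait for the
            // backup's ack before replying to the client.
            var fr GetReply
            ok := call(pb.view.Backup, "PBServer.ForwardGet", args, &fr)
            if !ok || fr.Err != "" {
                // the backup rejected us: we may no longer be primary
                reply.Err = "ErrWrongServer"
                return nil
            }
        }
        reply.Value = pb.data[args.Key]
        return nil
    }

    // ForwardGet handles an op forwarded by a (supposed) primary.
    func (pb *PBServer) ForwardGet(args *GetArgs, reply *GetReply) error {
        if pb.view.Backup != pb.me {
            reply.Err = "ErrWrongServer" // rule 4: non-backup rejects
        }
        return nil
    }

    // call is a stand-in for a synchronous RPC helper (the lab supplies
    // its own version).
    func call(srv, name string, args interface{}, reply interface{}) bool {
        c, err := rpc.Dial("unix", srv)
        if err != nil {
            return false
        }
        defer c.Close()
        return c.Call(name, args, reply) == nil
    }

Note how a rejected forward is the way a deposed S1 learns, indirectly,
that it is no longer primary: it returns an error, and the client then asks
the view server for the current view and re-sends to S2.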