Why do we want to order events?

Suppose I wanted to build a distributed make server
- central file server that holds source and object files
- each file has a timestamp
- clients specify the timestamps when they write a file
- use timestamps to determine dependencies: rebuild object file O if it depends on some source file X that is newer
- so we ought to be able to rebuild just what we need

What could go wrong?
- client clocks aren't synchronized, so a client could write a file with an old timestamp; we won't rebuild files that depend on it
- we could use the server's clock... if there's only one server
- how synchronized would we need to be? granularity = close enough that an edit made after a build reliably gets a newer timestamp than the build's outputs

Another example: posting to Facebook
- remove boss as friend
- post "My boss is the worst, I need a new job"
- really don't want to get these in the wrong order!

Could we wind up doing these things in the wrong order?
- the data is not stored in just one server - there are O(1M) servers
- each request actually requires talking to O(100) servers!
- there are lots of layers of caching: memcache in the data center, cross-datacenter replication, edge caches
- how do we update all of these things consistently?
- can we just use the wall clock? no, same reason

Is this really a problem?
- how well are clocks synchronized in practice?
  - (how would we measure this?)
  - Amazon EC2 measurements range from ~50us within the same datacenter to 200+ms for wide-area links - and highly variable
- is this bad? you can do an RPC in ~3us (Linux) or even ~0.4us (Arrakis), so conceivably your clock could be off by anywhere from 125 to 625,000 RPCs!
- put another way, 200ms is a "human-scale" (noticeable) number

Two approaches:
- physical clocks
- logical clocks

Physical clock synchronization

Straw-man approach:
- pick a server as a master and have it broadcast the time
- every client receives this message and sets its clock to the master's
- how do we set the master's clock?
- why doesn't this work?

Latency distribution
- not uniform
- why doesn't it go to zero? speed-of-light delays; transmission delays; routing delays
- what causes the variance? queuing delays in the network, processing delays on the client/server
- can it go arbitrarily high? packet loss; retries; routing loops

Can we build a better clock synchronization algorithm?

Interrogation-based approach
- client sends an RPC to the server; the server replies with its time
- client keeps track of the send and receive times
- assume symmetric latency
  - is this true? not really, but there's not much we can do about it
- set the clock to the server's time + 1/2 the round-trip latency

How precise can we get this?
- depends on the variance in latency
- what does the accuracy depend on?

Doing better:
- divide latency into minimum + variance
- average over multiple readings and/or multiple servers
- try to manage the clock rate skew
- use hardware support to track added delays (PTP)

Logical clocks (Lamport clocks)
- an alternative way to keep track of time
- based on the idea that events that aren't causally related are concurrent
- doesn't require any form of physical clock or synchronization

Definitions
- process: abstractly, a sequence of events that are ordered
  - instructions in a thread of execution
  - operations being processed on a server (assuming there's no concurrency within the server!)
  - etc.
- events: either local computation, or send/receive(message)
- messages: any communication between two processes

Happens-before relationship:
- within a process, event a before event b means a -> b
- if a = send(M) and b = receive(M), then a -> b
- transitivity: a -> b & b -> c => a -> c (why?)

Examples (referring to the figure from Lamport's paper)
- does p1 happen-before p3? yes, same process
- does p1 happen-before r1? no
- does p1 happen-before q2? yes, message
- does p1 happen-before r4? yes, chain

What does all this mean?
- a -> b means "a could have influenced b"
- if a --/--> b, then must b -> a? no!
  - what does this mean? these events are concurrent!
- what does it mean for events to be concurrent?
  - if we had to put the events in a single global order, we could put either a or b first
  - it would not affect the system, i.e., nobody could tell the difference!

Abstract logical clock
- if a -> b then C(a) < C(b), meaning:
  - if a, b are on the same process P_i, C_i(a) < C_i(b)
  - if a = send(m) on P_i and b = recv(m) on P_j, C_i(a) < C_j(b)
  - ...and this requires that C_i and C_j be comparable

How to maintain these timestamps?
- note that there are multiple possible implementations; here's one (see the sketch below):
  - each process i increments a counter C_i between any two events
  - each process i sends T_m (= C_i at the time of send) with each message
  - on receiving message m, process j sets C_j to max(C_j+1, T_m+1)
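As a concrete illustration, here is a minimal sketch of that implementation in Go. It is one possible realization of the rules above; the names (Clock, Tick, Send, Recv) are ours, not from any particular library.

    // One possible Lamport clock implementation.
    package lamport

    import "sync"

    type Clock struct {
        mu sync.Mutex
        c  uint64 // C_i: this process's counter
    }

    // Tick advances C_i; call it between any two local events.
    func (lc *Clock) Tick() uint64 {
        lc.mu.Lock()
        defer lc.mu.Unlock()
        lc.c++
        return lc.c
    }

    // Send returns the timestamp T_m to attach to an outgoing message.
    func (lc *Clock) Send() uint64 {
        return lc.Tick()
    }

    // Recv folds in the timestamp T_m of an incoming message:
    // C_j = max(C_j+1, T_m+1).
    func (lc *Clock) Recv(tm uint64) uint64 {
        lc.mu.Lock()
        defer lc.mu.Unlock()
        if tm >= lc.c {
            lc.c = tm + 1
        } else {
            lc.c++
        }
        return lc.c
    }

Note how Recv guarantees the send/receive rule: the receiver's clock always ends up strictly greater than T_m.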
What does this give us?
- if a -> b then C(a) < C(b)
- is the converse true: if C(a) < C(b), then a -> b?
  - no: consider p3 and q3 in the figure - they could also be concurrent
  - if we were to use the Lamport clock as a global order, then we would be creating some unnecessary constraints

Could we build a logical clock where the converse *is* true?
- yes
- note that this clock will still tell us some events are concurrent!
- strawman: the clock is actually a dependency list, e.g. a list of all previous events; whenever process i receives a message from process j, i merges j's dependency list into its own
- more efficient strawman: vector clocks (later this quarter!)

Suppose I only want to avoid causality anomalies: no message that depends on a past event is delivered before that event.
- with Lamport clocks, you need to know whether anyone will send you an earlier message before you can deliver a message

----------

Distributed snapshots

Suppose we're running a large ML computation, e.g. PageRank
- thousands of servers
- each holds some subset of web pages
- each page starts out with some reputation
- each iteration: some of that page's reputation gets transferred to the pages it links to (which might be on different servers!)

What if a server crashes?
- how did MapReduce deal with this? [reassign its jobs to someone else]
- why can't we do that here? [the servers have state]

If we wanted to take checkpoints, how often would we need to do that?
- how would we synchronize them?
- what would go wrong?
- could we just discard messages that we sent more than once?

----------

How do we build a system that transparently tolerates server failures?
- replication!
- use (at least) f+1 replicas to tolerate f failures [what kind of failures?]
- act just like a single copy to the clients
- the challenge is coordinating operations so that they are applied with the same results at all replicas, even if messages drop and replicas fail

State machine replication
- hardly anyone realizes this, but the earliest mention is in Lamport's clocks paper!
- model the system as a state machine
  - maintains some amount of state
  - transition function: (input, state) -> new state
  - output function: (input, state) -> output
- in other words, the system's state and output are entirely determined by its inputs

Can everything be modeled as a state machine? (see the sketch below)
- needs to be deterministic
  - what about clocks, randomness, etc.? parallelism on multicores?
- state needs to include everything
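To make the model concrete, here is a minimal sketch in Go of a key/value store as a deterministic state machine, assuming Put and Get are the only inputs; the names are illustrative.

    // A key/value store modeled as a deterministic state machine.
    package smr

    type Input struct {
        Op    string // "Put" or "Get"
        Key   string
        Value string // used only by Put
    }

    type StateMachine struct {
        state map[string]string // the entire replicated state
    }

    func New() *StateMachine {
        return &StateMachine{state: make(map[string]string)}
    }

    // Apply combines the transition function ((input, state) -> new state)
    // and the output function ((input, state) -> output). Because Apply is
    // deterministic, replicas that apply the same inputs in the same order
    // reach the same state and produce the same outputs.
    func (sm *StateMachine) Apply(in Input) (output string) {
        switch in.Op {
        case "Put":
            sm.state[in.Key] = in.Value
            return ""
        case "Get":
            return sm.state[in.Key]
        }
        return "" // unknown ops are no-ops in this sketch
    }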
Goal: achieve a consistent order of operations
- what does consistent mean here?
- is causal enough?
- serializable: there is a single global order of operations consistent with what each client sees
- actually strict serializable / linearizable: a notion of time is included too; if a finishes before b starts, then a comes before b in the serial order

There are many ways to achieve this goal
- today: chain replication
- Lab 2: simpler primary/backup replication (simplified chain replication)
- two weeks: Paxos / Viewstamped Replication

Why are we doing chain replication?
- used occasionally, e.g., HyperDex
- an instantiation of primary-backup, which is used widely (database replication)
- fairly straightforward - similar to Lab 2
- will compare to other approaches (Paxos) later

Primary copy approach
- some form of this is very common in replication
- key idea: have a primary sequence all requests
- when a primary is active, all other replicas execute requests in the primary's assigned order
- the client sees results consistent with that order
- the client doesn't get results back / declare success until the request is processed by enough replicas [here, f+1 of f+1; elsewhere, f+1 of 2f+1]
- when the primary fails, we replace it and make sure that the new primary knows the order of all successful operations *this is the really hard part!*

What assumptions does CR make about the environment?
- operations are read/write
- f+1 nodes to tolerate f failures
- nodes fail only by crashing, and crashes are detected
- a fault-tolerant master service keeps track of the system membership
How realistic are these assumptions?

Chain replication approach (a sketch of the update path appears after the failure discussion below)
- organize the nodes into a chain
- updates are sent to the head and propagated down the chain; the tail responds
- key invariant: each node has a superset of the state of all subsequent nodes in the chain
- what is the commit point?
- we can do the same with reads, or send them directly to the tail

Failures
- what happens if the tail fails?
  - remove it and make its predecessor the new tail
  - is this OK? the tail didn't have any operations that everyone else didn't have too
  - the new tail might have processed some operations the old tail hadn't, but that's strictly better
- what happens if the head fails?
  - just remove it and tell everyone that the next node is the head
  - can we lose operations? yes, if the head was the only one to process them
  - is this OK? yes - either some other replica processed the operation, in which case all the others will too, or it's lost entirely, which is OK too
- what happens if a node in the middle fails?
  - cut it out of the chain
  - but we need to ensure that we didn't drop any requests on the floor
  - use acknowledgments so that everyone knows how far the tail has gotten
  - have the predecessor resend any messages that the tail hasn't processed yet
- what happens when we add a node?
  - have the tail send it all operations
  - once it has the state, it is ready to start serving as the tail
  - have the old tail forward all new operations to the new tail
  - notify the master and clients about the new tail
  - important to get this order right!
- what if the master fails? game over!
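Here is a minimal sketch of a chain node's update path in Go, assuming reliable FIFO links between neighbors; the message and field names are illustrative.

    // A chain replication node: apply locally, then pass the update down.
    package chainrep

    type Update struct {
        Key, Value string
    }

    type Node struct {
        store map[string]string
        next  func(Update) // forward to the successor; nil at the tail
        reply func(Update) // the tail uses this to answer the client
    }

    // HandleUpdate applies an update and propagates it. Because a node
    // applies before forwarding, its state is always a superset of the
    // state of every node after it - the key invariant.
    func (n *Node) HandleUpdate(u Update) {
        n.store[u.Key] = u.Value
        if n.next != nil {
            n.next(u) // head or middle: keep propagating
        } else {
            n.reply(u) // tail: the update is committed; answer the client
        }
    }

    // HandleRead runs at the tail, so it sees only committed updates.
    func (n *Node) HandleRead(key string) string {
        return n.store[key]
    }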
Lab 2 overview

agreement: a "view server" decides who primary (P) and backup (B) are
  clients and servers ask the view server; they don't make independent decisions
  there is only one view server, which avoids multiple machines independently deciding who is P

repair: the view server can co-opt an "idle" server as B after the old B becomes P
  the primary initializes the new backup's state

the tricky part:
  1. only one primary!
  2. the primary must have state!
we will work out some rules to ensure these

the view server maintains a sequence of "views"
  view #, primary, backup
  1: S1 S2
  2: S2 --
  3: S2 S3

it monitors server liveness
  each server periodically sends a Ping RPC
  "dead" if it missed N Pings in a row
  "live" after a single Ping

there can be more than two servers Pinging the view server
  if more than two, the extras are "idle" servers

if the primary is dead
  new view, with the previous backup as primary
if the backup is dead, or there is no backup
  new view, with a previously idle server as backup
it's OK to have a view with just a primary and no backup
  but -- if an idle server is available, make it the backup

how do we ensure the new primary has an up-to-date replica of the state?
  only promote the previous backup
  i.e. don't make an idle server the primary
but what if the backup hasn't had time to acquire the state?
how do we avoid promoting a state-less backup?
example:
  1: S1 S2
     S1 stops pinging the view server
  2: S2 S3
     S2 *immediately* stops pinging
  3: S3 --
  potential mistake: maybe S3 never got the state from S2
  better to stay with view 2: S2/S3 - maybe S2 will revive

how can the view server know it's OK to change views? (see the sketch below)
  lab 2 answer: the primary in each view must acknowledge that view to the view server
  the view server must stay with the current view until it is acknowledged,
    even if the primary seems to have failed
  there is no point in proceeding, since not acked == the backup may not be initialized

Q: can more than one server think it is primary?
  1: S1, S2
     the network breaks, so the view server thinks S1 is dead
     but it's actually alive
  2: S2, --
  now S1 is alive and unaware of view #2, so S1 still thinks it is primary
  AND S2 is alive and thinks it is primary
  => split brain, no good
  [in particular, a client C1 that didn't see the new view could contact S1
   while another client C2 contacts S2; if they updated the same key, that
   would result in an inconsistency]

how do we ensure only one server *acts as* primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    1: S1 S2
    2: S2 --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary
    so S2 can reject S1's forwarded ops

the rules (a sketch of the primary's side appears at the end of these notes):
  1. the primary in view i must have been the primary or backup in view i-1
  2. the primary must wait for the backup to accept each request (incl. Get)
  3. the primary must wait to respond to the client until the backup has
     acked that it executed the request
  4. a non-backup must reject forwarded requests
  5. a non-primary must reject direct client requests
  6. every operation must be before or after state transfer (see below)

example:
  1: S1, S2
     the view server stops hearing Pings from S1
  2: S2, --
     it may be a while before S2 hears about view #2
  before S2 hears about view #2:
    S1 can process ops from clients, and S2 will accept forwarded requests
    S2 will reject ops from clients who have heard about view #2
      [and trigger a thread to fetch the new view]
  after S2 hears:
    if S1 receives a client request, it will forward it, and S2 will reject it
    so S1 can no longer act as primary
    S1 will send an error to the client; the client will ask the view server
      for the new view, and then re-send the request to S2
  the true moment of switch-over occurs when S2 hears about view #2

how can the new backup get the state? e.g. the key/value state
  if S2 is the backup in view i but was not in view i-1,
    S2 should ask the primary to transfer the complete state
rule for state transfer:
  every operation (Put/Get) must be either before or after the state transfer
    if before, the transferred state must reflect the operation
    if after, the primary must forward the operation after the transfer finishes
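Here is a minimal sketch, in Go, of how the view server's tick might apply these rules, assuming liveness has already been computed from recent Pings. The field and function names are ours, not the lab's required API.

    // View server view-change logic: never move on until the current
    // primary has acked, and only ever promote the backup.
    package viewservice

    type View struct {
        Num     uint
        Primary string
        Backup  string
    }

    type ViewServer struct {
        current View
        acked   bool            // has current.Primary acked current.Num?
        live    map[string]bool // computed from recent Pings
        idle    []string        // live servers that are neither P nor B
    }

    func (vs *ViewServer) tick() {
        if !vs.acked {
            return // stay with this view, even if the primary seems dead
        }
        v := vs.current
        switch {
        case !vs.live[v.Primary]:
            if v.Backup == "" {
                return // nobody eligible to promote: stuck (see lab 3)
            }
            v.Primary, v.Backup = v.Backup, vs.takeIdle()
            vs.advance(v)
        case v.Backup != "" && !vs.live[v.Backup]:
            v.Backup = vs.takeIdle() // may be "": a view with no backup
            vs.advance(v)
        case v.Backup == "":
            if s := vs.takeIdle(); s != "" {
                v.Backup = s
                vs.advance(v)
            }
        }
    }

    func (vs *ViewServer) takeIdle() string {
        if len(vs.idle) == 0 {
            return ""
        }
        s := vs.idle[0]
        vs.idle = vs.idle[1:]
        return s
    }

    func (vs *ViewServer) advance(v View) {
        v.Num = vs.current.Num + 1
        vs.current, vs.acked = v, false
    }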
Q: does the primary need to forward Get()s to the backup?
   after all, Get() doesn't change anything, so why does the backup need to
   know? and the extra RPC costs time

A: Remember that Get/Put/PutHash must provide exactly-once semantics.
   Assume the primary doesn't forward a Get. In the example above, with C1
   and C2 in contact with the old P and the old B / new P, where the old P
   and C1 haven't seen the new view, reads would be answered by the old P --
   nothing in a read would force the old P to discover the view change.
   Suppose that, concurrently, C2 updates the key at the new P. C1 will
   read the key but not see the new result.

Q: is it ever OK to read only at the primary?
   YES! If the view has no backup, then there can be no new P without
   notifying the old primary first, so there's no problem.
   Puts, too, can be done at the primary alone if there is no backup. In
   fact they must be, if the system is to be available: you need to be able
   to do operations with only a primary and no backup.

Q: at MIT, the example they used to show that you can't read only at the
   primary was this one. But it is actually not a problem! Why?
     C1 sends Get to P, but not to B
     C1 doesn't receive a response
     C2 sends Put, which is processed by P and B
     P fails
     B takes over
     C1 resends Get, and observes the Put, even though it (supposedly)
     shouldn't

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
   Yes. Many.
     the view service fails
     the primary fails before it acknowledges the view
     the backup fails before it gets the new state
     ....
   We will start fixing those in lab 3
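Finally, to tie rules 2-5 together, a minimal sketch of the primary's Put path in Go. The call helper stands in for the lab's RPC layer, and all names are illustrative, not the lab's required structure.

    // The primary's side of rules 2, 3, and 5; the backup's rejection
    // (rule 4) surfaces as an error from the forward call.
    package pbservice

    import "errors"

    var ErrWrongServer = errors.New("not the primary for this view")

    type View struct {
        Num     uint
        Primary string
        Backup  string
    }

    type Server struct {
        me    string
        view  View // the current view as this server knows it
        store map[string]string
    }

    // call stands in for an RPC helper; here it trivially succeeds.
    var call = func(srv, method, key, value string) error { return nil }

    func (s *Server) Put(key, value string) error {
        if s.view.Primary != s.me {
            return ErrWrongServer // rule 5: reject direct client requests
        }
        if s.view.Backup != "" {
            // rules 2 and 3: don't apply or reply until the backup accepts
            err := call(s.view.Backup, "Server.ApplyForwarded", key, value)
            if err != nil {
                // the backup rejected us (rule 4): we are probably a stale
                // primary; the client will fetch the new view and retry
                return ErrWrongServer
            }
        }
        s.store[key] = value
        return nil
    }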