Today: achieving consensus via Paxos, and how to use it

Two weeks ago (and ongoing):
- Lab 2 - primary/backup replication
- Chain Replication
- Limitations of this approach:
  - Lab 2: can only tolerate one replica failure (and sometimes not even that!)
  - both: need to have this view service -- punting on the problem?
- How do we make an actually fault-tolerant service?

Last week: Consensus & FLP
- The consensus problem:
  - multiple processes start with an input value
  - processes run a protocol, then output a chosen value
  - all non-faulty processes choose the same value

Paxos:
- algorithm for achieving consensus in an asynchronous network
- can be used to implement a state machine
- guarantees safety despite any number of faults
- supposedly guarantees progress when any majority of replicas are running
  [really? what about FLP?]
- actually: when a majority of replicas are running and can communicate for
  long enough to run the protocol

Paxos history:
- first Paxos paper written, Lamport 1990
- Viewstamped Replication, Oki & Liskov 1988
- first Paxos paper published, 1998
- research systems, ~mid 90s
- first practical deployments, ~mid 2000s
- Paxos everywhere: 2010s
- Lamport wins the Turing Award, 2014 [Liskov in 2008]

Why such a long gap?
- before its time: how many distributed systems were there in 1990?
- the paper is incredibly complicated!
  - ...especially the original paper, which was written as a metaphor!
  - Lynch et al. did not recognize it as having anything to do with
    distributed systems

Meanwhile, at MIT:
- Liskov and group develop Viewstamped Replication: essentially the same
  protocol, although nobody realizes it for decades
- it was part of a distributed transaction system and language (the Argus
  project), so the replication part was hard to disentangle
- VR Revisited paper, 2012: tries to do exactly that
- also ~2012: the Raft project @ Stanford tries to make an
  easy-to-understand consensus algorithm, similar to VR
- Liskov received the Turing Award in 2008, primarily for the idea of
  abstract data types and their application to programming languages (OOP)

Three challenges about Paxos:
- how does it work?
- why does it work?
- how do we actually use it to build a system?
[these are in roughly increasing order of difficulty, I think!]

Why is replication hard?
- the split brain problem
- consider the Lab 2 design: the primary and backup are unable to
  communicate with each other over the network, but clients can
  communicate with one or both
  - should the backup consider the primary failed and start processing
    requests?
  - what if the primary considers the backup failed and keeps processing
    requests?
  - they could have differing states from that point!
- Lab 2 (& Chain Replication) deal with this by having a view server - whichever replica the view server can talk to is up, the other loses Using consensus for a state machine: (big picture) - need at least 3 replicas [why in a minute], no designated primary, no viewserver - replicas maintain a log of operations - clients send requests to some replica - that replica proposes that client's request as the next operation in log - once replicas achieve consensus on the next entry in the log, execute it and return to the client Today: two approaches to using Paxos - basic approach (Lab 3) - run an instance of Paxos per entry in the log - more sophisticated approach (Viewstamped Replication) - use Paxos to elect a primary (aka leader, distinguished proposer) - more efficient - avoids liveness problems Paxos-per-operation approach (Lab 3): - 3 replicas, no viewserver - each replica has a log: numbered list of Put/Get operations - clients send RPC to any replica (not just primary) - replica starts Paxos proposal for latest log entry to append op - this is a separate Paxos instance from previous log entries - agreement might result in a different op for that log entry! - once agreement received, execute log entries up to that point, respond to client Terminology: - proposers propose a value - acceptors choose one of the proposed values - learners find out which value has been chosen - in the labs, and in pretty much every real Paxos system, every node plays *all three* roles what does Paxos provide? Lab 3 interface on each server: Start(seq, v) -- to propose v as value for instance seq fate, v := Status(seq) -- to find out the agreed value for instance seq correctness: if agreement reached, all agreeing servers agree on same value corollary: once any agreement reached, never changes its mind critical since, after agreement, servers may update state or reply to clients (may not agree if too many lost messages, crashed servers) example: client sends Put(a,b) to S1 S1 picks log entry 3 S1 uses Paxos to get all servers to agree that entry 3 holds Put(a,b) example: client sends Get(a) to S2 S2 picks log entry 4 S2 uses Paxos to get all servers to agree that entry 4 holds Get(a) S2 scans log up to entry 4 to find latest Put(a,...) S2 replies with that value summary of how to use Paxos for RSM: a log of Paxos instances each instance's value is a client command different instances' Paxos agreements are independent this is how Lab 3B works now let's switch to how a single Paxos agreement works Why is agreement hard? - might be multiple conflicting proposals for a given log slot - need to make progress even if there are failures Example 1: - S1 hears Put(x)=1 for op 2, S2 hears Put(x)=3 - each one must do *something* with the operation once it receives it (or we won't make progress) - yet clearly one must change its decision - so we need a multiple round protocol - and the idea of tentative results - challenge: how do we know whether a result is tentative or permanent? Example 2: - S1 and S2 want to accept value Put(x)=1 for slot 2 - S3 and S4 don't respond - Want to be able to complete agreement with failed servers -- so are S3 and S4 failed? - Or are they partitioned, and making the same argument to accept a different value for that slot? - in other words, what about that split brain problem? 
Key ideas in Paxos:
- we may need multiple rounds, but they will converge on some value
- use a majority quorum for agreement
  - this prevents the split brain problem

Key idea: majority quorum
- Suppose we want to tolerate f failures. We need n = 2f+1 replicas.
- Every operation needs to communicate with a majority: f+1 replicas. Why?
  - Any operation needs to be able to proceed after hearing responses from
    n-f replicas, because f replicas might be faulty -- we can't wait for them.
  - But those replicas might not have been faulty, just slow. Then f
    replicas out of our quorum might fail later.
  - We need at least one replica that saw our operation to still be alive:
    (n-f) - f >= 1  =>  n >= 2f+1

Another reason for majority quorums:
- avoiding the split brain problem
- suppose we talk to a majority of servers on each request
- the previous operation also achieved a majority quorum
- key property: any two majority quorums intersect in at least one replica!
  - ...so the second operation is guaranteed to see the first one
- if the system is partitioned, whichever partition has a majority wins
- and if no partition has a majority, we can't make progress -- this has to
  be true to avoid split brain!

The mysterious f:
- f is the number of failures we can tolerate
- for Paxos, n = 2f+1 (Chain Replication: n = f+1; there are other
  protocols with n = 3f+1)
- could we choose n > 2f+1? yes, but there's no point
- where does f come from? an engineering decision trading off the number of
  failures we can tolerate vs. the number of servers we need

Paxos overview:
- proposers select a value
- proposers submit their proposal to the acceptors, trying to assemble a majority
- there might be concurrent proposers, e.g., with multiple clients
  submitting different ops

Strawman:
- proposers send propose(v) to all acceptors and wait for responses
- if they got a matching quorum, success; otherwise, declare it failed
- what can go wrong?
  - a three-way split -- nobody achieves a quorum
    - but maybe we're OK with this: we could just call that a failed
      instance and have everyone retry
  - a more subtle problem:
    - say there are three replicas; two accept value A and one accepts value B
    - what if one of the A's fails?
    - then we're left with one A and one B -- how can future clients know
      which operation won?
- implication: we'll need a multi-phase protocol

Paxos diagram: the basic Paxos exchange

  proposer               acceptors
    prepare(n)       ->
                     <-  prepare_ok(n, n_a, v_a)
    accept(n, v')    ->
                     <-  accept_ok(n)
    decided(v')      ->

What are n and v?
- n is an id for a given proposal
  - to distinguish proposers, and to distinguish multiple attempts from the
    same proposer
  - a unique id, e.g., n = <sequence number, proposer id>
  - note: n identifies a proposal, not an instance -- everything here is
    within a single instance
- v is the value that the proposer wants accepted
- (a sketch of these messages as Go types appears after the safety property
  below)

Definition: server S accepts n/v
  S responded accept_ok to accept(n, v)
Definition: n/v is chosen
  a majority of servers accepted n/v
  note that this is a system-wide property

Key safety property:
- once a value has been chosen, no other value can be chosen
  - the algorithm can't change its mind!
  - even if there's a different proposer (and proposal id), it must end up
    with the same value!
- this is the safety property we need to respond safely to a client and
  provide linearizability and durability
- again, note that "chosen" is a system-wide property, so no replica can
  tell locally that a value has been chosen!
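Aside: as referenced above, here is what the proposal numbers and the
messages of the basic exchange might look like as Go types. The names are
mine, purely for illustration; the lab and real systems define their own.

  package paxos

  // Proposal numbers must be unique across proposers and ordered; a
  // (sequence number, proposer id) pair -- as suggested above -- is one way.
  type ProposalNum struct {
          Seq        int
          ProposerID int
  }

  func (a ProposalNum) GreaterThan(b ProposalNum) bool {
          return a.Seq > b.Seq || (a.Seq == b.Seq && a.ProposerID > b.ProposerID)
  }

  // The messages in the exchange above, for a single instance.
  type PrepareArgs struct {
          N ProposalNum
  }

  type PrepareReply struct {
          OK bool        // false = prepare_reject
          Na ProposalNum // highest proposal this acceptor has accepted, if any...
          Va interface{} // ...and the value it accepted with it (nil if none)
  }

  type AcceptArgs struct {
          N ProposalNum
          V interface{}
  }

  type AcceptReply struct {
          OK bool // false = accept_reject
  }

  type DecidedArgs struct {
          V interface{}
  }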
Paxos protocol idea:
- the proposer sends prepare(n) with a proposal id, *but* doesn't choose a
  value yet
- acceptors respond with any value they've already accepted, and promise
  not to accept any proposal with a lower id
- when the proposer gets a majority of responses:
  - if some value was already accepted, it must propose that value
  - otherwise, it may propose any value it wants (e.g., a client request)

The Paxos protocol itself:

  proposer(v):
    choose n, unique and higher than any n seen so far
    send prepare(n) to all servers, including self
    if prepare_ok(n, n_a, v_a) from majority:
      v' = v_a with highest n_a, or v if no acceptor reported a value
      send accept(n, v') to all
      if accept_ok(n) from majority:
        send decided(v') to all

  acceptor state: (must persist across reboots)
    n_p       (highest prepare seen)
    n_a, v_a  (highest accept seen, and its value)

  acceptor's prepare(n) handler:
    if n > n_p
      n_p = n      -- NOTE: promising not to accept any proposal < n
      reply prepare_ok(n, n_a, v_a)
    else
      reply prepare_reject

  acceptor's accept(n, v) handler:
    if n >= n_p
      n_p = n
      n_a = n
      v_a = v
      reply accept_ok(n)   (and notify learners)
    else
      reply accept_reject

Example 1: common case, no contention
- proposer: broadcast prepare(n=1.0)
- each acceptor sets n_p = 1.0, responds prepare_ok(n=1.0, n_a=nil, v_a=nil)
- proposer receives a quorum, chooses its value V, sends accept(1.0, V)
- each acceptor sets n_a = 1.0, v_a = V, replies accept_ok(n=1.0)

What happens if another proposer wants to propose value Y now?
- depends on what proposal number n' it chooses
- if n' < n: the acceptors all respond prepare_reject
- if n' > n: the acceptors respond prepare_ok(n', n_a=1.0, v_a=V)
  - the proposer is then obligated to carry on with value V, not Y
  - even if the first proposer hadn't finished the accept phase, the second
    one must complete it for it

What is the commit point of the Paxos algorithm?
[this is a surprisingly subtle question!]
- i.e., the point at which, regardless of what failures subsequently
  happen, the algorithm will always proceed to the same value
- answer: once a majority of acceptors has accepted (n, v), i.e., sent
  accept_ok for the same n and v
- why isn't it when a majority of acceptors have sent prepare_ok(n)?
  - after all, at that point no other proposal can succeed with a different value!
  - one reason: there might be a proposal with a higher number that has
    gotten prepare_oks -- it could still be accepted with a different value
  - another: what about the case before, where two replicas had accepted X,
    one had accepted Y, and one of the X's failed?
- this is a knowledge vs. common knowledge thing:
  - it's not enough that a majority of replicas have seen prepare(n)
  - we need a majority to *know* that a majority has seen prepare(n)!

Example 2:
  S1 starts proposing n=10
  S1 sends out just one accept, v=X
  S3 starts proposing n=11, but S1 does not receive its prepare
  S3 only has to wait for a majority of prepare responses

  S1: p10  a10X
  S2: p10        p11
  S3: p10        p11  a11Y

  S1 is still sending out accept messages...
  has a value been chosen?
    no -- no value has been accepted by a majority
  could it still go either way (X or Y) at this point?
    the n=10 proposal can no longer choose X: S2 and S3 (a majority) have
    promised n_p=11 and will reject accept(10, X)
    (note, though, that nothing has been chosen yet: a later proposal whose
    prepare quorum is {S1, S2} would see only 10/X and could still choose X)
  what will happen?
    what will S2 do if it gets the a10X accept msg from S1?
      reject it -- its n_p is higher
    what will S1 do if it gets the a11Y accept msg from S3?
      accept it, and update n_a/v_a to 11/Y
    what if S3 were to crash at this point (and not restart)?
      S3 won't complete its proposal (because it's crashed)
      S1 can't complete its proposal of X (no majority will accept n=10)
      once S1 has accepted a11Y, any server that starts a new proposal will
      see 11/Y in its prepare quorum and must re-propose Y

  state so far:
  S1: n_p=10, n_a=10, v_a=X
  S2: n_p=10 ... n_p=11
  S3: n_p=10 ... n_p=11, n_a=11, v_a=Y

Why does the proposer need to pick the v_a with the highest n_a?
  S1: p10  a10A              p12
  S2: p10        p11  a11B
  S3: p10        p11  a11B   p12  a12?
  n=11's value B has already been chosen (accepted by S2 and S3)
  n=12's prepare quorum sees both A (n_a=10) and B (n_a=11), but must choose B

  why? two cases:
  1. some value was chosen by a majority before n=11
     then n=11's prepares would have seen that value and re-used it,
     so it's safe for n=12 to re-use n=11's value
  2. no value was chosen before n=11
     but n=11 might have obtained a majority itself (here it did),
     so n=12 is *required* to re-use n=11's value

Why does the prepare handler check that n > n_p?
  it isn't strictly needed for safety: a proposer with a stale n will fail
  the n >= n_p check in the accept handler anyway; rejecting at prepare
  just tells it to give up (and retry with a higher n) sooner

Why does the accept handler check n >= n_p?
  to ensure that a later proposer sees any value that could have been
  chosen, by preventing acceptance of an old proposal after the acceptor
  has responded to a newer proposer's prepare
  without the n >= n_p check, you could get this bad scenario:
    S1: p1  p2  a1A
    S2: p1  p2  a1A  a2B
    S3: p1  p2       a2B
  oops: for a while A was chosen, then it changed to B!

Why does the accept handler update n_p = n?
  required to prevent earlier n's from being accepted
  (a server can get accept(n, v) even though it never saw prepare(n))
  without the n_p = n update, you could get this bad scenario:
    S1: p1  a2B  a1A  p3  a3A
    S2: p1       p2   p3  a3A
    S3:          p2   a2B
  oops: for a while B was chosen, then it changed to A!

What if proposer S2 chooses an n lower than S1's n?
  e.g., S2 didn't see any of S1's messages
  S2 won't make progress (its messages will be rejected), so there's no
  correctness problem

What if an acceptor crashes after receiving an accept?
    S1: p1  a1X
    S2: p1  a1X  reboot  p2  a2?
    S3: p1               p2  a2?
  the story: S2 is the only intersection between p1's and p2's majorities,
  and thus the only evidence that Paxos already chose X
  so S2 *must* return X in its prepare_ok
  so S2 must be able to recover its pre-crash state
  thus: if S2 wants to re-join this Paxos instance, it must remember its
  n_p / n_a / v_a on disk

What if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p is not remembered, this could happen:
    S1: p10  a10X
    S2: p10  p11  reboot  a10X  a11Y
    S3:      p11                a11Y
  proposal 11's proposer did not see value X, so 11 proposed its own value Y
  but just before that, X had been chosen -- because S2 forgot that it had
  promised to ignore a10X!

What about FLP?
- no deterministic algorithm for solving consensus is both safe (correct)
  and live (terminates eventually)
- Paxos is an algorithm for solving consensus
- it's safe, but not live -- so how does it get stuck?
  - obviously, if there is no majority that can communicate
  - how about if a majority is available? if proposers immediately retry
    with a higher n after an accept_reject, they can keep each other from
    ever getting an accept accepted
- what can we do about this?
  - don't retry immediately! pause a random amount of time, then retry
  - or designate one proposer (a leader) to do all the proposing

Multi-Paxos
- everything we said before was about a single instance (one entry of the log)
- in reality we are running a series of Paxos instances, one per operation
- an optimization that's widely used: have one replica (the leader) run the
  proposals for all operations
  - that replica can run the first phase for many instances at once:
    prepare(n) means "set n_p to n for this instance and all subsequent
    instances"
  - then, for each operation, we can jump straight to the accept(n, v) phase

----------

Viewstamped Replication

An equivalent protocol to Paxos, presented in terms of state machine
replication (see also Raft). This is *exactly* Multi-Paxos
- so in a sense we don't even need to say any more
- but it is a systems-oriented view

Starting point:
- n = 2f+1 replicas; designate one of them the primary
- each one maintains a numbered log of operations, each either PREPARED or
  COMMITTED
- the primary receives ops from clients and assigns each a sequence number
- then it does a 2-phase commit to all of the replicas:
  - PREPARE
  - PREPARE-OK [it's always OK, since this is replication -- there's no
    ABORT possibility]
  - execute the op and reply to the client
  - COMMIT
- how does this compare to chain replication?
  - essentially the same idea, except we're doing it in parallel rather
    than sequentially down a chain
- 2PC is not very fault tolerant: it requires responses from all replicas

So let's require only a majority quorum of responses
- f+1 PREPARE-OKs (including the primary's own) instead of 2f+1
- this system can tolerate failures of up to f backups
- minor detail: what if one backup gets op n+1 without getting op n?
  - this couldn't happen before, because the primary couldn't move on to
    n+1 without hearing success for n from everyone
  - need a state transfer protocol: the lagging replica asks another
    replica for the missing op and gets it in a reply [or really a whole
    stretch of the log]

Now the hard part: what if the primary crashes?
- need to detect it: timeouts work
- need to replace it with a new primary
- need to make sure that the new primary knows about all committed ops
- need to keep the old primary from completing ops [why?]
- need to make sure that there are no race conditions while we do this!

How do we replace the primary?
- basically, use Paxos!
- every replica keeps a view number; the view number determines the primary
  (e.g., round-robin: primary = view number mod n)
- replicas do not process requests when their view number does not match
- some replica notices the primary is faulty and sends START-VIEW-CHANGE to
  all the others
- on receiving START-VIEW-CHANGE:
  - increment the view number; stop processing requests!
  - send DO-VIEW-CHANGE to the new primary, including the log of all
    previous ops [this is slightly simplified from the paper]
- when the new primary receives DO-VIEW-CHANGE from a majority:
  - take the log with the highest seen (not necessarily committed) op
  - install that log
  - send START-VIEW to all replicas
- (a Go sketch of this message flow appears below, after the correctness
  argument)

Why is this correct? Ab-initio answer:
- the new primary sees every operation that could possibly have committed
  - majority quorum intersection: every committed operation was processed
    by f+1 replicas, and the new primary got f+1 DO-VIEW-CHANGEs, so the
    operation is in at least one of them
- but what if the old primary is trying to commit new operations? can there
  be a race condition?
  - no: once a replica sends DO-VIEW-CHANGE, it stops listening to the old
    primary
  - so either the old primary talked to some replica before that replica
    sent DO-VIEW-CHANGE, in which case the new primary knows about the op,
    or the old primary couldn't reach a majority [quorum intersection again!]
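Aside: a minimal Go sketch of the view-change message flow described above.
It follows the same simplifications as these notes (a single
START-VIEW-CHANGE is enough to trigger a DO-VIEW-CHANGE, and logs are
compared just by length); all type and field names are my own, and the RPC
plumbing is omitted.

  package vr

  type Entry struct {
          Op        interface{}
          Committed bool
  }

  type Replica struct {
          id      int
          n       int       // number of replicas (2f+1)
          view    int       // current view number; primary = view % n
          status  string    // "normal" or "view-change"
          log     []Entry
          dvcLogs [][]Entry // DO-VIEW-CHANGE logs gathered by the new primary
  }

  func (r *Replica) primaryOf(view int) int { return view % r.n }

  // On suspecting the primary has failed, or on receiving START-VIEW-CHANGE
  // for a higher view: move to the new view and stop processing requests.
  func (r *Replica) StartViewChange(newView int) {
          if newView <= r.view {
                  return
          }
          r.view = newView
          r.status = "view-change" // stop accepting PREPAREs from the old primary
          // send DO-VIEW-CHANGE(r.view, r.log) to primaryOf(newView)  [RPC omitted]
  }

  // At the new primary: collect DO-VIEW-CHANGE messages; once a majority
  // (f+1, counting our own log) has arrived, adopt the log with the highest
  // op and start the new view.
  func (r *Replica) OnDoViewChange(view int, log []Entry) {
          if view != r.view {
                  return
          }
          r.dvcLogs = append(r.dvcLogs, log)
          if len(r.dvcLogs) < r.n/2+1 {
                  return
          }
          best := r.log
          for _, l := range r.dvcLogs {
                  if len(l) > len(best) { // highest op seen, committed or not
                          best = l
                  }
          }
          r.log = best
          r.status = "normal"
          // send START-VIEW(r.view, r.log) to all replicas  [RPC omitted]
  }

The f+1 count (r.n/2 + 1, counting the new primary's own log) is exactly
the majority-quorum requirement used in the correctness argument above.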
Another answer: well, it's Paxos
- same idea: a view change = proposing a new primary
- a two-phase protocol where you need responses from a majority
- it extracts a promise from the other replicas not to accept ops in the
  old view
- and the proposer (the new primary) has to find out the latest operations
  accepted in the old view and propose them as part of the new view

Specifically:
- view number = Paxos proposal number
- START-VIEW-CHANGE(v) = prepare(v)
- DO-VIEW-CHANGE(v) = prepare_ok(v)
- START-VIEW(v, log) = accept(v, log entry) for each appropriate instance
- VR's PREPARE(v, opnum, op) = accept(v, op) for instance opnum
- VR's PREPARE-OK(v, opnum) = accept_ok(v) for instance opnum

----------

Paxos performance

What determines Paxos performance?
- we really want to consider the Multi-Paxos common case here
- latency is determined by four message delays (request, PREPARE,
  PREPARE-OK, reply -- in VR terms)
  - hard to do much about this (except Fast Paxos)
- throughput: the usual bottleneck is the number of messages processed by
  the leader
  - it needs to send a message to, and receive a message from, every
    replica for every operation
  - there are lots of tricks you can play to improve this

Batching
- have the leader collect requests, and only run one round of the protocol
  per batch
- amortizes the overhead of a Paxos round over all of them
- throughput then essentially comes down to how fast you can process RPCs
  and execute ops
- from our work: ~30k ops/sec without batching, ~120k with
- helps throughput, hurts latency
  - though there are batching strategies that won't *hurt* your latency much

Partitioning
- most of the time we're not running just one Paxos group
  - e.g., one group per shard in a key-value store
  - so have each machine run replicas in multiple groups, and act as leader
    in only one
  - same number of messages, but it spreads the load around better
- alternative approach: have just one Paxos group, but partition the
  instance space
  - replica 1 is the leader for slots 1, 4, 7, ...
  - this is Mencius [OSDI '08]
  - lots of details to get right
  - can improve throughput; however, you might have to wait for replica 1
    to complete slot 4 before you can execute slot 5 -- even if replica 1
    didn't have anything to propose there
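Aside: to make the slot-partitioning idea concrete, here is a minimal Go
sketch in the spirit of Mencius. The structure and names are my own
invention for illustration, not the actual Mencius protocol, which has many
more details to get right (as noted above).

  package mencius

  type Op struct {
          NoOp bool
          Body string
  }

  type Replica struct {
          id, n    int        // this replica's id, total number of replicas
          pending  []Op       // client ops waiting to be proposed by this replica
          decided  map[int]Op // slot -> decided op
          executed int        // all slots < executed have been executed
  }

  // Slots are statically partitioned round-robin across replicas.
  func (r *Replica) leaderFor(slot int) int { return slot % r.n }

  // For each of our upcoming slots, propose a real op if we have one queued,
  // otherwise a no-op: nobody can execute past our slots until they're filled.
  func (r *Replica) proposeOwnSlots(nextSlot int) {
          for slot := nextSlot; slot < nextSlot+r.n; slot++ {
                  if r.leaderFor(slot) != r.id {
                          continue
                  }
                  op := Op{NoOp: true}
                  if len(r.pending) > 0 {
                          op, r.pending = r.pending[0], r.pending[1:]
                  }
                  r.startAgreement(slot, op) // run the accept phase for this slot
          }
  }

  // Execution is still strictly in slot order, so a gap in another replica's
  // slot (e.g., slot 4) blocks execution of slot 5 until its leader fills it.
  func (r *Replica) executeReady() {
          for {
                  op, ok := r.decided[r.executed]
                  if !ok {
                          return // waiting on some other slot's leader (maybe for a no-op)
                  }
                  if !op.NoOp {
                          r.apply(op)
                  }
                  r.executed++
          }
  }

  func (r *Replica) startAgreement(slot int, op Op) { /* Paxos accept phase, omitted */ }
  func (r *Replica) apply(op Op)                    { /* apply to the state machine, omitted */ }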