Today: achieving consensus via Paxos, and how to use it

Two weeks ago (and ongoing):
- Lab 2 - primary/backup replication
- Chain Replication
- Limitations of this approach:
  - Lab 2: can only tolerate one replica failure (and sometimes not even that!)
  - both: need to have this view service -- punting on the problem?
- How do we make an actually fault-tolerant service?

Last week: Consensus & FLP
- The consensus problem:
  - multiple processes start with an input value
  - processes run a protocol, then output a chosen value
  - all non-faulty processes choose the same value

Paxos:
- algorithm for achieving consensus in an asynchronous network
- can be used to implement a state machine
- guarantees safety despite any number of faults
- supposedly guarantees progress when any majority of replicas are running
  [really? what about FLP?]
- actually: when a majority of replicas are running and can communicate for
  long enough to run the protocol

Paxos history:
- first Paxos paper written, Lamport 1990
- Viewstamped Replication, Oki & Liskov 1988
- first Paxos paper published, 1998
- research systems, ~mid 90s
- first practical deployments, ~mid 2000s
- Paxos everywhere: 2010s
- Lamport wins the Turing Award, 2014 [Liskov in 2008]

Why such a long gap?
- before its time: how many distributed systems were there in 1990?
- the paper is incredibly complicated!
  - ...especially the original paper, which was written as a metaphor!
  - Lynch et al. did not recognize it as having anything to do with
    distributed systems

Meanwhile, at MIT:
- Liskov and group develop Viewstamped Replication: essentially the same
  protocol, although nobody realizes it for decades
- it was part of a distributed transaction system and language (the Argus
  project), so the replication part was hard to disentangle
- VR Revisited paper, 2012: tries to do exactly that
- also ~2012: the Raft project @ Stanford tries to make an
  easy-to-understand consensus algorithm, similar to VR
- Liskov received the Turing Award in 2008, primarily for the idea of
  abstract data types and their application to programming languages (OOP)

Three challenges about Paxos:
- how does it work?
- why does it work?
- how do we actually use it to build a system?
[these are in roughly increasing order of difficulty, I think!]

Why is replication hard?
- the split brain problem
- consider the Lab 2 design: the primary and backup are unable to
  communicate with each other over the network, but clients can
  communicate with one or both
  - should the backup consider the primary failed and start processing
    requests?
  - what if the primary considers the backup failed and keeps processing
    requests?
  - they could have differing states from that point!
- Lab 2 (& Chain Replication) deal with this by having a view server - whichever replica the view server can talk to is up, the other loses Using consensus for a state machine: (big picture) - need at least 3 replicas [why in a minute], no designated primary, no viewserver - replicas maintain a log of operations - clients send requests to some replica - that replica proposes that client's request as the next operation in log - once replicas achieve consensus on the next entry in the log, execute it and return to the client Today: two approaches to using Paxos - basic approach (Lab 3) - run an instance of Paxos per entry in the log - more sophisticated approach (Viewstamped Replication) - use Paxos to elect a primary (aka leader, distinguished proposer) - more efficient - avoids liveness problems Paxos-per-operation approach (Lab 3): - 3 replicas, no viewserver - each replica has a log: numbered list of Put/Get operations - clients send RPC to any replica (not just primary) - replica starts Paxos proposal for latest log entry to append op - this is a separate Paxos instance from previous log entries - agreement might result in a different op for that log entry! - once agreement received, execute log entries up to that point, respond to client Terminology: - proposers propose a value - acceptors choose one of the proposed values - learners find out which value has been chosen - in the labs, and in pretty much every real Paxos system, every node plays *all three* roles what does Paxos provide? Lab 3 interface on each server: Start(seq, v) -- to propose v as value for instance seq fate, v := Status(seq) -- to find out the agreed value for instance seq correctness: if agreement reached, all agreeing servers agree on same value corollary: once any agreement reached, never changes its mind critical since, after agreement, servers may update state or reply to clients (may not agree if too many lost messages, crashed servers) example: client sends Put(a,b) to S1 S1 picks log entry 3 S1 uses Paxos to get all servers to agree that entry 3 holds Put(a,b) example: client sends Get(a) to S2 S2 picks log entry 4 S2 uses Paxos to get all servers to agree that entry 4 holds Get(a) S2 scans log up to entry 4 to find latest Put(a,...) S2 replies with that value summary of how to use Paxos for RSM: a log of Paxos instances each instance's value is a client command different instances' Paxos agreements are independent this is how Lab 3B works now let's switch to how a single Paxos agreement works Why is agreement hard? - might be multiple conflicting proposals for a given log slot - need to make progress even if there are failures Example 1: - S1 hears Put(x)=1 for op 2, S2 hears Put(x)=3 - each one must do *something* with the operation once it receives it (or we won't make progress) - yet clearly one must change its decision - so we need a multiple round protocol - and the idea of tentative results - challenge: how do we know whether a result is tentative or permanent? Example 2: - S1 and S2 want to accept value Put(x)=1 for slot 2 - S3 and S4 don't respond - Want to be able to complete agreement with failed servers -- so are S3 and S4 failed? - Or are they partitioned, and making the same argument to accept a different value for that slot? - in other words, what about that split brain problem? 
Key ideas in Paxos:
- we may need multiple rounds, but they will converge on some value
- use a majority quorum for agreement
  - this prevents the split brain problem

Key idea: majority quorum
- Suppose we want to tolerate f failures. We need n = 2f+1 replicas.
- Every operation needs to communicate with a majority: f+1 replicas. Why?
  - Any operation needs to be able to proceed after hearing responses from
    n-f replicas, because f replicas might be faulty -- we can't wait for them.
  - But those replicas might not have been faulty, just slow. Then f
    replicas out of our quorum might fail later.
  - We need at least one replica that saw our operation to still be alive:
    (n-f) - f >= 1  =>  n >= 2f+1

Another reason for majority quorums:
- avoiding the split brain problem
- suppose we talk to a majority of servers on each request
- the previous operation also achieved a majority quorum
- key property: any two majority quorums intersect in at least one replica!
  - ...so the second operation is guaranteed to see the first one
- if the system is partitioned, whichever partition has a majority wins
- and if no partition has a majority, we can't make progress -- this has to
  be true to avoid split brain!

The mysterious f:
- f is the number of failures we can tolerate
- for Paxos, n = 2f+1 (Chain Replication: n = f+1; there are other
  protocols with n = 3f+1)
- could we choose n > 2f+1? yes, but there's no point
- where does f come from? an engineering decision trading off the number of
  failures we can tolerate vs. the number of servers we need

Paxos overview:
- proposers select a value
- proposers submit their proposal to the acceptors, trying to assemble a majority
- there might be concurrent proposers, e.g., with multiple clients
  submitting different ops

Strawman:
- proposers send propose(v) to all acceptors and wait for responses
- if they got a matching quorum, success; otherwise, declare it failed
- what can go wrong?
  - a three-way split -- nobody achieves a quorum
    - but maybe we're OK with this: we could just call that a failed
      instance and have everyone retry
  - a more subtle problem:
    - say there are three replicas; two accept value A and one accepts value B
    - what if one of the A's fails?
    - then we're left with one A and one B -- how can future clients know
      which operation won?
- implication: we'll need a multi-phase protocol

Paxos diagram: the basic Paxos exchange

  proposer               acceptors
    prepare(n)       ->
                     <-  prepare_ok(n, n_a, v_a)
    accept(n, v')    ->
                     <-  accept_ok(n)
    decided(v')      ->

What are n and v?
- n is an id for a given proposal
  - to distinguish proposers, and to distinguish multiple attempts from the
    same proposer
  - a unique id, e.g., n = <sequence number, proposer id>
  - note: n identifies a proposal, not an instance -- everything here is
    within a single instance
- v is the value that the proposer wants accepted
- (a sketch of these messages as Go types appears after the safety property
  below)

Definition: server S accepts n/v
  S responded accept_ok to accept(n, v)
Definition: n/v is chosen
  a majority of servers accepted n/v
  note that this is a system-wide property

Key safety property:
- once a value has been chosen, no other value can be chosen
  - the algorithm can't change its mind!
  - even if there's a different proposer (and proposal id), it must end up
    with the same value!
- this is the safety property we need to respond safely to a client and
  provide linearizability and durability
- again, note that "chosen" is a system-wide property, so no replica can
  tell locally that a value has been chosen!
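Aside: as referenced above, here is what the proposal numbers and the
messages of the basic exchange might look like as Go types. The names are
mine, purely for illustration; the lab and real systems define their own.

  package paxos

  // Proposal numbers must be unique across proposers and ordered; a
  // (sequence number, proposer id) pair -- as suggested above -- is one way.
  type ProposalNum struct {
          Seq        int
          ProposerID int
  }

  func (a ProposalNum) GreaterThan(b ProposalNum) bool {
          return a.Seq > b.Seq || (a.Seq == b.Seq && a.ProposerID > b.ProposerID)
  }

  // The messages in the exchange above, for a single instance.
  type PrepareArgs struct {
          N ProposalNum
  }

  type PrepareReply struct {
          OK bool        // false = prepare_reject
          Na ProposalNum // highest proposal this acceptor has accepted, if any...
          Va interface{} // ...and the value it accepted with it (nil if none)
  }

  type AcceptArgs struct {
          N ProposalNum
          V interface{}
  }

  type AcceptReply struct {
          OK bool // false = accept_reject
  }

  type DecidedArgs struct {
          V interface{}
  }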
Paxos protocol idea:
- the proposer sends prepare(n) with a proposal id, *but* doesn't choose a
  value yet
- acceptors respond with any value they've already accepted, and promise
  not to accept any proposal with a lower id
- when the proposer gets a majority of responses:
  - if some value was already accepted, it must propose that value
  - otherwise, it may propose any value it wants (e.g., a client request)

The Paxos protocol itself:

  proposer(v):
    choose n, unique and higher than any n seen so far
    send prepare(n) to all servers, including self
    if prepare_ok(n, n_a, v_a) from majority:
      v' = v_a with highest n_a, or v if no acceptor reported a value
      send accept(n, v') to all
      if accept_ok(n) from majority:
        send decided(v') to all

  acceptor state: (must persist across reboots)
    n_p       (highest prepare seen)
    n_a, v_a  (highest accept seen, and its value)

  acceptor's prepare(n) handler:
    if n > n_p
      n_p = n      -- NOTE: promising not to accept any proposal < n
      reply prepare_ok(n, n_a, v_a)
    else
      reply prepare_reject

  acceptor's accept(n, v) handler:
    if n >= n_p
      n_p = n
      n_a = n
      v_a = v
      reply accept_ok(n)   (and notify learners)
    else
      reply accept_reject

Example 1: common case, no contention
- proposer: broadcast prepare(n=1.0)
- each acceptor sets n_p = 1.0, responds prepare_ok(n=1.0, n_a=nil, v_a=nil)
- proposer receives a quorum, chooses its value V, sends accept(1.0, V)
- each acceptor sets n_a = 1.0, v_a = V, replies accept_ok(n=1.0)

What happens if another proposer wants to propose value Y now?
- depends on what proposal number n' it chooses
- if n' < n: the acceptors all respond prepare_reject
- if n' > n: the acceptors respond prepare_ok(n', n_a=1.0, v_a=V)
  - the proposer is then obligated to carry on with value V, not Y
  - even if the first proposer hadn't finished the accept phase, the second
    one must complete it for it

What is the commit point of the Paxos algorithm?
[this is a surprisingly subtle question!]
- i.e., the point at which, regardless of what failures subsequently
  happen, the algorithm will always proceed to the same value
- answer: once a majority of acceptors has accepted (n, v), i.e., sent
  accept_ok for the same n and v
- why isn't it when a majority of acceptors have sent prepare_ok(n)?
  - after all, at that point no other proposal can succeed with a different value!
  - one reason: there might be a proposal with a higher number that has
    gotten prepare_oks -- it could still be accepted with a different value
  - another: what about the case before, where two replicas had accepted X,
    one had accepted Y, and one of the X's failed?
- this is a knowledge vs. common knowledge thing:
  - it's not enough that a majority of replicas have seen prepare(n)
  - we need a majority to *know* that a majority has seen prepare(n)!

Example 2:
  S1 starts proposing n=10
  S1 sends out just one accept, v=X
  S3 starts proposing n=11, but S1 does not receive its prepare
  S3 only has to wait for a majority of prepare responses

  S1: p10  a10X
  S2: p10        p11
  S3: p10        p11  a11Y

  S1 is still sending out accept messages...
  has a value been chosen?
    no -- no value has been accepted by a majority
  could it still go either way (X or Y) at this point?
    the n=10 proposal can no longer choose X: S2 and S3 (a majority) have
    promised n_p=11 and will reject accept(10, X)
    (note, though, that nothing has been chosen yet: a later proposal whose
    prepare quorum is {S1, S2} would see only 10/X and could still choose X)
  what will happen?
    what will S2 do if it gets the a10X accept msg from S1?
      reject it -- its n_p is higher
    what will S1 do if it gets the a11Y accept msg from S3?
      accept it, and update n_a/v_a to 11/Y
    what if S3 were to crash at this point (and not restart)?
      S3 won't complete its proposal (because it's crashed)
      S1 can't complete its proposal of X (no majority will accept n=10)
      once S1 has accepted a11Y, any server that starts a new proposal will
      see 11/Y in its prepare quorum and must re-propose Y

  state so far:
  S1: n_p=10, n_a=10, v_a=X
  S2: n_p=10 ... n_p=11
  S3: n_p=10 ... n_p=11, n_a=11, v_a=Y

Why does the proposer need to pick the v_a with the highest n_a?
  S1: p10  a10A              p12
  S2: p10        p11  a11B
  S3: p10        p11  a11B   p12  a12?
  n=11's value B has already been chosen (accepted by S2 and S3)
  n=12's prepare quorum sees both A (n_a=10) and B (n_a=11), but must choose B

  why? two cases:
  1. some value was chosen by a majority before n=11
     then n=11's prepares would have seen that value and re-used it,
     so it's safe for n=12 to re-use n=11's value
  2. no value was chosen before n=11
     but n=11 might have obtained a majority itself (here it did),
     so n=12 is *required* to re-use n=11's value

Why does the prepare handler check that n > n_p?
  it isn't strictly needed for safety: a proposer with a stale n will fail
  the n >= n_p check in the accept handler anyway; rejecting at prepare
  just tells it to give up (and retry with a higher n) sooner

Why does the accept handler check n >= n_p?
  to ensure that a later proposer sees any value that could have been
  chosen, by preventing acceptance of an old proposal after the acceptor
  has responded to a newer proposer's prepare
  without the n >= n_p check, you could get this bad scenario:
    S1: p1  p2  a1A
    S2: p1  p2  a1A  a2B
    S3: p1  p2       a2B
  oops: for a while A was chosen, then it changed to B!

Why does the accept handler update n_p = n?
  required to prevent earlier n's from being accepted
  (a server can get accept(n, v) even though it never saw prepare(n))
  without the n_p = n update, you could get this bad scenario:
    S1: p1  a2B  a1A  p3  a3A
    S2: p1       p2   p3  a3A
    S3:          p2   a2B
  oops: for a while B was chosen, then it changed to A!

What if proposer S2 chooses an n lower than S1's n?
  e.g., S2 didn't see any of S1's messages
  S2 won't make progress (its messages will be rejected), so there's no
  correctness problem

What if an acceptor crashes after receiving an accept?
    S1: p1  a1X
    S2: p1  a1X  reboot  p2  a2?
    S3: p1               p2  a2?
  the story: S2 is the only intersection between p1's and p2's majorities,
  and thus the only evidence that Paxos already chose X
  so S2 *must* return X in its prepare_ok
  so S2 must be able to recover its pre-crash state
  thus: if S2 wants to re-join this Paxos instance, it must remember its
  n_p / n_a / v_a on disk

What if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p is not remembered, this could happen:
    S1: p10  a10X
    S2: p10  p11  reboot  a10X  a11Y
    S3:      p11                a11Y
  proposal 11's proposer did not see value X, so 11 proposed its own value Y
  but just before that, X had been chosen -- because S2 forgot that it had
  promised to ignore a10X!

What about FLP?
- no deterministic algorithm for solving consensus is both safe (correct)
  and live (terminates eventually)
- Paxos is an algorithm for solving consensus
- it's safe, but not live -- so how does it get stuck?
  - obviously, if there is no majority that can communicate
  - how about if a majority is available? if proposers immediately retry
    with a higher n after an accept_reject, they can keep each other from
    ever getting an accept accepted
- what can we do about this?
  - don't retry immediately! pause a random amount of time, then retry
  - or designate one proposer (a leader) to do all the proposing

Multi-Paxos
- everything we said before was about a single instance (one entry of the log)
- in reality we are running a series of Paxos instances, one per operation
- an optimization that's widely used: have one replica (the leader) run the
  proposals for all operations
  - that replica can run the first phase for many instances at once:
    prepare(n) means "set n_p to n for this instance and all subsequent
    instances"
  - then, for each operation, we can jump straight to the accept(n, v) phase

----------

Viewstamped Replication

An equivalent protocol to Paxos, presented in terms of state machine
replication (see also Raft). This is *exactly* Multi-Paxos
- so in a sense we don't even need to say any more
- but it is a systems-oriented view

Starting point:
- n = 2f+1 replicas; designate one of them the primary
- each one maintains a numbered log of operations, each either PREPARED or
  COMMITTED
- the primary receives ops from clients and assigns each a sequence number
- then it does a 2-phase commit to all of the replicas:
  - PREPARE
  - PREPARE-OK [it's always OK, since this is replication -- there's no
    ABORT possibility]
  - execute the op and reply to the client
  - COMMIT
- how does this compare to chain replication?
  - essentially the same idea, except we're doing it in parallel rather
    than sequentially down a chain
- 2PC is not very fault tolerant: it requires responses from all replicas

So let's require only a majority quorum of responses
- f+1 PREPARE-OKs (including the primary's own) instead of 2f+1
- this system can tolerate failures of up to f backups
- minor detail: what if one backup gets op n+1 without getting op n?
  - this couldn't happen before, because the primary couldn't move on to
    n+1 without hearing success for n from everyone
  - need a state transfer protocol: the lagging replica asks another
    replica for the missing op and gets it in a reply [or really a whole
    stretch of the log]

Now the hard part: what if the primary crashes?
- need to detect it: timeouts work
- need to replace it with a new primary
- need to make sure that the new primary knows about all committed ops
- need to keep the old primary from completing ops [why?]
- need to make sure that there are no race conditions while we do this!

How do we replace the primary?
- basically, use Paxos!
- every replica keeps a view number; the view number determines the primary
  (e.g., round-robin: primary = view number mod n)
- replicas do not process requests when their view number does not match
- some replica notices the primary is faulty and sends START-VIEW-CHANGE to
  all the others
- on receiving START-VIEW-CHANGE:
  - increment the view number; stop processing requests!
  - send DO-VIEW-CHANGE to the new primary, including the log of all
    previous ops [this is slightly simplified from the paper]
- when the new primary receives DO-VIEW-CHANGE from a majority:
  - take the log with the highest seen (not necessarily committed) op
  - install that log
  - send START-VIEW to all replicas
- (a Go sketch of this message flow appears below, after the correctness
  argument)

Why is this correct? Ab-initio answer:
- the new primary sees every operation that could possibly have committed
  - majority quorum intersection: every committed operation was processed
    by f+1 replicas, and the new primary got f+1 DO-VIEW-CHANGEs, so the
    operation is in at least one of them
- but what if the old primary is trying to commit new operations? can there
  be a race condition?
  - no: once a replica sends DO-VIEW-CHANGE, it stops listening to the old
    primary
  - so either the old primary talked to some replica before that replica
    sent DO-VIEW-CHANGE, in which case the new primary knows about the op,
    or the old primary couldn't reach a majority [quorum intersection again!]
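Aside: a minimal Go sketch of the view-change message flow described above.
It follows the same simplifications as these notes (a single
START-VIEW-CHANGE is enough to trigger a DO-VIEW-CHANGE, and logs are
compared just by length); all type and field names are my own, and the RPC
plumbing is omitted.

  package vr

  type Entry struct {
          Op        interface{}
          Committed bool
  }

  type Replica struct {
          id      int
          n       int       // number of replicas (2f+1)
          view    int       // current view number; primary = view % n
          status  string    // "normal" or "view-change"
          log     []Entry
          dvcLogs [][]Entry // DO-VIEW-CHANGE logs gathered by the new primary
  }

  func (r *Replica) primaryOf(view int) int { return view % r.n }

  // On suspecting the primary has failed, or on receiving START-VIEW-CHANGE
  // for a higher view: move to the new view and stop processing requests.
  func (r *Replica) StartViewChange(newView int) {
          if newView <= r.view {
                  return
          }
          r.view = newView
          r.status = "view-change" // stop accepting PREPAREs from the old primary
          // send DO-VIEW-CHANGE(r.view, r.log) to primaryOf(newView)  [RPC omitted]
  }

  // At the new primary: collect DO-VIEW-CHANGE messages; once a majority
  // (f+1, counting our own log) has arrived, adopt the log with the highest
  // op and start the new view.
  func (r *Replica) OnDoViewChange(view int, log []Entry) {
          if view != r.view {
                  return
          }
          r.dvcLogs = append(r.dvcLogs, log)
          if len(r.dvcLogs) < r.n/2+1 {
                  return
          }
          best := r.log
          for _, l := range r.dvcLogs {
                  if len(l) > len(best) { // highest op seen, committed or not
                          best = l
                  }
          }
          r.log = best
          r.status = "normal"
          // send START-VIEW(r.view, r.log) to all replicas  [RPC omitted]
  }

The f+1 count (r.n/2 + 1, counting the new primary's own log) is exactly
the majority-quorum requirement used in the correctness argument above.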
Another answer: well, it's Paxos
- same idea: a view change = proposing a new primary
- a two-phase protocol where you need responses from a majority
- it extracts a promise from the other replicas not to accept ops in the
  old view
- and the proposer (the new primary) has to find out the latest operations
  accepted in the old view and propose them as part of the new view

Specifically:
- view number = Paxos proposal number
- START-VIEW-CHANGE(v) = prepare(v)
- DO-VIEW-CHANGE(v) = prepare_ok(v)
- START-VIEW(v, log) = accept(v, log entry) for each appropriate instance
- VR's PREPARE(v, opnum, op) = accept(v, op) for instance opnum
- VR's PREPARE-OK(v, opnum) = accept_ok(v) for instance opnum

----------

Paxos performance

What determines Paxos performance?
- we really want to consider the Multi-Paxos common case here
- latency is determined by four message delays (request, PREPARE,
  PREPARE-OK, reply -- in VR terms)
  - hard to do much about this (except Fast Paxos)
- throughput: the usual bottleneck is the number of messages processed by
  the leader
  - it needs to send a message to, and receive a message from, every
    replica for every operation
  - there are lots of tricks you can play to improve this

Batching
- have the leader collect requests, and only run one round of the protocol
  per batch
- amortizes the overhead of a Paxos round over all of them
- throughput then essentially comes down to how fast you can process RPCs
  and execute ops
- from our work: ~30k ops/sec without batching, ~120k with
- helps throughput, hurts latency
  - though there are batching strategies that won't *hurt* your latency much

Partitioning
- most of the time we're not running just one Paxos group
  - e.g., one group per shard in a key-value store
  - so have each machine run replicas in multiple groups, and act as leader
    in only one
  - same number of messages, but it spreads the load around better
- alternative approach: have just one Paxos group, but partition the
  instance space
  - replica 1 is the leader for slots 1, 4, 7, ...
  - this is Mencius [OSDI '08]
  - lots of details to get right
  - can improve throughput; however, you might have to wait for replica 1
    to complete slot 4 before you can execute slot 5 -- even if replica 1
    didn't have anything to propose there
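Aside: to make the slot-partitioning idea concrete, here is a minimal Go
sketch in the spirit of Mencius. The structure and names are my own
invention for illustration, not the actual Mencius protocol, which has many
more details to get right (as noted above).

  package mencius

  type Op struct {
          NoOp bool
          Body string
  }

  type Replica struct {
          id, n    int        // this replica's id, total number of replicas
          pending  []Op       // client ops waiting to be proposed by this replica
          decided  map[int]Op // slot -> decided op
          executed int        // all slots < executed have been executed
  }

  // Slots are statically partitioned round-robin across replicas.
  func (r *Replica) leaderFor(slot int) int { return slot % r.n }

  // For each of our upcoming slots, propose a real op if we have one queued,
  // otherwise a no-op: nobody can execute past our slots until they're filled.
  func (r *Replica) proposeOwnSlots(nextSlot int) {
          for slot := nextSlot; slot < nextSlot+r.n; slot++ {
                  if r.leaderFor(slot) != r.id {
                          continue
                  }
                  op := Op{NoOp: true}
                  if len(r.pending) > 0 {
                          op, r.pending = r.pending[0], r.pending[1:]
                  }
                  r.startAgreement(slot, op) // run the accept phase for this slot
          }
  }

  // Execution is still strictly in slot order, so a gap in another replica's
  // slot (e.g., slot 4) blocks execution of slot 5 until its leader fills it.
  func (r *Replica) executeReady() {
          for {
                  op, ok := r.decided[r.executed]
                  if !ok {
                          return // waiting on some other slot's leader (maybe for a no-op)
                  }
                  if !op.NoOp {
                          r.apply(op)
                  }
                  r.executed++
          }
  }

  func (r *Replica) startAgreement(slot int, op Op) { /* Paxos accept phase, omitted */ }
  func (r *Replica) apply(op Op)                    { /* apply to the state machine, omitted */ }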