Viewstamped Replication and Beyond

1) Brief discussion of Metasync; errata from the previous class:

Someone asked if we require Dropbox to have serializable updates. It's a bit weaker than what I said: the message log has to be serializable for writes; for reads, it can return any snapshot (a consistent read, but OK if it is from the past).

For file operations, our software writes a complete copy of the new file system, and then puts a pointer to that copy in the message log, along with a pointer to the previous version of the file system. We only consider an update as having committed if no other commit occurred in the meantime (even if it was "accepted" at a majority).

How do you update files on a file sync service without removing any old data, so that the update can be atomic? Content-based naming: each file block is named by the hash of its contents.

2) Viewstamped replication

Many find the Viewstamped Replication paper easier to understand as a guide to how to use Paxos in a real system.

Central goal: maintain a replicated operation log that is consistent everywhere and available for performing operations if a majority of nodes are up (for a sufficient length of time). Progress (in some cases) and safety (in all cases).

The log feeds a replicated state machine (e.g., the hash table in assignment 2): perform an operation once it has been assigned the same location in a majority of logs.

2.1 Simple primary/backup replication

Liskov proceeds by starting simple:
- The client sends to the primary.
- The primary picks the log order and forwards to the backups.
- Each backup enters the op into its log in order and replies.
- The primary gathers a majority (commit!), does the op, and replies to the client. [The backups can now also do the op.]

2.2 View change

What happens if the primary goes down? Draw a picture of the logs at the primary and at a backup. At the primary: some ops committed, some in progress, some not requested yet. Same at a backup: some ops are at a majority of backups, some are not.

Any node can call for a new view. The new view is established when the new leader collects a majority. (Note that an even newer view might have been started in the meantime! So getting a majority only implies that no further progress will occur on new requests from earlier views.)

The new primary collects logs from the majority and uses the one with the highest seen (not necessarily committed) operation; this is where it is similar to Paxos. Anything marked committed in any log is committed. Anything else needs to reach a majority before the new primary can declare it committed. But the highest-numbered operation in any log *might* have been committed, so the new primary can only place new client requests after that operation. (See the sketch below.)
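To make the log-merge rule concrete, here is a minimal Python sketch; the names (Entry, merge_logs) are illustrative, not from the paper. The real protocol compares viewstamps (view number, then op number), which this approximates by comparing each log's last entry.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        view: int        # view in which this op was assigned its slot
        op_num: int      # position in the log
        op: str          # the client operation
        committed: bool  # some primary marked this slot committed

    def merge_logs(logs):
        """logs: one log (a list of Entry, in op_num order) from each
        replica in the majority that joined the new view."""
        # Use the log whose last entry has the highest (view, op_num) --
        # highest *seen*, not highest committed, because the top entry
        # might have committed in the old view without us knowing.
        best = max(logs, key=lambda log: (log[-1].view, log[-1].op_num)
                                         if log else (-1, -1))
        merged = [Entry(e.view, e.op_num, e.op, e.committed) for e in best]
        # Anything marked committed in *any* of the collected logs is
        # committed.
        committed_slots = {e.op_num for log in logs
                                    for e in log if e.committed}
        for e in merged:
            if e.op_num in committed_slots:
                e.committed = True
        # Everything still uncommitted must reach a majority in the new
        # view before the new primary may declare it committed, and new
        # client requests go only after the last op_num in `merged`.
        return merged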
2.3 Cold vs. hot backups

How up to date do you need to keep the backup?
- Cold backups: just log the sequence of ops. Recovery then requires rebuilding the state from the log.
- Hot standby backups: we want the new replica to be able to step in instantly.

2.4 Fast reads

2.4.1 Leases

Give a lease to the primary; the primary revalidates before the lease expires. No new view is allowed until the lease expires.

2.4.2 Snapshot reads

Allow the primary to return an old (but consistent) read.

2.5 Byzantine resilience

Goal: resilience to a small number of incorrect participants, whether due to bugs or to malice. [Let's put off the discussion of whether that's a reasonable assumption.]

Prior to the Castro and Liskov paper, the best known algorithm for Byzantine fault tolerance (BFT) was exponential in the number of nodes (!). Castro and Liskov show that it is "just" quadratic.

Where in Viewstamped Replication could a Byzantine node mess things up?
- The client sends to the primary and gets back a response. [The primary could be Byzantine.]
- The primary picks an order and forwards it to the backups; it could give a different order to different backups.
- A backup might reply to a request without logging it.
- A backup might trigger a view change when it's not needed.
- A backup might provide a bogus log during a view change.

Solution: do everything as a group. Like in a horror movie: "OK, everyone go around in pairs." Does that work? Seems absurd! One of you is a zombie, and so I go out alone with someone who might be a zombie? No way!

Transitivity requires evidence. If the primary tells a client it got enough replies from the backups, why should the client believe it? But if the primary presents signed replies from enough backups, OK!

To tolerate f failures and f Byzantine nodes, we need 3f + 1 replicas:
- The client sends a signed request to the primary. [The client must have a fallback position that avoids this primary!]
- The primary sends a pre-prepare to everyone (with the client's signature).
- The primary gets back 2f + 1 replies. (f nodes may have failed and never reply; of the 2f + 1 that do reply, up to f may be Byzantine, so at least f + 1 are correct.)
- The primary sends a commit to everyone (with the 2f + 1 replica signatures).
- The replicas reply back to the client with the signed result of the operation.
- The client is done when it gets f + 1 matching replies. (See the quorum sketch below.)
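And a minimal Python sketch of the quorum arithmetic above; the names and message format are illustrative, not PBFT's actual wire protocol.

    from collections import Counter

    F = 1                        # faults tolerated (assumed here)
    N = 3 * F + 1                # total replicas
    PREPARE_QUORUM = 2 * F + 1   # acks the primary must gather
    REPLY_QUORUM = F + 1         # matching replies the client needs

    def primary_has_quorum(ack_ids):
        """ack_ids: set of replica ids that sent signed acks.
        2f+1 of 3f+1 is the most we can safely wait for (f replicas
        may never answer), and of the 2f+1 that do answer, at most f
        lie, leaving f+1 correct replicas that logged the request."""
        return len(ack_ids) >= PREPARE_QUORUM

    def client_result(replies):
        """replies: list of (replica_id, result) pairs, each assumed
        signed. The client accepts a result once f+1 replicas agree
        on it: at least one of those f+1 must be correct."""
        counts = Counter(result for _, result in replies)
        for result, n in counts.items():
            if n >= REPLY_QUORUM:
                return result
        return None  # keep waiting, or retransmit / suspect the primary

    # With f = 1: two honest "ok" replies outvote one bogus reply.
    print(client_result([(0, "ok"), (1, "ok"), (2, "bogus")]))  # ok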