Viewstamped Replication and Beyond

1) Brief discussion of Metasync; errata from the previous class:

Someone asked if we require Dropbox to have serializable updates. It's a bit weaker than what I said: the message log has to be serializable for writes; for reads, it can return any snapshot (a consistent read, but OK if it is from the past).

For file operations, our software writes a complete copy of the new file system, and then puts a pointer to that copy in the message log, along with a pointer to the previous version of the file system. We only consider an update as having committed if no other commit occurred in the meantime (even if it was "accepted" at a majority).

How do you update files on a file sync service without removing any old data, so that the update can be atomic? Content-based naming: each file block is named by the hash of its contents.

2) Viewstamped replication

Many find the Viewstamped Replication paper easier to understand as a guide to how to use Paxos in a real system.

Central goal: maintain a replicated operation log that is consistent everywhere and available for performing operations if a majority of nodes are up (for a sufficient length of time). Progress (in some cases) and safety (in all cases).

The log feeds a replicated state machine (e.g., the hash table in assignment 2): perform an operation once it has been assigned the same location in a majority of logs.

2.1 Simple primary/backup replication

Liskov proceeds by starting simple:
- The client sends to the primary.
- The primary picks the log order and forwards to the backups.
- Each backup enters the op into its log in order and replies.
- The primary gathers a majority (commit!), does the op, and replies to the client. [The backups can now also do the op.]

2.2 View change

What happens if the primary goes down? Draw a picture of the logs at the primary and at a backup. At the primary: some ops committed, some in progress, some not requested yet. Same at a backup: some ops are at a majority of backups, some are not.

Any node can call for a new view. The new view is established when the new leader collects a majority. (Note that an even newer view might have been started in the meantime! So getting a majority only implies that no further progress will occur on new requests from earlier views.)

The new primary collects logs from the majority and uses the one with the highest seen (not necessarily committed) operation; this is where it is similar to Paxos. Anything marked committed in any log is committed. Anything else needs to reach a majority before the new primary can declare it committed. But the highest-numbered operation in any log *might* have been committed, so the new primary can only place new client requests after that operation. (See the sketch below.)
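To make the log-merge rule concrete, here is a minimal Python sketch; the names (Entry, merge_logs) are illustrative, not from the paper. The real protocol compares viewstamps (view number, then op number), which this approximates by comparing each log's last entry.

    from dataclasses import dataclass

    @dataclass
    class Entry:
        view: int        # view in which this op was assigned its slot
        op_num: int      # position in the log
        op: str          # the client operation
        committed: bool  # some primary marked this slot committed

    def merge_logs(logs):
        """logs: one log (a list of Entry, in op_num order) from each
        replica in the majority that joined the new view."""
        # Use the log whose last entry has the highest (view, op_num) --
        # highest *seen*, not highest committed, because the top entry
        # might have committed in the old view without us knowing.
        best = max(logs, key=lambda log: (log[-1].view, log[-1].op_num)
                                         if log else (-1, -1))
        merged = [Entry(e.view, e.op_num, e.op, e.committed) for e in best]
        # Anything marked committed in *any* of the collected logs is
        # committed.
        committed_slots = {e.op_num for log in logs
                                    for e in log if e.committed}
        for e in merged:
            if e.op_num in committed_slots:
                e.committed = True
        # Everything still uncommitted must reach a majority in the new
        # view before the new primary may declare it committed, and new
        # client requests go only after the last op_num in `merged`.
        return merged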
2.3 Cold vs. hot backups

How up to date do you need to keep the backup?
- Cold backups: just log the sequence of ops. Recovery then requires rebuilding the state from the log.
- Hot standby backups: we want the new replica to be able to step in instantly.

2.4 Fast reads

2.4.1 Leases

Give a lease to the primary; the primary revalidates before the lease expires. No new view is allowed until the lease expires.

2.4.2 Snapshot reads

Allow the primary to return an old (but consistent) read.

2.5 Byzantine resilience

Goal: resilience to a small number of incorrect participants, whether due to bugs or to malice. [Let's put off the discussion of whether that's a reasonable assumption.]

Prior to the Castro and Liskov paper, the best known algorithm for Byzantine fault tolerance (BFT) was exponential in the number of nodes (!). Castro and Liskov show that it is "just" quadratic.

Where in Viewstamped Replication could a Byzantine node mess things up?
- The client sends to the primary and gets back a response. [The primary could be Byzantine.]
- The primary picks an order and forwards it to the backups; it could give a different order to different backups.
- A backup might reply to a request without logging it.
- A backup might trigger a view change when it's not needed.
- A backup might provide a bogus log during a view change.

Solution: do everything as a group. Like in a horror movie: "OK, everyone go around in pairs." Does that work? Seems absurd! One of you is a zombie, and so I go out alone with someone who might be a zombie? No way!

Transitivity requires evidence. If the primary tells a client it got enough replies from the backups, why should the client believe it? But if the primary presents signed replies from enough backups, OK!

To tolerate f failures and f Byzantine nodes, we need 3f + 1 replicas:
- The client sends a signed request to the primary. [The client must have a fallback position that avoids this primary!]
- The primary sends a pre-prepare to everyone (with the client's signature).
- The primary gets back 2f + 1 replies. (f nodes may have failed and never reply; of the 2f + 1 that do reply, up to f may be Byzantine, so at least f + 1 are correct.)
- The primary sends a commit to everyone (with the 2f + 1 replica signatures).
- The replicas reply back to the client with the signed result of the operation.
- The client is done when it gets f + 1 matching replies. (See the quorum sketch below.)
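And a minimal Python sketch of the quorum arithmetic above; the names and message format are illustrative, not PBFT's actual wire protocol.

    from collections import Counter

    F = 1                        # faults tolerated (assumed here)
    N = 3 * F + 1                # total replicas
    PREPARE_QUORUM = 2 * F + 1   # acks the primary must gather
    REPLY_QUORUM = F + 1         # matching replies the client needs

    def primary_has_quorum(ack_ids):
        """ack_ids: set of replica ids that sent signed acks.
        2f+1 of 3f+1 is the most we can safely wait for (f replicas
        may never answer), and of the 2f+1 that do answer, at most f
        lie, leaving f+1 correct replicas that logged the request."""
        return len(ack_ids) >= PREPARE_QUORUM

    def client_result(replies):
        """replies: list of (replica_id, result) pairs, each assumed
        signed. The client accepts a result once f+1 replicas agree
        on it: at least one of those f+1 must be correct."""
        counts = Counter(result for _, result in replies)
        for result, n in counts.items():
            if n >= REPLY_QUORUM:
                return result
        return None  # keep waiting, or retransmit / suspect the primary

    # With f = 1: two honest "ok" replies outvote one bogus reply.
    print(client_result([(0, "ok"), (1, "ok"), (2, "bogus")]))  # ok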