Practical Byzantine Fault Tolerance
Castro and Liskov

Basic idea: take the essence of the Byzantine Generals algorithm and build an acceptably efficient client-server system that can survive Byzantine failures among the servers. For reasons similar to those in the Schneider paper, the notion of surviving Byzantine client failures is ill-defined, and all we can really do is carefully define what a faulty client can and cannot do to the system.

Fundamentally, the algorithm is yet another incarnation of a replicated state machine, where voting is used to prevent a small number of faulty nodes from overriding the majority (it requires the same 3f+1 total nodes to tolerate f faulty nodes as before), and a total order is established on the events in the system to keep all the non-faulty replicas in lockstep.

Unlike most of the earlier work, this paper assumes a network with potentially unbounded delay. The algorithm guarantees safety even in the face of such an asynchronous network, but does not (and cannot) guarantee liveness unless some upper bound can be placed on network delay. This network model also requires the use of cryptography to establish authentication, but since this is orthogonal to the main protocol I will treat it separately below.

As in the Paxos algorithm, there is a notion of normal-case behavior, where a "primary" node coordinates ordering and manages the consensus algorithm, and a bad-case protocol (used only if the primary appears faulty), where nodes communicate in a more expensive peer-to-peer style to ensure safety.

What I have always liked about this work is that the algorithm is simple, straightforward, and easy to describe. Unlike the recursive Byzantine Generals algorithm, it is easy to see how many steps and messages are required for each operation. The basic steps of the algorithm (sketched in code below) are:

0) Client sends the request to the primary.
1) Primary proposes a sequence number and multicasts the proposal to the backups.
2) Each backup sanity-checks the proposed sequence number and, if it is acceptable, multicasts its own "prepare" message.
   - "Sane" here mostly means that no other request has been seen with the same sequence number.
3) If a backup has sent a "prepare" and sees >= 2f matching "prepares", it knows the sequence number is good and can be locally fixed. It then multicasts a "commit" message.
   - Since there are then 2f+1 matching "prepares" (its own included), we know >= f+1 non-faulty nodes agree on the ordering. By the majority property, no other ordering can be committed anywhere else.
   - Note that "locally fixing" is my term, and is different from the "committed-local" predicate in the paper.
4) Once a backup sees 2f+1 "commits" (possibly, but in theory not necessarily, including its own), it knows that its view of the world has been accepted by a majority of the non-faulty nodes. At this point the predicate "committed-local" becomes true, and the replica can schedule execution of the request as soon as all earlier requests have executed.
5) After executing the request, each replica independently replies to the client with the result (or we can save some bandwidth by having only the primary return the result and the backups send only a digest).
6) The client knows that the request has succeeded when it sees matching replies from f+1 replicas.
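To make the quorum arithmetic concrete, here is a minimal sketch of the bookkeeping a replica might do for steps 2 through 4. All the names (Replica, on_prepare, locally_fixed, and so on) are mine, invented for exposition; the paper's actual library looks nothing like this, and this sketch assumes message authentication and view management happen elsewhere:

    from collections import defaultdict

    class Replica:
        """Toy model of one replica's normal-case bookkeeping (steps 2-4)."""

        def __init__(self, my_id: int, f: int):
            self.id = my_id
            self.f = f                            # faults tolerated; n = 3f + 1
            self.accepted = {}                    # seq -> digest from the pre-prepare
            self.prepares = defaultdict(set)      # (seq, digest) -> ids that prepared
            self.commits = defaultdict(set)       # (seq, digest) -> ids that committed
            self.locally_fixed = set()            # seqs with a full prepare quorum
            self.committed_local = set()          # seqs safe to execute, in order

        def on_pre_prepare(self, seq: int, digest: str) -> bool:
            # Step 2: sanity check -- reject if a *different* request already
            # claimed this sequence number (get() defaults to the new digest,
            # so an unseen or matching entry is accepted).
            if self.accepted.get(seq, digest) != digest:
                return False
            self.accepted[seq] = digest
            self.prepares[(seq, digest)].add(self.id)  # count our own prepare
            return True                                # caller multicasts PREPARE

        def on_prepare(self, sender: int, seq: int, digest: str) -> None:
            self.prepares[(seq, digest)].add(sender)
            # Step 3: our own prepare plus 2f matching ones gives 2f + 1
            # votes, so at least f + 1 non-faulty replicas agree on the order.
            if (len(self.prepares[(seq, digest)]) >= 2 * self.f + 1
                    and seq not in self.locally_fixed):
                self.locally_fixed.add(seq)            # "locally fixed"
                self.commits[(seq, digest)].add(self.id)
                # ... caller multicasts COMMIT here ...

        def on_commit(self, sender: int, seq: int, digest: str) -> None:
            self.commits[(seq, digest)].add(sender)
            # Step 4: 2f + 1 matching commits make the request committed-local;
            # it runs once every lower-numbered request has executed.
            if len(self.commits[(seq, digest)]) >= 2 * self.f + 1:
                self.committed_local.add(seq)

With f = 1 (four replicas), for example, a request becomes committed-local at a replica only after three matching commits, which is exactly the 2f+1 quorum of step 4.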
"committed-local" is, of course, known within the system, but curiously it is a stronger predicate than "committed" ("committed-local" at any non-faultly replica implies "committed", but the reverse is not true). By definition, a request is "committed" at the moment that f+1 non-faulty replicas have entered the "prepared" state. At this moment the Gods know that no other request can be assigned that sequence number, so in some sense the issue is fully decided and we need only wait for everyone in the system to learn the truth. The system uses a view-change protocol to deal with failed primaries. This protocol depends liberally on the properties of public-key crypto, using in multiple places the technique of sending out a presumptive message declaring some property, along with a set of signed messages from other replicas backing up the assertion. The entire process is triggered when a replica times out waiting for a hole to be filled in in the sequence number space so that it can execute some waiting higher-numbered operation that is committed-local. To initiate a view-change, a replica proposes the new view number, offers proof of the most recent checkpoint in the sequence space, and proof of the "prepare" messages for each locally fixed request not included in the checkpoint. When the primary for the new view (whose identity is a deterministic function of the view number) sees sufficient support for the change, it issues a "new-view" message, which mentions both the checkpoint basis and the list of all requests following that checkpoint. This list includes all the locally fixed requests mentioned in the "view-change" messages, with the holes filled in with dummies. Practical issues (i.e. why this is a systems paper and not a theory paper): Note that checkpointing itself is a practical issue. In theory the view-change algorithm could always work from scratch, but then both the storage requirements (even without failures) and the communication needed to handle a view change would grow linearly with time as the system ran. Checkpointing limits both storage and view-change communication to known values. One simple attack available to a faulty primary would be to assign outrageously large sequence numbers to new requests. This would not affect the theoretical safety of the system, because a view-change will occur when the replicas can't fill the enormous hole left by the request, but it can rapidly exhaust the sequence number space, because the new primary will have to fill the hole with nulls. The system deals with this by restricting the maximum hole size using a "high-water-mark" on the sequence number space. Replicas reject outright any request with too large of a sequence number. Although the view-change algorithm requires public-key crypto (or a trusted online party to simulate public-key signatures), because it relies on messages of the form "A says B said X", the basic algorithm requires only pairwise authentication. As a result, the system is implemented with pairwise shared secrets and message authentication codes, which are substantially faster to compute. The system is not only implemented, but implemented as a library, with a simple client-side interface ("invoke") and only a slightly more complicated server-side interface (mostly a "execute" upcall, but also a bunch of midrange complication to deal with the necessary evil of checkpointing). 
BFS results:

Finally, the paper describes an implementation, using the replication library, of BFS, an NFS client and server that tolerates Byzantine failures within the server population transparently to the clients. Using memory-mapped files for "stable" storage, they show decent enough results, suggesting that the overhead of tolerating Byzantine faults is no worse than the difference between memory and disk (although stating it that way makes it sound pretty dismal indeed). It is not clear how much worse using real disks would actually make BFS, though, because it seems that much of the disk time could be overlapped with the protocol operations, especially if we could segregate the disk so that speculative operations are written to stable storage immediately and whether they were accepted by the Byzantine protocol is recorded later.
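To make that speculation concrete, here is one way such a segregated speculative log might look. This is purely my sketch of the idea above, not anything the paper implements; SpeculativeLog and its record format are invented for illustration:

    import json
    import os

    class SpeculativeLog:
        """Append-only log: push requests to disk while the commit phase is
        still in flight, and record the verdict afterwards. On recovery, a
        speculative record with no matching verdict is simply discarded."""

        def __init__(self, path: str):
            self.log = open(path, "ab")

        def _append(self, record: dict) -> None:
            self.log.write(json.dumps(record).encode() + b"\n")
            self.log.flush()
            os.fsync(self.log.fileno())           # force the record to disk

        def write_speculative(self, seq: int, request: bytes) -> None:
            # Issued as soon as the request is prepared, so the disk write
            # overlaps the commit round of the protocol.
            self._append({"seq": seq, "req": request.hex(), "state": "speculative"})

        def resolve(self, seq: int, accepted: bool) -> None:
            # A tiny verdict record once the protocol accepts (or rejects) it.
            self._append({"seq": seq, "state": "accepted" if accepted else "rejected"})

The verdict write is small and sequential, so most of the disk cost lands in write_speculative, which is exactly the part hidden behind the protocol latency.

Andy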