Practical Byzantine Fault Tolerance
Castro and Liskov

Basic idea: take the essence of the Byzantine Generals algorithm and build an acceptably efficient client-server system that can survive Byzantine failures among the servers. For reasons similar to those in the Schneider paper, the notion of surviving Byzantine client failures is ill-defined, and all we can really do is carefully define what a faulty client can and cannot do to the system.

Fundamentally, the algorithm is yet another incarnation of a replicated state machine, where voting is used to prevent a small number of faulty nodes from overriding the majority (it requires the same 3f+1 total nodes to tolerate f faulty nodes as before), and a total order is established on the events in the system to keep all the non-faulty replicas in lockstep.

Unlike most of the earlier work, this paper assumes a network with potentially unbounded delay. The algorithm guarantees safety even in the face of such an asynchronous network, but does not (and cannot) guarantee liveness unless some upper bound can be placed on network delay. This network model also requires the use of cryptography to establish authentication, but since this is orthogonal to the main protocol I will treat it separately below.

As in the Paxos algorithm, there is a notion of normal-case behavior, where a "primary" node coordinates ordering and manages the consensus algorithm, and a bad-case protocol (used only if the primary appears faulty), where nodes communicate in a more expensive peer-to-peer style to ensure safety.

What I have always liked about this work is that the algorithm is simple, straightforward, and easy to describe. Unlike the recursive Byzantine Generals algorithm, it is easy to see how many steps and messages are required for each operation. The basic steps of the algorithm (sketched in code below) are:

0) Client sends the request to the primary.
1) Primary proposes a sequence number and multicasts the proposal to the backups.
2) Each backup sanity-checks the proposed sequence number and, if it is acceptable, multicasts its own "prepare" message.
   - "Sane" here mostly means that no other request has been seen with the same sequence number.
3) If a backup has sent a "prepare" and sees >= 2f matching "prepares", it knows the sequence number is good and can be locally fixed. It then multicasts a "commit" message.
   - Since there are then 2f+1 matching "prepares" (its own included), we know >= f+1 non-faulty nodes agree on the ordering. By the majority property, no other ordering can be committed anywhere else.
   - Note that "locally fixing" is my term, and is different from the "committed-local" predicate in the paper.
4) Once a backup sees 2f+1 "commits" (possibly, but in theory not necessarily, including its own), it knows that its view of the world has been accepted by a majority of the non-faulty nodes. At this point the predicate "committed-local" becomes true, and the replica can schedule execution of the request as soon as all earlier requests have executed.
5) After executing the request, each replica independently replies to the client with the result (or we can save some bandwidth by having only the primary return the result and the backups send only a digest).
6) The client knows that the request has succeeded when it sees matching replies from f+1 replicas.
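To make the quorum arithmetic concrete, here is a minimal sketch of the bookkeeping a replica might do for steps 2 through 4. All the names (Replica, on_prepare, locally_fixed, and so on) are mine, invented for exposition; the paper's actual library looks nothing like this, and this sketch assumes message authentication and view management happen elsewhere:

    from collections import defaultdict

    class Replica:
        """Toy model of one replica's normal-case bookkeeping (steps 2-4)."""

        def __init__(self, my_id: int, f: int):
            self.id = my_id
            self.f = f                            # faults tolerated; n = 3f + 1
            self.accepted = {}                    # seq -> digest from the pre-prepare
            self.prepares = defaultdict(set)      # (seq, digest) -> ids that prepared
            self.commits = defaultdict(set)       # (seq, digest) -> ids that committed
            self.locally_fixed = set()            # seqs with a full prepare quorum
            self.committed_local = set()          # seqs safe to execute, in order

        def on_pre_prepare(self, seq: int, digest: str) -> bool:
            # Step 2: sanity check -- reject if a *different* request already
            # claimed this sequence number (get() defaults to the new digest,
            # so an unseen or matching entry is accepted).
            if self.accepted.get(seq, digest) != digest:
                return False
            self.accepted[seq] = digest
            self.prepares[(seq, digest)].add(self.id)  # count our own prepare
            return True                                # caller multicasts PREPARE

        def on_prepare(self, sender: int, seq: int, digest: str) -> None:
            self.prepares[(seq, digest)].add(sender)
            # Step 3: our own prepare plus 2f matching ones gives 2f + 1
            # votes, so at least f + 1 non-faulty replicas agree on the order.
            if (len(self.prepares[(seq, digest)]) >= 2 * self.f + 1
                    and seq not in self.locally_fixed):
                self.locally_fixed.add(seq)            # "locally fixed"
                self.commits[(seq, digest)].add(self.id)
                # ... caller multicasts COMMIT here ...

        def on_commit(self, sender: int, seq: int, digest: str) -> None:
            self.commits[(seq, digest)].add(sender)
            # Step 4: 2f + 1 matching commits make the request committed-local;
            # it runs once every lower-numbered request has executed.
            if len(self.commits[(seq, digest)]) >= 2 * self.f + 1:
                self.committed_local.add(seq)

With f = 1 (four replicas), for example, a request becomes committed-local at a replica only after three matching commits, which is exactly the 2f+1 quorum of step 4.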
"committed-local" is, of course, known within the system, but curiously it is a stronger predicate than "committed" ("committed-local" at any non-faultly replica implies "committed", but the reverse is not true). By definition, a request is "committed" at the moment that f+1 non-faulty replicas have entered the "prepared" state. At this moment the Gods know that no other request can be assigned that sequence number, so in some sense the issue is fully decided and we need only wait for everyone in the system to learn the truth. The system uses a view-change protocol to deal with failed primaries. This protocol depends liberally on the properties of public-key crypto, using in multiple places the technique of sending out a presumptive message declaring some property, along with a set of signed messages from other replicas backing up the assertion. The entire process is triggered when a replica times out waiting for a hole to be filled in in the sequence number space so that it can execute some waiting higher-numbered operation that is committed-local. To initiate a view-change, a replica proposes the new view number, offers proof of the most recent checkpoint in the sequence space, and proof of the "prepare" messages for each locally fixed request not included in the checkpoint. When the primary for the new view (whose identity is a deterministic function of the view number) sees sufficient support for the change, it issues a "new-view" message, which mentions both the checkpoint basis and the list of all requests following that checkpoint. This list includes all the locally fixed requests mentioned in the "view-change" messages, with the holes filled in with dummies. Practical issues (i.e. why this is a systems paper and not a theory paper): Note that checkpointing itself is a practical issue. In theory the view-change algorithm could always work from scratch, but then both the storage requirements (even without failures) and the communication needed to handle a view change would grow linearly with time as the system ran. Checkpointing limits both storage and view-change communication to known values. One simple attack available to a faulty primary would be to assign outrageously large sequence numbers to new requests. This would not affect the theoretical safety of the system, because a view-change will occur when the replicas can't fill the enormous hole left by the request, but it can rapidly exhaust the sequence number space, because the new primary will have to fill the hole with nulls. The system deals with this by restricting the maximum hole size using a "high-water-mark" on the sequence number space. Replicas reject outright any request with too large of a sequence number. Although the view-change algorithm requires public-key crypto (or a trusted online party to simulate public-key signatures), because it relies on messages of the form "A says B said X", the basic algorithm requires only pairwise authentication. As a result, the system is implemented with pairwise shared secrets and message authentication codes, which are substantially faster to compute. The system is not only implemented, but implemented as a library, with a simple client-side interface ("invoke") and only a slightly more complicated server-side interface (mostly a "execute" upcall, but also a bunch of midrange complication to deal with the necessary evil of checkpointing). 
BFS results:

Finally, the paper describes an implementation, using the replication library, of BFS, an NFS client and server that tolerates Byzantine failures within the server population transparently to the clients. Using memory-mapped files for "stable" storage, they show decent enough results, suggesting that the overhead of tolerating Byzantine faults is no worse than the difference between memory and disk (although stating it that way makes it sound pretty dismal indeed). It is not clear how much worse using real disks would actually make BFS, though, because it seems that much of the disk time could be overlapped with the protocol operations, especially if we could segregate the disk so that speculative operations are written to stable storage immediately and whether they were accepted by the Byzantine protocol is recorded later.
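To make that speculation concrete, here is one way such a segregated speculative log might look. This is purely my sketch of the idea above, not anything the paper implements; SpeculativeLog and its record format are invented for illustration:

    import json
    import os

    class SpeculativeLog:
        """Append-only log: push requests to disk while the commit phase is
        still in flight, and record the verdict afterwards. On recovery, a
        speculative record with no matching verdict is simply discarded."""

        def __init__(self, path: str):
            self.log = open(path, "ab")

        def _append(self, record: dict) -> None:
            self.log.write(json.dumps(record).encode() + b"\n")
            self.log.flush()
            os.fsync(self.log.fileno())           # force the record to disk

        def write_speculative(self, seq: int, request: bytes) -> None:
            # Issued as soon as the request is prepared, so the disk write
            # overlaps the commit round of the protocol.
            self._append({"seq": seq, "req": request.hex(), "state": "speculative"})

        def resolve(self, seq: int, accepted: bool) -> None:
            # A tiny verdict record once the protocol accepts (or rejects) it.
            self._append({"seq": seq, "state": "accepted" if accepted else "rejected"})

The verdict write is small and sequential, so most of the disk cost lands in write_speculative, which is exactly the part hidden behind the protocol latency.

Andy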