Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Schneider

Whew. My initial reaction to this work is that it probably tries to take on too large a chunk of the space to work with effectively. The most obvious example of this is the continued parallel structure between the Byzantine and fail-stop models. Fortunately most of the difference can be factored out (e.g. in most cases it comes down to 2t+1 replicas with voting for Byzantine vs. t+1 with any-response for fail-stop), but it's a lot to chew at once.

The high-level goal is to keep a set of replicated state machines operating together, so that the system as a whole can tolerate failures (of both kinds) of some of the replicas. There is an interesting interplay between several kinds of entities in the system: servers (the relatively simple replicated state machines), clients (where there are restrictions on replication), sensors (which read real-world values and tend to screw everything up as a result), and actuators (which affect the real world, and for which it can be very difficult to handle failures). As near as I can figure, the work is really focused on the servers, and everything else is tossed in for completeness; most of that I had to ignore to get at the real meat, either because I felt the issues were vacuous or because they were better described as assumptions. But the basic algorithm, which keeps the servers together while ignoring the implications of client failures and real-world interactions, is interesting in how it ties together lots of other work we've seen.

We start by defining the (obvious in retrospect) conditions under which a set of state machines will compute in lockstep:

  "Agreement": every replica sees the same requests, and
  "Order": those requests are processed in the same order.

Note that all bets are off for faulty replicas, because it isn't even worth thinking about the requests they see or the order they process them in; by definition a faulty replica isn't even running the proper state machine. The real observation is that, if we aren't too concerned with efficiency and can make the required assumptions, we already know how to achieve both of these conditions. The solutions to the Byzantine generals problem give us agreement, and Lamport's logical clocks give us order (he also examines a couple of other schemes using loosely synchronized real-time clocks). Of course the efficiency and assumption issues cannot be swept aside, especially for the Byzantine generals solutions, which assume bounded message delay and even so require an outrageous amount of traffic. But in return we get some much stronger results.
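
To keep the two fault models straight in my head, here's a tiny Python sketch (my own toy model, not anything from the paper) of the client-side difference: with fail-stop replicas any single reply can be trusted, so t+1 replicas suffice, while with Byzantine replicas you need matching replies from a majority of 2t+1.

    from collections import Counter

    def replicas_needed(t: int, byzantine: bool) -> int:
        """Replicas required to tolerate t faults under each model."""
        return 2 * t + 1 if byzantine else t + 1

    def combine_replies(replies, t: int, byzantine: bool):
        """Reduce a list of replica replies to a single answer."""
        if not byzantine:
            # Fail-stop replicas halt rather than lie, so any reply is good.
            return replies[0]
        # Byzantine replicas can lie, so insist on t+1 matching replies:
        # at least one of them must have come from a non-faulty replica.
        value, count = Counter(replies).most_common(1)[0]
        if count >= t + 1:
            return value
        raise RuntimeError("no value with t+1 matching replies")

    # Tolerating t=1 fault:
    assert replicas_needed(1, byzantine=False) == 2
    assert replicas_needed(1, byzantine=True) == 3
    assert combine_replies(["ok", "ok", "bad"], t=1, byzantine=True) == "ok"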
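
The Order half, as I understand it, comes down to tagging every request with a unique (logical timestamp, client id) pair and only executing a request once it is stable, i.e. once no request with a smaller id can still arrive. Another toy sketch of mine (not the paper's protocol), assuming FIFO channels and a deliberately simple stability test: a request is stable once every client has been seen at that timestamp or later.

    import heapq

    class Replica:
        def __init__(self, clients):
            self.pending = []                      # min-heap of (ts, client, request)
            self.latest = {c: 0 for c in clients}  # highest timestamp seen per client

        def receive(self, ts, client, request):
            heapq.heappush(self.pending, (ts, client, request))
            self.latest[client] = max(self.latest[client], ts)

        def stable_requests(self):
            # Execute in (ts, client) order, but only once no correct client
            # can still send something that would sort earlier (FIFO channels
            # plus strictly increasing per-client timestamps make this safe).
            while self.pending:
                ts, client, request = self.pending[0]
                if all(seen >= ts for seen in self.latest.values()):
                    yield heapq.heappop(self.pending)[2]
                else:
                    break

    # Requests arriving out of order still execute in the same order everywhere.
    r = Replica(clients=["a", "b"])
    r.receive(2, "b", "write y")
    r.receive(1, "a", "write x")
    r.receive(3, "a", "read y")
    print(list(r.stable_requests()))  # ['write x', 'write y']

The real schemes are hairier (clients have to keep the clocks advancing, and the Byzantine case needs the agreement machinery underneath every request), but this was enough to convince me the Order condition is mechanically achievable.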

On that subject, I thought the discussion of the relative merits of t-fault tolerance and MTBF as metrics of fault tolerance was an important thing to understand. In many ways MTBF is the typical systems approach, in that it is something that can be (more or less) measured for a system and that it describes an end-user notion of how reliable the system is. t-fault tolerance, on the other hand, is useful precisely because it factors out the whole issue of how likely failures are, with two important consequences. First, as Schneider points out, it characterizes a property of the *system* itself, independent of the pieces from which it is built. But it is also much more useful than MTBF in guiding design: from t-fault tolerance we learn how reliable the pieces need to be to achieve any user-specified reliability, and we gain much more insight into how to make the system more reliable by either metric.

I did find the whole discussion of faulty clients, both in their roles as readers of outputs and as generators of requests, to be rather vacuous. As I understand it, a voting system fundamentally depends on a non-faulty voter somewhere. Voters inside the system can be replicated, but eventually some client or output device has to add up the final tally (from either the replicated state machines or the replicated voters), and the only guarantee we get is fate-sharing (which is good enough). Similarly, if our definition of correctness requires preventing faulty clients from issuing bad requests (where a request is "bad" not because of some intrinsic property but simply because a non-faulty client would not have issued it at that moment), then it seems like we're pretty well stuck. The paper discusses replicating the client, but this is likely either totally impossible (because the system requires that the client be in one place, tied to an individual principal) or essentially impossible (because the client has to generate requests based on looking at the real world, and therefore the replicas cannot be assumed to generate identical request streams). The only other solution (which is what we typically see done) is to define away the issue by implementing access control, and defining correct behavior to be whatever the clients asked for, as long as it obeys the access control list (the "defensive programming" section is actually a little stronger, because it also allows more arbitrary application-specific rules to be enforced by the servers, but the application-specific part is important).

Finally, the paper looks at reconfiguration, which is the design of self-healing systems, where failed nodes are isolated, removed, and replaced with new nodes to keep the system running. Properly designed, such a system can be described in terms of the *rate* of failures it can tolerate, rather than a count of failures. The basic tricks are recognizing failures in the Byzantine case (being careful that a faulty node can't implicate any (or at least not too many) non-faulty nodes and have them removed), and bringing new nodes into the system so that no requests are missed.

It seems to me that the relevance today has ended up being tied to the special cases. Most focus is still on the fail-stop case, typically using application semantics (particularly commutativity of requests and read-only requests, as mentioned) to reduce the overhead required to achieve consistency. Using this we can tell roughly how big a hit we'll take, both in time and in extra nodes.

Andy