Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Schneider

Whew. My initial reaction to this work is that it probably tries to take on too large a chunk of the space to work with effectively. The most obvious example of this is the continued parallel structure between the Byzantine and fail-stop models. Fortunately most of the difference can be factored out (e.g. in most cases it comes down to 2t+1 replicas with voting for Byzantine vs. t+1 with any-response for fail-stop), but it's a lot to chew at once.

The high-level goal is to keep a set of replicated state machines operating together, so that the system as a whole can tolerate failures (of both kinds) of some of the replicas. There is an interesting interplay between several kinds of entities in the system: servers (the relatively simple replicated state machines), clients (where there are restrictions on replication), sensors (which read real-world values and tend to screw everything up as a result), and actuators (which affect the real world, and for which it can be very difficult to handle failures). As near as I can figure, the work is really focused on the servers, and everything else is tossed in for completeness; most of that I had to ignore to get at the real meat, either because I felt the issues were vacuous or because they were better described as assumptions. But the basic algorithm, which keeps the servers together while ignoring the implications of client failures and real-world interactions, is interesting in how it ties together lots of other work we've seen.

We start by defining the (obvious in retrospect) conditions under which a set of state machines will compute in lockstep:

  "Agreement": every replica sees the same requests, and
  "Order": those requests are processed in the same order.

Note that all bets are off for faulty replicas, because it isn't even worth thinking about the requests they see or the order they process them in; by definition a faulty replica isn't even running the proper state machine. The real observation is that, if we aren't too concerned with efficiency and can make the required assumptions, we already know how to achieve both of these conditions. The solutions to the Byzantine generals problem give us agreement, and Lamport's logical clocks give us order (he also examines a couple of other schemes using loosely synchronized real-time clocks). Of course the efficiency and assumption issues cannot be swept aside, especially for the Byzantine generals solutions, which assume bounded message delay and even so require an outrageous amount of traffic. But in return we get some much stronger results.
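
To keep the two fault models straight in my head, here's a tiny Python sketch (my own toy model, not anything from the paper) of the client-side difference: with fail-stop replicas any single reply can be trusted, so t+1 replicas suffice, while with Byzantine replicas you need matching replies from a majority of 2t+1.

    from collections import Counter

    def replicas_needed(t: int, byzantine: bool) -> int:
        """Replicas required to tolerate t faults under each model."""
        return 2 * t + 1 if byzantine else t + 1

    def combine_replies(replies, t: int, byzantine: bool):
        """Reduce a list of replica replies to a single answer."""
        if not byzantine:
            # Fail-stop replicas halt rather than lie, so any reply is good.
            return replies[0]
        # Byzantine replicas can lie, so insist on t+1 matching replies:
        # at least one of them must have come from a non-faulty replica.
        value, count = Counter(replies).most_common(1)[0]
        if count >= t + 1:
            return value
        raise RuntimeError("no value with t+1 matching replies")

    # Tolerating t=1 fault:
    assert replicas_needed(1, byzantine=False) == 2
    assert replicas_needed(1, byzantine=True) == 3
    assert combine_replies(["ok", "ok", "bad"], t=1, byzantine=True) == "ok"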
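
The Order half, as I understand it, comes down to tagging every request with a unique (logical timestamp, client id) pair and only executing a request once it is stable, i.e. once no request with a smaller id can still arrive. Another toy sketch of mine (not the paper's protocol), assuming FIFO channels and a deliberately simple stability test: a request is stable once every client has been seen at that timestamp or later.

    import heapq

    class Replica:
        def __init__(self, clients):
            self.pending = []                      # min-heap of (ts, client, request)
            self.latest = {c: 0 for c in clients}  # highest timestamp seen per client

        def receive(self, ts, client, request):
            heapq.heappush(self.pending, (ts, client, request))
            self.latest[client] = max(self.latest[client], ts)

        def stable_requests(self):
            # Execute in (ts, client) order, but only once no correct client
            # can still send something that would sort earlier (FIFO channels
            # plus strictly increasing per-client timestamps make this safe).
            while self.pending:
                ts, client, request = self.pending[0]
                if all(seen >= ts for seen in self.latest.values()):
                    yield heapq.heappop(self.pending)[2]
                else:
                    break

    # Requests arriving out of order still execute in the same order everywhere.
    r = Replica(clients=["a", "b"])
    r.receive(2, "b", "write y")
    r.receive(1, "a", "write x")
    r.receive(3, "a", "read y")
    print(list(r.stable_requests()))  # ['write x', 'write y']

The real schemes are hairier (clients have to keep the clocks advancing, and the Byzantine case needs the agreement machinery underneath every request), but this was enough to convince me the Order condition is mechanically achievable.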

On that subject, I thought the discussion of the relative merits of t-fault tolerance and MTBF as metrics of fault tolerance was an important thing to understand. In many ways MTBF is the typical systems approach, in that it is something that can be (more or less) measured for a system and that it describes an end-user notion of how reliable the system is. t-fault tolerance, on the other hand, is useful precisely because it factors out the whole issue of how likely failures are, with two important consequences. First, as Schneider points out, it characterizes a property of the *system* itself, independent of the pieces from which it is built. But it is also much more useful than MTBF in guiding design: from t-fault tolerance we learn how reliable the pieces need to be to achieve any user-specified reliability, and we gain much more insight into how to make the system more reliable by either metric.

I did find the whole discussion of faulty clients, both in their roles as readers of outputs and as generators of requests, to be rather vacuous. As I understand it, a voting system fundamentally depends on a non-faulty voter somewhere. Voters inside the system can be replicated, but eventually some client or output device has to add up the final tally (from either the replicated state machines or the replicated voters), and the only guarantee we get is fate-sharing (which is good enough). Similarly, if our definition of correctness requires preventing faulty clients from issuing bad requests (where a request is "bad" not because of some intrinsic property but simply because a non-faulty client would not have issued it at that moment), then it seems like we're pretty well stuck. The paper discusses replicating the client, but this is likely either totally impossible (because the system requires that the client be in one place, tied to an individual principal) or essentially impossible (because the client has to generate requests based on looking at the real world, and therefore the replicas cannot be assumed to generate identical request streams). The only other solution (which is what we typically see done) is to define away the issue by implementing access control, and defining correct behavior to be whatever the clients asked for, as long as it obeys the access control list (the "defensive programming" section is actually a little stronger, because it also allows more arbitrary application-specific rules to be enforced by the servers, but the application-specific part is important).

Finally, the paper looks at reconfiguration, which is the design of self-healing systems, where failed nodes are isolated, removed, and replaced with new nodes to keep the system running. Properly designed, such a system can be described in terms of the *rate* of failures it can tolerate, rather than a count of failures. The basic tricks are recognizing failures in the Byzantine case (being careful that a faulty node can't implicate any (or at least not too many) non-faulty nodes and have them removed), and bringing new nodes into the system so that no requests are missed.

It seems to me that the relevance today has ended up being tied to the special cases. Most focus is still on the fail-stop case, typically using application semantics (particularly commutativity of requests and read-only requests, as mentioned) to reduce the overhead required to achieve consistency. Using this we can tell roughly how big a hit we'll take, both in time and in extra nodes.

Andy