Practical Byzantine Fault Tolerance
Castro and Liskov

The paper presents a replication algorithm that tolerates Byzantine faults, does not require a synchronous environment, and performs comparably to widely deployed algorithms that do not tolerate Byzantine failures. In addition to describing the algorithm in detail, the paper presents optimizations that improve its performance and compares a file system built on the algorithm (named BFS) against existing file systems such as NFS, again in terms of performance.

The distributed system model assumed by the authors is essentially client-server: state-based servers have several replicas, the environment is asynchronous, messages are not assumed to be delivered in any specific order, and failures may be Byzantine or fail-stop. Messages are cryptographically authenticated to prevent spoofing and replays and to detect corruption. Since the algorithm tolerates Byzantine failures, it is assumed that at most k replicas can fail in a group of 3k+1 replicas.

The algorithm provides two properties: \textit{safety} and \textit{liveness}. The former means that the replicas behave like a centralized implementation (atomicity); the latter guarantees that clients will eventually receive replies from replicas. For liveness the algorithm assumes some synchrony in the communication between clients and servers: clients must be able to detect that a request to a replica has timed out, and then retry the request.

The algorithm works as follows:

* First we define two notions used by the algorithm: the primary replica and views. The former is the replica that receives requests from clients.
  The latter indicates which replica is currently the primary (information shared by all the replicas as well as the clients).
* Assuming the primary replica does not fail:
  - The client sends the request to the primary P.
  - Upon receiving the request, P either multicasts it to the other replicas or buffers it if too many requests are already being processed.
  - When P can multicast the request, it does so by initiating a three-phase protocol. The phases are pre-prepare, prepare, and commit.
  - During pre-prepare, P securely multicasts the request (using digests and message authentication codes) to all the replicas.
  - After validating the correctness of the pre-prepare message (that is, after validating its authentication, its view, and its sequence number), each replica accepts it and multicasts a prepare message to all the other replicas.
  - A replica becomes prepared once at least 2k valid prepare messages are received from different replicas. It then multicasts a commit message.
  - A replica commits when at least 2k+1 valid commit messages have been received from different replicas.
  - At that point the replica executes the request and sends the result to the client.
  - The client waits for k+1 identical responses (to the same request) to arrive; it then concludes that this is the response from the non-faulty replicas.
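The normal-case quorum counting described above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: message validation, views, and sequence numbers are omitted, and the `Replica` class and its names are hypothetical.

```python
class Replica:
    """Minimal sketch of a backup replica counting prepare/commit
    messages for a single request, with k faulty replicas tolerated
    out of n = 3k + 1 (thresholds as described in the review above)."""

    def __init__(self, k: int):
        self.k = k
        self.prepares = set()   # ids of replicas that sent a matching prepare
        self.commits = set()    # ids of replicas that sent a matching commit
        self.prepared = False
        self.committed = False

    def on_prepare(self, sender: int) -> None:
        self.prepares.add(sender)
        if len(self.prepares) >= 2 * self.k:
            self.prepared = True   # would now multicast a commit message

    def on_commit(self, sender: int) -> None:
        self.commits.add(sender)
        if self.prepared and len(self.commits) >= 2 * self.k + 1:
            self.committed = True  # would now execute the request and reply
```

With k = 1 (four replicas), a replica needs 2 matching prepares to become prepared and 3 matching commits to execute the request; the client, in turn, waits for k + 1 = 2 identical replies.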
* Assuming the primary replica fails:
  - A view change is triggered by one or more of the other replicas when a request to the primary times out.
  - Each non-primary replica multicasts a view-change message, sending along with it the last stable checkpoint identified by that replica.
  - When the prospective new primary (the replica \textit{v mod (3k+1)}, where v is the new view number) receives 2k valid view-change messages, it multicasts a new-view message along with a set of pre-prepare messages for sequence numbers later than the last stable checkpoint.
  - Replicas accept the new-view message (if it is correct) and multicast a prepare message for each pre-prepare message sent by the new primary.
  - From this point on the algorithm proceeds through the normal flow with the newly selected primary.

The paper proves the correctness of this algorithm and provides several optimizations, such as using message authentication codes instead of digital signatures, to boost its performance. To compare this algorithm with existing replication algorithms, the authors implement a file system and compare it with standard NFS (which does not tolerate Byzantine failures), concluding that their system takes only 3\% more time to run than NFS.

The paper does not take network partitions into account and assumes that nodes fail independently; it does not suggest approaches for dealing with such scenarios either. It is also not clear how the first primary is decided by the system in case of (for instance) a complete loss of state by all the replicas, nor what happens if the primary exhibits Byzantine behavior -- the paper only mentions that view changes happen when the ``primary fails'', but it does not specify what kind of failure this could be.

The paper presents a new algorithm that tolerates Byzantine failures in an asynchronous environment. It still relies on synchrony to provide liveness; otherwise it would solve the consensus problem in a fully asynchronous environment, which is impossible.
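The view-change rules above reduce to two small computations, sketched here under the review's notation (function names are illustrative, not from the paper):

```python
def primary_for_view(v: int, k: int) -> int:
    """The primary of view v is replica number v mod (3k+1),
    as stated in the view-change description above."""
    return v % (3 * k + 1)


def new_view_ready(view_change_senders, k: int) -> bool:
    """The prospective primary may multicast a new-view message once it
    has received 2k valid view-change messages from distinct replicas."""
    return len(set(view_change_senders)) >= 2 * k
```

For example, with k = 1 (replicas 0..3), the primary of view 5 is replica 1, and two distinct view-change messages suffice to start the new view.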
This result is important in today's world where environments like the Internet are asynchronous and susceptible to Byzantine failures.