Practical Byzantine Fault Tolerance
Castro and Liskov

The paper presents a replication algorithm that tolerates Byzantine faults, does not require a synchronous environment, and performs comparably to widely deployed algorithms that do not tolerate Byzantine failures. In addition to describing the algorithm in detail, the paper presents optimizations that improve its performance and compares a file system built on the algorithm (named BFS) against existing file systems such as NFS, again in terms of performance.

The distributed system model assumed by the authors is essentially client-server: state-based servers have several replicas, the environment is asynchronous, messages are not assumed to be delivered in any specific order, and failures may be Byzantine or fail-stop. Messages are cryptographically authenticated to prevent spoofing and replays and to detect corruption. Since the algorithm tolerates Byzantine failures, it is assumed that at most k replicas can fail in a group of 3k+1 replicas.

The algorithm provides two properties: \textit{safety} and \textit{liveness}. The former means that the replicas behave like a centralized implementation (atomicity); the latter guarantees that clients will eventually receive replies from replicas. For liveness the algorithm assumes some synchrony in the communication between clients and servers: clients must be able to detect that a request to a replica has timed out, and then retry the request.

The algorithm works as follows:

* First we define two notions used by the algorithm: the primary replica and views. The former is the replica that receives requests from clients.
  The latter indicates which replica is currently the primary (information shared by all the replicas as well as the clients).
* Assuming the primary replica does not fail:
  - The client sends the request to the primary P.
  - Upon receiving the request, P either multicasts it to the other replicas or buffers it if too many requests are already being processed.
  - When P can multicast the request, it does so by initiating a three-phase protocol. The phases are pre-prepare, prepare, and commit.
  - During pre-prepare, P securely multicasts the request (using digests and message authentication codes) to all the replicas.
  - After validating the correctness of the pre-prepare message (that is, after validating its authentication, its view, and its sequence number), each replica accepts it and multicasts a prepare message to all the other replicas.
  - A replica becomes prepared once at least 2k valid prepare messages are received from different replicas. It then multicasts a commit message.
  - A replica commits when at least 2k+1 valid commit messages have been received from different replicas.
  - At that point the replica executes the request and sends the result to the client.
  - The client waits for k+1 identical responses (to the same request) to arrive; it then concludes that this is the response from the non-faulty replicas.
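The normal-case quorum counting described above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: message validation, views, and sequence numbers are omitted, and the `Replica` class and its names are hypothetical.

```python
class Replica:
    """Minimal sketch of a backup replica counting prepare/commit
    messages for a single request, with k faulty replicas tolerated
    out of n = 3k + 1 (thresholds as described in the review above)."""

    def __init__(self, k: int):
        self.k = k
        self.prepares = set()   # ids of replicas that sent a matching prepare
        self.commits = set()    # ids of replicas that sent a matching commit
        self.prepared = False
        self.committed = False

    def on_prepare(self, sender: int) -> None:
        self.prepares.add(sender)
        if len(self.prepares) >= 2 * self.k:
            self.prepared = True   # would now multicast a commit message

    def on_commit(self, sender: int) -> None:
        self.commits.add(sender)
        if self.prepared and len(self.commits) >= 2 * self.k + 1:
            self.committed = True  # would now execute the request and reply
```

With k = 1 (four replicas), a replica needs 2 matching prepares to become prepared and 3 matching commits to execute the request; the client, in turn, waits for k + 1 = 2 identical replies.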
* Assuming the primary replica fails:
  - A view change is triggered by one or more of the other replicas when a request to the primary times out.
  - Each non-primary replica multicasts a view-change message, sending along with it the last stable checkpoint identified by that replica.
  - When the prospective new primary (the replica \textit{v mod (3k+1)}, where v is the new view number) receives 2k valid view-change messages, it multicasts a new-view message along with a set of pre-prepare messages for sequence numbers later than the last stable checkpoint.
  - Replicas accept the new-view message (if it is correct) and multicast a prepare message for each pre-prepare message sent by the new primary.
  - From this point on the algorithm proceeds through the normal flow with the newly selected primary.

The paper proves the correctness of this algorithm and provides several optimizations, such as using message authentication codes instead of digital signatures, to boost its performance. To compare this algorithm with existing replication algorithms, the authors implement a file system and compare it with standard NFS (which does not tolerate Byzantine failures), concluding that their system takes only 3\% more time to run than NFS.

The paper does not take network partitions into account and assumes that nodes fail independently; it does not suggest approaches for dealing with such scenarios either. It is also not clear how the first primary is decided by the system in case of (for instance) a complete loss of state by all the replicas, nor what happens if the primary exhibits Byzantine behavior -- the paper only mentions that view changes happen when the ``primary fails'', but it does not specify what kind of failure this could be.

The paper presents a new algorithm that tolerates Byzantine failures in an asynchronous environment. It still relies on synchrony to provide liveness; otherwise it would solve the consensus problem in a fully asynchronous environment, which is impossible.
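The view-change rules above reduce to two small computations, sketched here under the review's notation (function names are illustrative, not from the paper):

```python
def primary_for_view(v: int, k: int) -> int:
    """The primary of view v is replica number v mod (3k+1),
    as stated in the view-change description above."""
    return v % (3 * k + 1)


def new_view_ready(view_change_senders, k: int) -> bool:
    """The prospective primary may multicast a new-view message once it
    has received 2k valid view-change messages from distinct replicas."""
    return len(set(view_change_senders)) >= 2 * k
```

For example, with k = 1 (replicas 0..3), the primary of view 5 is replica 1, and two distinct view-change messages suffice to start the new view.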
This result is important in today's world where environments like the Internet are asynchronous and susceptible to Byzantine failures.