Fault Tolerant services using state machine approach


This papaer talks mainly of distributed systems as client server systems and 
how to implement the fault tolerance using state machine approach. The 
instance of state machine is defined as a state (from a fixed set of states) 
and an associated data. The state transition and the output of a state 
machine should only depend on the input. This may not be true in real time 
systems.

The faults are categorized into Fail Stop and Byzentine. Fail stop is 
detectable by other components of the system and Byzentine faults are 
difficult to detect. To achieve t fault toleranant state machine there has 
to at least t+1 replicas of the state machines to deal only with fail stop 
faults and 2t+1 replicas to deal with Byzentine failure. All the replicas 
need to be in the same state so that if one of them go bad, another can 
immidiately replace it. For all the replicas to be in the same state all of 
them need to get the same set of requests and process them in same relative 
order.

The ordering of events can be implemented by using a an arbitrator which is 
used by every body to get the next sequence number and the requests sent 
(events) to a state machine are executed in the order of the sequence 
numbers. The ordering of events can also be implemented by using logical 
clock proposed by lamport. The advantage of logical clock is, the 
implementation of ordering is ditributed. A request is considered stable at 
a replica, if every non faulty processor has sent a request with higher 
sequence number. For example, if a replica gets a message with sequence 
number 10 and then it gets messages from all non faulty processors with 
sequence number greater than 10, the request with sequence number 10 is 
considered as stable, because it can not be the last message from a stop 
failed processor.

Real time clocks can be used generate unique identifiers if the clocks ticks 
are faster than the rate at which events happen on a given processor and the 
time taken to propagate a message between two processors is greater than the 
drift between the clocks of those two processors.

The general idea behind handling stop failuers are, if one gets an output 
from any of the state machines it is considered as the correct answer. The 
geral idea behind handling byzentine failures is, one has to get identical 
output fom t+1 state machines to decide on the correct output.

For handling client failures and preventing one faulty client to corrupt the 
state of the state machine thus causing a chain of failures can be prevented 
if the clients are replicated. Replicating clients results in a need for the 
state machine to handle multiple requests from one client instead of one 
request. These requests will differ in the sequence number and might also 
differ in the content. To convert many requests from the same client into 
one request a voter should be running on behalf of each replica of the state 
machine. The consolidation of different requests into one request by the 
voter is easier said than done. In practise it is often the case that, state 
machine puts cirtain restrictions which prevents a faulty client to send a 
wrong message and corrupt the state of the state machine. State machine can 
assume an action on behalf of a client if the client does not transmit the 
expected message in that interval of time.

If the faulty state machines or clients can be detected, they should be 
removed and the repaired ones should be added. for a t fault tolerant system 
the invarient p(t) - f(t) > t/2 for byzentine fault tolerance and p(t) - 
f(t) > 0 for stop failures. It is an interesting observation that the first 
constraint can be realized by removing faulty processors and the second can 
only be realized by adding the repaired processors.