Fault Tolerant services using state machine approach This papaer talks mainly of distributed systems as client server systems and how to implement the fault tolerance using state machine approach. The instance of state machine is defined as a state (from a fixed set of states) and an associated data. The state transition and the output of a state machine should only depend on the input. This may not be true in real time systems. The faults are categorized into Fail Stop and Byzentine. Fail stop is detectable by other components of the system and Byzentine faults are difficult to detect. To achieve t fault toleranant state machine there has to at least t+1 replicas of the state machines to deal only with fail stop faults and 2t+1 replicas to deal with Byzentine failure. All the replicas need to be in the same state so that if one of them go bad, another can immidiately replace it. For all the replicas to be in the same state all of them need to get the same set of requests and process them in same relative order. The ordering of events can be implemented by using a an arbitrator which is used by every body to get the next sequence number and the requests sent (events) to a state machine are executed in the order of the sequence numbers. The ordering of events can also be implemented by using logical clock proposed by lamport. The advantage of logical clock is, the implementation of ordering is ditributed. A request is considered stable at a replica, if every non faulty processor has sent a request with higher sequence number. For example, if a replica gets a message with sequence number 10 and then it gets messages from all non faulty processors with sequence number greater than 10, the request with sequence number 10 is considered as stable, because it can not be the last message from a stop failed processor. Real time clocks can be used generate unique identifiers if the clocks ticks are faster than the rate at which events happen on a given processor and the time taken to propagate a message between two processors is greater than the drift between the clocks of those two processors. The general idea behind handling stop failuers are, if one gets an output from any of the state machines it is considered as the correct answer. The geral idea behind handling byzentine failures is, one has to get identical output fom t+1 state machines to decide on the correct output. For handling client failures and preventing one faulty client to corrupt the state of the state machine thus causing a chain of failures can be prevented if the clients are replicated. Replicating clients results in a need for the state machine to handle multiple requests from one client instead of one request. These requests will differ in the sequence number and might also differ in the content. To convert many requests from the same client into one request a voter should be running on behalf of each replica of the state machine. The consolidation of different requests into one request by the voter is easier said than done. In practise it is often the case that, state machine puts cirtain restrictions which prevents a faulty client to send a wrong message and corrupt the state of the state machine. State machine can assume an action on behalf of a client if the client does not transmit the expected message in that interval of time. If the faulty state machines or clients can be detected, they should be removed and the repaired ones should be added. for a t fault tolerant system the invarient p(t) - f(t) > t/2 for byzentine fault tolerance and p(t) - f(t) > 0 for stop failures. It is an interesting observation that the first constraint can be realized by removing faulty processors and the second can only be realized by adding the repaired processors.