Lecture 3: Correctness of Distributed Systems
What guarantees does our RPC protocol make?
Two important properties:
- If a client receives a response, then that response is "correct".
- (What does "correct" mean? For now, let's assume the RPC is
stateless. Then correctness means the server executed the
function we asked for and sent back the right return value. We
will handle stateful correctness later.)
- When a client sends a request, it will eventually get a response to
that request.
These two properties have very different flavors. Property 1 is what
is known as a "safety property", while Property 2 is a "liveness
property".
When are safety and liveness properties false?
The distinction between safety and liveness is important in the theory
of distributed systems. A good way to think about this distinction is
to ask: what does it mean for each kind of property to be false?
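- A safety property can be violated by a finite sequence of events
starting from the initial state. Once the bad thing has happened, no
continuation of the execution can undo it.
- A liveness property can only be violated by an infinite sequence of
events. Any finite execution might still be extended so that the good
thing happens.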
The slogans are:
- Safety: bad things never happen
- Liveness: good things eventually happen
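To make the finite/infinite distinction concrete, here is a minimal sketch in Python. The event encoding, the function `expected`, and the helper names are all hypothetical, chosen only for illustration; they are not part of the RPC protocol from lecture. The sketch checks Property 1 on a finite trace and shows why a finite trace alone cannot refute Property 2.

```python
from typing import List, Optional, Set, Tuple

# An event is (kind, request id, value). This encoding is an assumption
# made for this sketch, not something defined by the protocol.
Event = Tuple[str, int, Optional[int]]


def expected(request_id: int) -> int:
    """Hypothetical stand-in for the stateless function the server runs."""
    return 2 * request_id


def first_safety_violation(trace: List[Event]) -> Optional[int]:
    """Return the index of the first incorrect response, or None.

    A violation is witnessed by a finite prefix of the trace: nothing
    that happens after that index can repair it.
    """
    for i, (kind, rid, value) in enumerate(trace):
        if kind == "response" and value != expected(rid):
            return i
    return None


def pending_requests(trace: List[Event]) -> Set[int]:
    """Requests that have no response yet in this finite trace.

    A non-empty result is NOT a liveness violation: the trace could
    still be extended with the missing responses.
    """
    answered = {rid for kind, rid, _ in trace if kind == "response"}
    return {rid for kind, rid, _ in trace
            if kind == "request" and rid not in answered}


# Safety: the wrong answer (3 instead of 2) is a violation after two events.
bad = [("request", 1, None), ("response", 1, 3)]

# Liveness: a pending request in a finite trace proves nothing by itself.
unfinished = [("request", 1, None)]
extended = unfinished + [("response", 1, 2)]
```

Here `first_safety_violation(bad)` points at the bad response, and it keeps doing so however `bad` is extended. By contrast, `pending_requests(unfinished)` is non-empty, yet `extended` shows a continuation in which the request is answered. Only an infinite execution in which the response never arrives would violate liveness.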
When are safety and liveness properties true?
We'll focus on safety.
- To prove a safety property, you need to consider each possible
finite execution and show that it does not violate safety.
- Challenge: In all but the most trivial systems, there are infinitely
many finite executions! So we can't just enumerate them and check each one.
- Question: How can we get guarantees about all (infinitely many) finite
executions by doing only a finite amount of reasoning work?
- Answer: Mathematical proof. Specifically, induction.
Recall: Mathematical induction
- The most basic form of induction is induction on the natural numbers,
also known as "mathematical" induction.
- Mathematical induction allows us to prove a claim of the form \(\forall n.\, P(n)\).
- Notice this is again an infinite number of claims: \(P(0)\), \(P(1)\), \(P(2)\), etc.
- So we can't just prove them all by enumerating all the natural
numbers—there are infinitely many natural numbers!
- Instead, induction tells us to prove:
- (base case) \(P(0)\)
- (inductive case) For every \(k\), if \(P(k)\), then \(P(k+1)\).
- This finite amount of work allows us to conclude the infinitely many
facts \(\forall n.\, P(n)\).
- Why? Well, suppose you want to know that \(P(100)\) is true.
- By the base case, \(P(0)\) is true.
- Then by one application of the inductive case, it follows that \(P(1)\) is true.
- By another application of the inductive case, it follows that \(P(2)\) is true.
- Continuing to apply the inductive case over and over, we
eventually conclude that \(P(100)\) is true.
- Nothing special about 100. The same idea works for any natural number.
- Therefore \(P\) is true for all natural numbers.
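As a worked instance of the recipe above, take \(P(n)\) to be the statement \(\sum_{i=0}^{n} i = n(n+1)/2\). This particular claim is chosen here only for illustration; it is not from the lecture.

\[
\begin{aligned}
\text{(base case)}\quad & \sum_{i=0}^{0} i = 0 = \frac{0 \cdot (0+1)}{2}, \text{ so } P(0) \text{ holds.} \\
\text{(inductive case)}\quad & \text{Assume } P(k)\text{: } \sum_{i=0}^{k} i = \frac{k(k+1)}{2}. \text{ Then} \\
& \sum_{i=0}^{k+1} i = \frac{k(k+1)}{2} + (k+1) = \frac{k(k+1) + 2(k+1)}{2} = \frac{(k+1)(k+2)}{2}, \\
& \text{which is exactly } P(k+1).
\end{aligned}
\]

These two short arguments establish \(P(n)\) for every natural number \(n\), even though no individual \(n\) beyond 0 was checked directly.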
States and Executions
- A state is a global state (of a distributed system). This consists of:
- The local state of each node
- The timers at each node that have been set but not yet fired.
- The set of all messages that have ever been sent. (Whether or not they have been received.)
- An execution is a non-empty (finite or infinite) sequence of states where:
- The first state in the sequence is initial.
- A state is initial if all node-local state is initialized as described
by the protocol, any initial timers have been set, and the network is
empty.
- All adjacent pairs of states \(s_i\) and \(s_{i+1}\) in the sequence are
related by a possible step in our network model.
- Our network model includes:
- Steps that drop messages.
- (We don't need explicit steps to reorder, delay, or duplicate messages, as we will see.)
- Steps that deliver some message that was sent.
- Steps that fire some timer that was set.
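The definitions above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the names `GlobalState`, `is_finite_execution`, `is_initial`, and `can_step` are hypothetical, as is the toy two-node protocol used to exercise them. The step relation is passed in as a parameter rather than spelled out, and only finite executions are checked.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, payload); assumed encoding
Timer = Tuple[str, str]         # (node, timer name); assumed encoding


@dataclass(frozen=True)
class GlobalState:
    local: Tuple[Tuple[str, str], ...]  # local state of each node, as (node, state)
    timers: FrozenSet[Timer]            # timers that are set but not yet fired
    network: FrozenSet[Message]         # every message that has ever been sent


def is_finite_execution(
    states: Sequence[GlobalState],
    is_initial: Callable[[GlobalState], bool],
    can_step: Callable[[GlobalState, GlobalState], bool],
) -> bool:
    """Check the two conditions on a finite, non-empty sequence of states."""
    if not states:
        return False  # executions are non-empty
    if not is_initial(states[0]):
        return False  # the first state must be initial
    # every adjacent pair must be related by a possible step
    return all(can_step(a, b) for a, b in zip(states, states[1:]))


# Toy protocol (hypothetical): a client timer fires and sends a ping,
# then the ping is delivered to the server.
PING: Message = ("client", "server", "ping")
RETRY: Timer = ("client", "retry")

s0 = GlobalState((("client", "idle"), ("server", "idle")),
                 frozenset({RETRY}), frozenset())
s1 = GlobalState((("client", "waiting"), ("server", "idle")),
                 frozenset(), frozenset({PING}))      # after RETRY fires
s2 = GlobalState((("client", "waiting"), ("server", "got_ping")),
                 frozenset(), frozenset({PING}))      # after PING is delivered

ALLOWED_STEPS = {(s0, s1), (s1, s2)}


def is_initial(s: GlobalState) -> bool:
    return s == s0


def can_step(a: GlobalState, b: GlobalState) -> bool:
    return (a, b) in ALLOWED_STEPS
```

With these definitions, `[s0, s1, s2]` is accepted as an execution. `[s1, s2]` is rejected because it does not start in an initial state, and `[s0, s2]` is rejected because no single step relates `s0` to `s2`. Note that `PING` stays in `network` after delivery, matching the convention that the network records every message ever sent.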