Lecture 3: Correctness of Distributed Systems

What guarantees does our RPC protocol make?

Two important properties:

  1. If a client receives a response, then that response is "correct".
    • (What does "correct" mean? For now, let's assume the RPC is stateless. Then correctness means the server executed the function we asked for and sent back the right return value. We will handle stateful correctness later.)
  2. When a client sends a request, it will eventually get a response to that request.

These two properties have very different flavors. Property 1 is what is known as a "safety property", while Property 2 is a "liveness property".

When are safety and liveness properties false?

The distinction between safety and liveness is important in the theory of distributed systems. A good way to think about this distinction is to ask: what does it mean for each kind of property to be false?

  • A safety property can be violated with a finite sequence of events starting from the initial state.
  • A liveness property can only be violated by an infinite sequence of events.

The slogans are:

  • Safety: bad things never happen
  • Liveness: good things eventually happen
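
To make the finite/infinite distinction concrete, here is a minimal sketch. It assumes a trace is a list of (kind, request id, payload) events and that the stateless RPC computes a made-up function f; the event encoding and all names below are hypothetical, not part of the protocol from lecture. A safety violation is pinned down by a finite prefix of the trace. For liveness, a finite prefix can only show that a request is unanswered so far, since the response might still arrive later.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical event encoding: (kind, request id, argument or return value).
Event = Tuple[str, int, int]

def f(x: int) -> int:
    """Hypothetical stateless function the RPC is supposed to compute."""
    return x + 1

def first_safety_violation(trace: List[Event]) -> Optional[int]:
    """Index of the first incorrect response, or None if the prefix is safe."""
    args = {}
    for i, (kind, req_id, payload) in enumerate(trace):
        if kind == "request":
            args[req_id] = payload
        elif kind == "response":
            if req_id not in args or payload != f(args[req_id]):
                return i  # trace[:i + 1] is a finite witness of the violation
    return None

def pending_requests(trace: List[Event]) -> Set[int]:
    """Requests with no response so far in this finite prefix."""
    pending = set()
    for kind, req_id, _ in trace:
        if kind == "request":
            pending.add(req_id)
        elif kind == "response":
            pending.discard(req_id)
    return pending

bad = [("request", 1, 10), ("response", 1, 11),
       ("request", 2, 5), ("response", 2, 7)]
waiting = [("request", 1, 10), ("request", 2, 5), ("response", 1, 11)]

print(first_safety_violation(bad))   # 3
print(pending_requests(waiting))     # {2}
```

Request 2 in the second trace is still pending, but that alone does not violate liveness: any finite prefix can be extended with a matching response.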

When are safety and liveness properties true?

We'll focus on safety.

  • To prove a safety property, you need to consider each possible finite execution and show that it does not violate safety.
  • Challenge: In all but the most trivial systems, there are infinitely many finite executions! So we can't just enumerate them and check each one.
  • Question: How can we get guarantees about all (infinitely many) finite executions by doing only a finite amount of reasoning work?
    • Answer: Mathematical proof. Specifically, induction.

Recall: Mathematical induction

  • The most basic form of induction is induction on the natural numbers, also known as "mathematical" induction.
  • Mathematical induction allows us to prove a claim of the form \(\forall n.\, P(n)\).
  • Notice this is again an infinite number of claims: \(P(0)\), \(P(1)\), \(P(2)\), etc.
  • So we can't just prove them all by enumerating all the natural numbers—there are infinitely many natural numbers!
  • Instead, induction tells us to prove
    • (base case) \(P(0)\)
    • (inductive case) For every \(k\), if \(P(k)\), then \(P(k+1)\).
  • This finite amount of work allows us to conclude the infinite number of facts \(\forall n.\, P(n)\).
    • Why? Well suppose you want to know that \(P(100)\) is true.
      • By the base case, \(P(0)\) is true.
      • Then by one application of the inductive case, it follows that \(P(1)\) is true.
      • By another application of the inductive case, it follows that \(P(2)\) is true.
      • Continuing to apply the inductive case over and over, we eventually conclude that \(P(100)\) is true.
    • Nothing special about 100. The same idea works for any natural number.
    • Therefore \(P\) is true for all natural numbers.
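  • Worked example (the claim is chosen here only for illustration): let \(P(n)\) be the statement \(0 + 1 + \cdots + n = \frac{n(n+1)}{2}\).
    • (base case) \(P(0)\): the left side is \(0\) and the right side is \(\frac{0 \cdot 1}{2} = 0\).
    • (inductive case) Assume \(P(k)\). Then
      \[ 0 + 1 + \cdots + k + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{(k+1)(k+2)}{2}, \]
      which is exactly \(P(k+1)\).
    • So \(\forall n.\, P(n)\) follows from these two finite pieces of work.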

States and Executions

  • A state is a global state (of a distributed system). This consists of:
    • The local state of each node
    • The timers at each node that have been set but not yet fired.
    • The set of all messages that have ever been sent. (Whether or not they have been received.)
  • An execution is a non-empty (finite or infinite) sequence of states where:
    • The first state in the sequence is initial.
      • A state is initial if each node's local state is initialized as described by the protocol, any initial timers have been set, and the network is empty.
    • All adjacent pairs of states \(s_i\) and \(s_{i+1}\) in the sequence are related by a possible step in our network model.
      • Our network model includes:
        • Steps that drop messages.
        • (We don't need explicit steps to reorder, delay, or duplicate messages, as we will see.)
        • Steps that deliver some message that was sent.
        • Steps that fire some timer that was set.
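
The definitions above can be sketched in code. This is a minimal model under stated assumptions, not the course's framework: the toy client/server protocol, the names GlobalState, successors, and is_execution, and the choices that a drop step removes a message from the network while a deliver step leaves it there are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterator, List, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, body)
Timer = Tuple[str, str]         # (node, timer name)

@dataclass(frozen=True)
class GlobalState:
    local: Tuple[Tuple[str, str], ...]  # sorted (node, local state) pairs
    timers: FrozenSet[Timer]            # timers set but not yet fired
    network: FrozenSet[Message]         # messages sent (and not dropped)

# Initial state: local state initialized, initial timer set, network empty.
INITIAL = GlobalState(
    local=(("client", "waiting"), ("server", "idle")),
    timers=frozenset({("client", "retry")}),
    network=frozenset(),
)

def handle_timer(s: GlobalState, t: Timer) -> GlobalState:
    """Toy protocol: the client's retry timer (re)sends the request."""
    timers = s.timers - {t}
    if t == ("client", "retry"):
        timers = timers | {t}  # the handler sets the timer again
        return replace(s, timers=timers,
                       network=s.network | {("client", "server", "request")})
    return replace(s, timers=timers)

def handle_message(s: GlobalState, m: Message) -> GlobalState:
    """Toy protocol: the server answers requests; the client records a response."""
    _, receiver, body = m
    if receiver == "server" and body == "request":
        return replace(s, network=s.network | {("server", "client", "response")})
    if receiver == "client" and body == "response":
        local = dict(s.local)
        local["client"] = "done"
        return replace(s, local=tuple(sorted(local.items())))
    return s

def successors(s: GlobalState) -> Iterator[Tuple[str, GlobalState]]:
    """All steps of the network model that are possible from state s."""
    for m in s.network:
        yield f"drop {m}", replace(s, network=s.network - {m})
    for m in s.network:  # the message stays in the network after delivery
        yield f"deliver {m}", handle_message(s, m)
    for t in s.timers:
        yield f"fire {t}", handle_timer(s, t)

def is_execution(states: List[GlobalState]) -> bool:
    """Non-empty, starts in the initial state, adjacent states related by a step."""
    if not states or states[0] != INITIAL:
        return False
    return all(any(nxt == cand for _, cand in successors(cur))
               for cur, nxt in zip(states, states[1:]))

s1 = handle_timer(INITIAL, ("client", "retry"))
s2 = handle_message(s1, ("client", "server", "request"))
s3 = handle_message(s2, ("server", "client", "response"))
print(is_execution([INITIAL, s1, s2, s3]))  # True
print(is_execution([INITIAL, s3]))          # False
```

The second sequence is rejected because no single step leads from the initial state to s3: with an empty network, the only possible first step is firing the client's timer.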