The CAP theorem
Can't have all three of: consistency, availability, tolerance to partitions

CAP theorem was
- proposed by Eric Brewer in a keynote in 2000
  - as a conjecture on how to build reliable distributed systems
- later proven by Gilbert & Lynch [2002]
  - actually making it a theorem
  - but with a specific set of definitions that don't necessarily
    match what you'd assume (or what Brewer meant!)
    - i.e., much less general
- really influential on the design of NoSQL systems
- and really controversial
  - Stonebraker: "the CAP theorem, therefore, encourages engineers to
    make awful decisions"
- usually misinterpreted!

Usual misinterpretation
- pick any two: consistency, availability, partition tolerance
- then, usually: I want my system to be available, so consistency has to go
- or, alternatively: I need my system to be consistent, so it's not
  going to be available
- three possibilities: CP, AP, CA systems

First problem:
- what does it mean to choose or not choose partition tolerance?
  - it's a property of the environment; the other two are goals
- in other words, what's the difference between a "CA" and a "CP"
  system? both give up availability on a partition!
  - note that Brewer's initial statement claimed that this was OK
    (single-site systems)
- better phrasing: if the network can have partitions, do we give up
  on consistency or on availability?

Other problems:
- is it really black and white?
- what does (not) providing consistency mean? what about a weak
  consistency level?
- what does not providing availability mean? does that mean the
  system is always down?
- what if I don't care about network partitions? they're really rare, right?
- alternatively, if this only applies during network partitions, what
  about the rest of the time?

So let's look at a more precise statement:
- if the network is subject to communication failures
  - meaning messages may be delayed and sometimes lost forever
  - includes the possibility of a partition
- then it is impossible to implement a linearizable service
  - in the paper, "atomic read/write shared memory"
- that guarantees a response to every request
  - i.e., any node can respond

Proving this statement
- not especially surprising!
- if there are two nodes, A and B, and they can't communicate
- then a write on A can't affect B
- but subsequent reads on B need to see A's result
- (a toy version of this argument is sketched in code below)

Isn't this just FLP?
- or is one reducible to the other?
- same flavor of argument, but important differences!
- CAP has more network failures: can drop packets arbitrarily
  - FLP: can only delay packet delivery
- CAP has a stronger availability requirement: every node must be
  able to participate
  - FLP: failed nodes don't need to reach consensus
- so FLP is a stronger result (assumes less about network failures,
  requires less availability)
- in other words, FLP should be more surprising than CAP

Where do systems we've seen so far fall in?
- lab2: consistency; availability fails under lots of conditions
- Paxos: consistency
  - when does availability fail? if more than half of the nodes go down
- Chubby: consistency, same answer
- Spanner: consistency
- Dynamo: availability; consistency can fail even in the normal case

But Paxos/Spanner/Chubby are designed to be highly available
- remain available as long as a majority can communicate
  (see the quorum check sketched below)
- this is not availability in CAP terms, though: that requires *any*
  node to be able to process requests, even when partitioned
- so it's possible that CAP availability is not what we really want?
- as long as we have a majority online, can just redirect clients to
  nodes in that majority
- on the other hand, still have to decide what happens when they
  can't communicate
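Here is a toy version of the two-node argument in Go. It's a sketch
under assumed names (the "node" type is made up for illustration, not
from the Gilbert & Lynch paper): once A and B stop exchanging
messages, an always-available B has no choice but to answer with a
stale value, so the register is not linearizable.

    package main

    import "fmt"

    // node models one replica of a single read/write register.
    type node struct{ value string }

    func (n *node) write(v string) { n.value = v } // applied locally only
    func (n *node) read() string   { return n.value }

    func main() {
        a := &node{value: "old"}
        b := &node{value: "old"}
        // Partition: a and b exchange no messages from here on.

        a.write("new") // A stays available and acknowledges the write
        // Linearizability requires a read that starts after the write
        // completes to return "new"; B has no way to learn about it.
        fmt.Println(b.read()) // prints "old": stale, not linearizable
    }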
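And the majority-availability point, also as a sketch (the function
name is hypothetical, not from any of these systems): a Paxos-style
node can commit only if it can reach a majority, so at most one side
of any partition makes progress, and the whole system stalls only
when no majority can communicate.

    package main

    import "fmt"

    // haveQuorum reports whether `reachable` nodes out of `total`
    // form a majority, i.e. at least floor(total/2) + 1.
    func haveQuorum(reachable, total int) bool {
        return reachable >= total/2+1
    }

    func main() {
        // A 5-node group split 3/2 by a partition:
        fmt.Println(haveQuorum(3, 5)) // true: majority side keeps serving
        fmt.Println(haveQuorum(2, 5)) // false: minority side must stall
    }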
Do we really care about partitions?
- Mike Stonebraker says we shouldn't; they don't happen often
  - "it doesn’t much matter what you do when confronted with network
    partitions"
- do you agree?
- note that a partition here just means one node is unable to
  communicate with the others
  - not that the system is split neatly in half

OK, but partitions are rare
- when the system is not partitioned, can we have consistency and
  availability?
- as far as the theorem is concerned, sure!
- in practice?
  - systems that give up availability usually only fail when there's
    a partition
  - systems that give up consistency usually do this all the time
- why?
  - is "almost" consistent useful?
  - performance!

Another "P": performance
- providing strong consistency means coordinating across replicas
- besides partitions, this also means an expensive latency cost
  - at least some operations must incur the cost of a wide-area RTT
- can do better with weak consistency: only apply writes locally
  - then propagate asynchronously (sketched in code at the end of
    these notes)

CAP implications
- can't have consistency when:
  - we want the system to be always online
  - we need to support disconnected operation
  - we need faster replies than a majority RTT
- in practice: can have consistency and availability together under
  realistic failure conditions:
  - a majority of nodes are up and can communicate
  - can redirect clients to that majority
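The weak-consistency write path above, as a final sketch (a
hypothetical toy, not any particular system's API): the client is
acknowledged as soon as the write applies locally, and the update is
queued for background delivery to peers, so no write waits on a
wide-area RTT; the cost is that other replicas can serve stale reads.

    package main

    import "fmt"

    // replica sketches a weakly consistent store: apply locally,
    // acknowledge, and replicate to peers in the background.
    type replica struct {
        data map[string]string
        out  chan [2]string // updates waiting to be pushed to peers
    }

    // put returns after the local apply; it never waits on the network.
    func (r *replica) put(k, v string) {
        r.data[k] = v
        select {
        case r.out <- [2]string{k, v}: // hand off for async propagation
        default: // queue full: a real system would buffer or retry
        }
    }

    func main() {
        r := &replica{data: map[string]string{}, out: make(chan [2]string, 16)}
        r.put("x", "1")                      // acknowledged immediately
        fmt.Println(r.data["x"], len(r.out)) // local value set, 1 update queued
    }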