Distributed transactions and two-phase commit

Transaction definition: a group of operations with four properties:
  Atomic – all or nothing
  Consistent – equivalent to some sequential order
  Isolated – no data races
  Durable – once done, stays done

At a logical level, transactions appear to occur in a serial order T0, T1, ..., Ti, Ti+1, where everything transaction Ti depends on has completed in some earlier transaction. Optimistic concurrency control makes this explicit: pick a logical time to execute a transaction, and abort if the data has been modified in the meantime. Snapshot execution: for read-only transactions, it is OK to run at a consistent time in the past -- e.g., between two other transactions.

So far, we have been updating state at a single server. We can use caches and cache coherence to perform operations more quickly, and transactions to ensure that the persistent state at the server is updated in a consistent way despite client failures. But what if we need to update state at two servers?

An example: data is stored in shards across a number of servers, but we still want serializability across the entire system. We want the state to be updated consistently, despite client and server failures. One model is based on cache coherence: bring in all the data exclusively, and then it is OK to commit. We can commit in parallel on different machines if they are working on different data; serializable transactions do not require a single central authority assigning an order.

Two generals problem: coordinate an attack on a valley using messages, but the messages are unreliable -- neither side can be sure its messages get through. Can you come up with a protocol that works? Try it: what should we send? Key point: the problem occurs even though no message is in fact lost. In fact, it is impossible to coordinate with any finite protocol. Suppose some finite protocol worked. Its last message might be lost, and the protocol must still work in that case, so the last message wasn't needed -- there is a shorter finite protocol that also works. Now take the shortest working protocol: by the same argument, its last message isn't needed either. Contradiction.

We can generalize: it is impossible to coordinate two nodes to take some joint action if nodes can fail and messages can take varying amounts of time to be delivered (e.g., because they can get dropped). You can't tell whether a node has failed or is just slow because the network is slow. So any finite coordination protocol can't achieve both timeliness and correctness.

So how are we to update state in two places in a consistent way? If we can't coordinate simultaneous action, what can we have? Eventual agreement, provided nodes eventually recover. We could just have one side dictate the result -- but then, if the other side can't complete the transaction (e.g., it runs into a deadlock, or some other competing transaction grabbed a lock), we have a problem! Hence, two-phase commit: first, check that the transaction can commit everywhere (grab locks); then one node (the coordinator) dictates the result (see the code sketch after the 2PC properties below).

At the coordinator:
  send vote-req to participants
  get replies
  if all yes, log commit; if not, log abort
  notify participants

At a participant:
  wait for vote-req
  determine whether the transaction can commit locally (no deadlock, enough space, etc.)
  log the result
  send the result to the coordinator
  if the result is OK to commit, wait for commit/abort from the coordinator
  log what the coordinator says

Walk through the algorithm: what happens if there is a failure at each step? What if we did things slightly differently? E.g., do we need the message log? What if we reply and then log?

What properties does 2PC have?
Safety:
  1. All hosts that decide reach the same decision.
  2. No commit unless everyone says "yes".
Liveness:
  3. If there are no failures and all say "yes", then commit.
  4. If there are failures, then after repair, and after waiting long enough, some decision is reached. (marriage anecdote)
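A minimal sketch in Go of the coordinator/participant flow above, as an in-process simulation: participants are plain structs rather than remote nodes, "logging" is an append to a slice standing in for a durable write-ahead log, and the names (participant, prepare, finish, twoPhaseCommit) are illustrative, not from any real system.

package main

import "fmt"

type participant struct {
	canCommit bool     // e.g., no deadlock, enough space
	log       []string // stand-in for this node's durable log
}

// prepare is phase 1 at a participant: decide whether the transaction
// can commit locally, log the vote durably, then reply.
func (p *participant) prepare() bool {
	if p.canCommit {
		p.log = append(p.log, "prepared") // must hit disk BEFORE replying
		return true
	}
	p.log = append(p.log, "abort")
	return false
}

// finish is phase 2 at a participant: record the coordinator's decision.
func (p *participant) finish(commit bool) {
	if commit {
		p.log = append(p.log, "commit")
	} else {
		p.log = append(p.log, "abort")
	}
}

// twoPhaseCommit plays the coordinator: collect votes, log the
// decision, then notify everyone.
func twoPhaseCommit(parts []*participant) bool {
	decision := true
	for _, p := range parts { // phase 1: vote-req to all participants
		if !p.prepare() {
			decision = false // any "no" vote forces abort
		}
	}
	coordLog := []string{"abort"} // the coordinator's log record is the
	if decision {                 // commit point of the whole transaction;
		coordLog = []string{"commit"} // it must be durable before phase 2
	}
	fmt.Println("coordinator logs:", coordLog)
	for _, p := range parts { // phase 2: notify participants of the outcome
		p.finish(decision)
	}
	return decision
}

func main() {
	a := &participant{canCommit: true}
	b := &participant{canCommit: false} // e.g., b hit a lock conflict
	fmt.Println("committed:", twoPhaseCommit([]*participant{a, b}))
	fmt.Println("a's log:", a.log, "b's log:", b.log)
}

Note the orderings the sketch preserves, which answer the "reply then log?" question above: a participant's "prepared" record must be durable before it replies yes (after that point it can no longer unilaterally abort), and the coordinator's record must be durable before anyone is notified. If a participant replied yes and then crashed before logging, on recovery it would not know it had promised, and it might abort while the coordinator commits, violating safety property 1.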
How well does 2PC work in practice? Not so well! It is a blocking protocol: if the coordinator fails, everyone must wait for the coordinator to recover to find out whether the commit occurred. In essence, we have turned the problem of updating state in multiple locations atomically into a single commit at the coordinator. That means we depend on the coordinator, and if it fails, we are stuck until it recovers and can tell us what happened.

Applications of two-phase commit

How would you build a reliable peer-to-peer email system? (Or reliable exactly-once RPC?) There are two ways to build this: in one, the sender writes email to a server, and the server provides it to clients; in the alternative, clients communicate directly with each other using 2PC.

Walk through the example of sending mail to many users:
  What if one user doesn't exist, but the mail has already been delivered to some other users? How do we undo?
  What if, concurrently, one reader reads his/her mail? Does that user see the tentative new mail? Does the reading user block? Where?
  Suppose read_mail also deletes the mail. What if new mail arrives just before the delete? Will it be deleted but not returned? Why not? What lock protects against this? Where is the lock acquired? Where is it released?
  What if a user is on the list twice? Locks are held until the end of the top-level transaction -- deadlock?

Next question: can we design a non-blocking commit protocol? That is, even though nodes can fail and restart, and messages can be delayed and lost, can we still make progress when a node fails, without waiting for it to restart? Note that Lab2 doesn't have this property: if the primary fails without having acked the view, or if the viewserver fails, the system blocks. Paxos, next week, is a non-blocking protocol.

The central idea is idempotence -- the idea behind NFS. Each replica acts to ensure that its behavior after a failure is consistent with what might have happened before the failure. Each server replica will be identical, but together we need them to work in concert to decide on a commit, so that we can figure out whether the commit happened even if any node fails (or is so slow it appears to have failed). This should seem hard! But it is a key building block in almost all highly available distributed systems today.
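As a concrete illustration of idempotence, in the spirit of the exactly-once RPC mentioned above (not code from NFS or the labs): if each request carries a unique ID and the server remembers the reply it gave, then a retry after a lost reply replays the saved answer instead of re-executing. The server type, field names, and deliver operation below are all hypothetical.

package main

import "fmt"

type server struct {
	seen    map[int64]string // request ID -> cached reply
	mailbox []string
}

// deliver appends a message exactly once, no matter how many times the
// client retries with the same request ID.
func (s *server) deliver(reqID int64, msg string) string {
	if reply, ok := s.seen[reqID]; ok {
		return reply // duplicate request: replay old answer, no new effect
	}
	s.mailbox = append(s.mailbox, msg)
	s.seen[reqID] = "ok" // in a real system this table must be durable
	return "ok"
}

func main() {
	s := &server{seen: make(map[int64]string)}
	fmt.Println(s.deliver(42, "hello")) // first attempt
	fmt.Println(s.deliver(42, "hello")) // retry after a lost reply
	fmt.Println("messages delivered:", len(s.mailbox)) // 1, not 2
}

Because retries have no new effect, a client can keep resending until it gets an answer, and a recovering node can safely redo whatever it may have been doing at the time of the crash -- exactly the property the replicas above need in order to behave consistently with what might have happened before a failure.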