Distributed transactions and two-phase commit

Transaction definition: a group of operations with four properties:
  Atomic – all or nothing
  Consistent – equivalent to some sequential order
  Isolated – no data races
  Durable – once done, stays done

At a logical level, transactions appear to occur in a serial order T0, T1, ..., Ti, Ti+1, where everything transaction Ti depends on has completed in some earlier transaction. Optimistic concurrency control makes this explicit: pick a logical time to execute a transaction, and abort if the data has been modified in the meantime. Snapshot execution: for read-only transactions, it is OK to run at a consistent time in the past -- e.g., between two other transactions.

So far, we have been updating state at a single server. We can use caches and cache coherence to perform operations more quickly, and transactions to ensure that the persistent state at the server is updated in a consistent way despite client failures. But what if we need to update state at two servers?

An example: data is stored in shards across a number of servers, but we still want serializability across the entire system. We want the state to be updated consistently, despite client and server failures. One model is based on cache coherence: bring in all the data exclusively, and then it is OK to commit. We can commit in parallel on different machines if they are working on different data; serializable transactions do not require a single central authority assigning an order.

Two generals problem: coordinate an attack on a valley using messages, but the messages are unreliable -- neither side can be sure its messages get through. Can you come up with a protocol that works? Try it: what should we send? Key point: the problem occurs even though no message is in fact lost. In fact, it is impossible to coordinate with any finite protocol. Suppose some finite protocol worked. Its last message might be lost, and the protocol must still work in that case, so the last message wasn't needed -- there is a shorter finite protocol that also works. Now take the shortest working protocol: by the same argument, its last message isn't needed either. Contradiction.

We can generalize: it is impossible to coordinate two nodes to take some joint action if nodes can fail and messages can take varying amounts of time to be delivered (e.g., because they can get dropped). You can't tell whether a node has failed or is just slow because the network is slow. So any finite coordination protocol can't achieve both timeliness and correctness.

So how are we to update state in two places in a consistent way? If we can't coordinate simultaneous action, what can we have? Eventual agreement, provided nodes eventually recover. We could just have one side dictate the result -- but then, if the other side can't complete the transaction (e.g., it runs into a deadlock, or some other competing transaction grabbed a lock), we have a problem! Hence, two-phase commit: first, check that the transaction can commit everywhere (grab locks); then one node (the coordinator) dictates the result (see the code sketch after the 2PC properties below).

At the coordinator:
  send vote-req to participants
  get replies
  if all yes, log commit; if not, log abort
  notify participants

At a participant:
  wait for vote-req
  determine whether the transaction can commit locally (no deadlock, enough space, etc.)
  log the result
  send the result to the coordinator
  if the result is OK to commit, wait for commit/abort from the coordinator
  log what the coordinator says

Walk through the algorithm: what happens if there is a failure at each step? What if we did things slightly differently? E.g., do we need the message log? What if we reply and then log?

What properties does 2PC have?
Safety:
  1. All hosts that decide reach the same decision.
  2. No commit unless everyone says "yes".
Liveness:
  3. If there are no failures and all say "yes", then commit.
  4. If there are failures, then after repair, and after waiting long enough, some decision is reached. (marriage anecdote)
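A minimal sketch in Go of the coordinator/participant flow above, as an in-process simulation: participants are plain structs rather than remote nodes, "logging" is an append to a slice standing in for a durable write-ahead log, and the names (participant, prepare, finish, twoPhaseCommit) are illustrative, not from any real system.

package main

import "fmt"

type participant struct {
	canCommit bool     // e.g., no deadlock, enough space
	log       []string // stand-in for this node's durable log
}

// prepare is phase 1 at a participant: decide whether the transaction
// can commit locally, log the vote durably, then reply.
func (p *participant) prepare() bool {
	if p.canCommit {
		p.log = append(p.log, "prepared") // must hit disk BEFORE replying
		return true
	}
	p.log = append(p.log, "abort")
	return false
}

// finish is phase 2 at a participant: record the coordinator's decision.
func (p *participant) finish(commit bool) {
	if commit {
		p.log = append(p.log, "commit")
	} else {
		p.log = append(p.log, "abort")
	}
}

// twoPhaseCommit plays the coordinator: collect votes, log the
// decision, then notify everyone.
func twoPhaseCommit(parts []*participant) bool {
	decision := true
	for _, p := range parts { // phase 1: vote-req to all participants
		if !p.prepare() {
			decision = false // any "no" vote forces abort
		}
	}
	coordLog := []string{"abort"} // the coordinator's log record is the
	if decision {                 // commit point of the whole transaction;
		coordLog = []string{"commit"} // it must be durable before phase 2
	}
	fmt.Println("coordinator logs:", coordLog)
	for _, p := range parts { // phase 2: notify participants of the outcome
		p.finish(decision)
	}
	return decision
}

func main() {
	a := &participant{canCommit: true}
	b := &participant{canCommit: false} // e.g., b hit a lock conflict
	fmt.Println("committed:", twoPhaseCommit([]*participant{a, b}))
	fmt.Println("a's log:", a.log, "b's log:", b.log)
}

Note the orderings the sketch preserves, which answer the "reply then log?" question above: a participant's "prepared" record must be durable before it replies yes (after that point it can no longer unilaterally abort), and the coordinator's record must be durable before anyone is notified. If a participant replied yes and then crashed before logging, on recovery it would not know it had promised, and it might abort while the coordinator commits, violating safety property 1.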
How well does 2PC work in practice? Not so well! It is a blocking protocol: if the coordinator fails, everyone must wait for the coordinator to recover to find out whether the commit occurred. In essence, we have turned the problem of updating state in multiple locations atomically into a single commit at the coordinator. That means we depend on the coordinator, and if it fails, we are stuck until it recovers and can tell us what happened.

Applications of two-phase commit

How would you build a reliable peer-to-peer email system? (Or reliable exactly-once RPC?) There are two ways to build this: in one, the sender writes email to a server, and the server provides it to clients; in the alternative, clients communicate directly with each other using 2PC.

Walk through the example of sending mail to many users:
  What if one user doesn't exist, but the mail has already been delivered to some other users? How do we undo?
  What if, concurrently, one reader reads his/her mail? Does that user see the tentative new mail? Does the reading user block? Where?
  Suppose read_mail also deletes the mail. What if new mail arrives just before the delete? Will it be deleted but not returned? Why not? What lock protects against this? Where is the lock acquired? Where is it released?
  What if a user is on the list twice? Locks are held until the end of the top-level transaction -- deadlock?

Next question: can we design a non-blocking commit protocol? That is, even though nodes can fail and restart, and messages can be delayed and lost, can we still make progress when a node fails, without waiting for it to restart? Note that Lab2 doesn't have this property: if the primary fails without having acked the view, or if the viewserver fails, the system blocks. Paxos, next week, is a non-blocking protocol.

The central idea is idempotence -- the idea behind NFS. Each replica acts to ensure that its behavior after a failure is consistent with what might have happened before the failure. Each server replica will be identical, but together we need them to work in concert to decide on a commit, so that we can figure out whether the commit happened even if any node fails (or is so slow it appears to have failed). This should seem hard! But it is a key building block in almost all highly available distributed systems today.
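As a concrete illustration of idempotence, in the spirit of the exactly-once RPC mentioned above (not code from NFS or the labs): if each request carries a unique ID and the server remembers the reply it gave, then a retry after a lost reply replays the saved answer instead of re-executing. The server type, field names, and deliver operation below are all hypothetical.

package main

import "fmt"

type server struct {
	seen    map[int64]string // request ID -> cached reply
	mailbox []string
}

// deliver appends a message exactly once, no matter how many times the
// client retries with the same request ID.
func (s *server) deliver(reqID int64, msg string) string {
	if reply, ok := s.seen[reqID]; ok {
		return reply // duplicate request: replay old answer, no new effect
	}
	s.mailbox = append(s.mailbox, msg)
	s.seen[reqID] = "ok" // in a real system this table must be durable
	return "ok"
}

func main() {
	s := &server{seen: make(map[int64]string)}
	fmt.Println(s.deliver(42, "hello")) // first attempt
	fmt.Println(s.deliver(42, "hello")) // retry after a lost reply
	fmt.Println("messages delivered:", len(s.mailbox)) // 1, not 2
}

Because retries have no new effect, a client can keep resending until it gets an answer, and a recovering node can safely redo whatever it may have been doing at the time of the crash -- exactly the property the replicas above need in order to behave consistently with what might have happened before a failure.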