Lecture 6: Primary-Backup replication

Agenda

  • Primary-Backup replication
    • lab 2
    • more about lab 2 in section tomorrow

Motivation for Primary-Backup

So far

  • We've talked about RPC (and you're working on lab 1)
  • Clients send commands to server, get responses
  • The server executes the requests in the order it receives them
  • We can use sequence numbers and retransmission to handle all network failures (in our failure model): drops, dups, delays, reorderings.
    • Great job tolerating all network failures in the model! Yay!
  • If the server crashes, then the system stops working. So our RPC system doesn't do a very good job at tolerating node failures at all.
    • Even recovering manually from such a crash would be difficult; if the data was only stored in memory, it might just be gone.

Towards tolerating node failures

  • Goal: tolerate (some) node failures
  • Technique: replicate
    • Have multiple copies of (essentially) our RPC server, so that if some of the copies crash, the others can still make progress.

Straw proposal: Naive client-driven replication

Here is another one of James's famously bad straw proposals:

  • Clients operate similarly to lab 1 (part 3), except now there are two servers (a sketch of this client appears after this list).
  • Clients submit requests to both servers
    • They wait until they hear back from both servers (retransmitting as needed), then return the results (which hopefully agree!) to the application layer
  • The servers behave exactly like lab 1 (part 3), including the at-most-once stuff.
  • If one of the servers goes down, we can manually reconfigure the system to revert to using just the other server.
    • Note that we couldn't do this if we weren't sending requests to both servers, since in that case, only one of the servers might have the data.
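
To make the straw proposal concrete, here is a rough sketch of the client's logic in Java. Everything here is a placeholder (Address, Command, Result, and sendAndAwaitReply are stand-ins, not lab classes), and again: do not build this.

    class NaiveReplicatedClient {
        record Address(String name) {}        // placeholder for a network address
        interface Command {}                  // placeholder for a request type
        interface Result {}                   // placeholder for a response type

        private final Address server1, server2;

        NaiveReplicatedClient(Address server1, Address server2) {
            this.server1 = server1;
            this.server2 = server2;
        }

        Result submit(Command command) {
            // Send the same request to both servers and block (retransmitting as
            // needed) until both have replied, as in lab 1's client.
            Result r1 = sendAndAwaitReply(server1, command);
            Result r2 = sendAndAwaitReply(server2, command);
            // We hope r1 and r2 agree, but nothing here forces the two servers to
            // execute concurrent requests in the same order.
            return r1;
        }

        private Result sendAndAwaitReply(Address server, Command command) {
            // lab 1-style send + retransmit until a reply arrives; elided
            return null;
        }
    }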

What goes wrong?

  • Is this protocol correct? Does it handle network failures? (We said it sort of manually handles node failures.)
  • No! Suppose we have two clients that both submit a request.
    • Each client sends its request to both servers
    • If those requests arrive in different orders at the two servers, each server will happily execute them in the order it received them.
    • The servers' states diverge, clients get contradictory responses, and everything breaks.

It's not that surprising that this doesn't work. In this straw proposal, the servers never talk to each other!

Primary backup basics

  • Primary-backup replication will be our first step towards tolerating node failures.
    • Important to say right at the very beginning that our primary-backup protocol has many shortcomings! We will discuss these extensively, and fix them later in the quarter.
  • Basic idea of primary-backup replication is to have two servers: the primary and the backup.
    • We will design the protocol so that if either the primary or the backup crashes, clients will still be able to get a response from the system.
  • We want the system to be "equivalent", from a client's perspective, to a single copy of the application.
    • More formally, we want the system to be linearizable. We will talk more about that soon.
    • For now, one key thing we want to avoid is two different copies of the application both responding to client requests with independent and contradictory results.
    • For example, in the key-value store, if client 1 appends "x" to key "foo", and client 2 appends "y" to key "foo", then there are several allowed responses (illustrated in the code after this list):
      • (assume that before this point, key "foo" stored no value / the empty string)
      • client 1's request gets there first, so it gets result "x". client 2's request gets there second, so it gets result "xy".
      • or the other way around: client 2's request is first, result "y"; client 1's request is second, result "yx"
      • What would not be allowed is client 1 getting result "x" and client 2 getting result "y", which would indicate that somehow both clients' requests were executed "first".
        • The way this would happen is typically due to a "split brain" problem, where two copies of the application both somehow think they are "in charge" but they are operating independently and not staying in sync. So they can give contradictory answers like this because they are not even talking to each other.
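
Here is a small, self-contained Java illustration of those two allowed orderings. The append helper here is just shorthand for the key-value store's append operation, not a lab class.

    import java.util.HashMap;
    import java.util.Map;

    class AppendOrderings {
        // Append value to whatever is stored at key, returning the new value
        // (the result the appending client would see).
        static String append(Map<String, String> kv, String key, String value) {
            String newValue = kv.getOrDefault(key, "") + value;
            kv.put(key, newValue);
            return newValue;
        }

        public static void main(String[] args) {
            // Order 1: client 1's request executes first.
            Map<String, String> kv1 = new HashMap<>();
            System.out.println(append(kv1, "foo", "x"));   // client 1 sees "x"
            System.out.println(append(kv1, "foo", "y"));   // client 2 sees "xy"

            // Order 2: client 2's request executes first.
            Map<String, String> kv2 = new HashMap<>();
            System.out.println(append(kv2, "foo", "y"));   // client 2 sees "y"
            System.out.println(append(kv2, "foo", "x"));   // client 1 sees "yx"

            // Disallowed: client 1 seeing "x" while client 2 sees "y" would mean both
            // requests ran "first" against independent copies (split brain).
        }
    }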

State-Machine Replication

Before we get to the details of primary-backup, let's reflect some more on the way RPCs were set up in lab 1.

Input/Output State Machines (Applications in the labs)

  • In the labs, our RPC implementation allows Commands to be submitted to an Application, which returns Results.
  • This Application can also be thought of as a state machine (but a different kind than transition systems!)
    • Remember that in CS, the phrase "state machine" can mean lots of different things. Usually it means a finite state machine like a DFA or an NFA, but in this class, we will use it to mean something like an Application.
  • These kinds of state machines have a (potentially infinite) state space (e.g., a key-value store might have a Map<String,String> as its state)
  • And, they execute by taking a sequence of steps.
    • At each step, the state machine:
      • consumes exactly one input (called a Command in the labs)
      • produces exactly one output (called a Result in the labs)
      • updates its state
  • Our RPC implementation doesn't care exactly what the Application does, as long as it has this interface.
  • We will use the phrase "input/output state machine" to refer to this kind of state machine, represented by Application in the labs (a minimal sketch of this interface appears after this list).
    • Again, not to be confused with transition systems, which are similar, but don't have input/output at each step.
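
A minimal sketch of this interface in Java might look like the following. The labs' actual Application interface has the same flavor, though the exact types and annotations differ.

    // Placeholders for the inputs and outputs of the state machine.
    interface Command {}   // e.g., Put, Append, Get in a key-value store
    interface Result {}    // e.g., the value stored at a key

    interface Application {
        // One step of the state machine: consume exactly one input,
        // update the internal state, and produce exactly one output.
        Result execute(Command command);
    }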

RPC as hosted state machines

  • In lab 1, you are implementing an RPC service that hosts an application over the network.
  • It accepts Commands from clients over the network and sends back Results also over the network
  • Separates concerns between network issues versus the Application
    • In lab 1, you are implementing a hosted key-value store.
    • But the key-value store is just an Application that doesn't know anything about the network.
      • Just define the state, Commands, and Results, and then implement the execute method (see the toy example after this list).
    • And the RPC client and server, plus the AMOApplication wrapper together work to host the Application, but they don't care at all about its internals!
  • This separation is really great!
    • It means that we can swap out the Application for a different one, and all the RPC stuff will continue to work just fine to host the new Application over the network.
    • And it means we can swap out the hosting mechanism, by replacing the simple RPC service with something more sophisticated.
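
For concreteness, here is a toy key-value store written against the interface sketched above. The class and command names are illustrative, not the lab's exact ones; the point is that the Application defines only state, Commands, Results, and execute, and knows nothing about the network.

    import java.util.HashMap;
    import java.util.Map;

    class KVStore implements Application {
        private final Map<String, String> state = new HashMap<>();

        // One example Command and its Result; a real store would also have Get, Put, etc.
        record Append(String key, String value) implements Command {}
        record AppendResult(String newValue) implements Result {}

        @Override
        public Result execute(Command command) {
            if (command instanceof Append a) {
                // Deterministic and no global side effects: the new state and the
                // Result depend only on the old state and the Command.
                String newValue = state.getOrDefault(a.key(), "") + a.value();
                state.put(a.key(), newValue);
                return new AppendResult(newValue);
            }
            throw new IllegalArgumentException("unknown command: " + command);
        }
    }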

Primary-Backup State-Machine Replication

The basic idea

  • Our Primary-Backup system will host state machines in much the same way that the RPC service of lab 1 did.
  • Main difference is that there will be multiple copies of the Application, one for each replica in the system.
  • The next figure shows the idea of how the primary-backup system will work in the normal case
  • (Note that we are omitting several details here that will be expanded on below. This is just to give you the idea.)
  • There are two servers, the primary and the backup. They each have a copy of the Application.
    • Clients submit requests to the primary, which replicates them to the backup before responding (a simplified sketch of this flow appears at the end of this list).
  • This basic workflow immediately raises several challenges:
    1. We now have two copies of the Application. How do we make sure that they are in sync at all times?
    2. Each request gets executed twice—once on the primary and once on the backup. How do we know the state changes and results will be the same? Also, how do we prevent global side effects of the request from occurring twice?
  • Our goal is to design a protocol that addresses these challenges. We will ensure that:
    • the backup is at most one request ahead of the primary
      • Side note: It is absolutely fundamental to distributed systems that you cannot keep two copies of some data on two different machines perfectly in sync.
        • If the primary tells the backup to execute a request, the primary cannot know exactly when the backup will execute it, because the network has arbitrary delay.
        • Since it can't know exactly when the backup will execute it, the primary has no way to execute the request at exactly the same time.
        • So they must be a little bit out of sync, just because of the potential for network delay.
        • But, we can have the backup inform the primary when the request is finished executing, so that the primary can execute it and move on.
        • This puts a bound on how far out of sync the two replicas can get, and is a very common idea in distributed protocol design.
    • the primary and backup execute exactly the same sequence of requests in the same order
    • the Application is deterministic and has no global side-effects
      • ("Deterministic" means that if you execute the same request from the same state, you always get the same state changes and response. The key-value store Application is deterministic, and we will assume that all our Applications are deterministic in this class.)

Node Failures

  • Remember that our goal was to improve our RPC system by tolerating (some) node failures. How are we doing so far?
  • Well, let's consider what happens if the primary fails.
    • That was the server that clients were submitting requests to. So if that server is no longer available, we need to establish a new server as primary and tell clients to go talk to that server.
      • (In our protocol, we will always select the previous backup as the new primary. This will turn out to be crucial for our protocol's correctness.)
  • If the backup fails, then the primary won't be able to replicate any more requests to the backup, so again clients stop making progress.
    • We need to either let the primary declare the backup dead and continue without replicating, or we need to replace the backup.
  • In either case, the basic idea is to failover to another server. ("Failover" basically means to choose a new server to play the role of the old server that failed.)
    • To do this, we need to first detect that the server has crashed, then establish a replacement, and inform everyone who needs to know to talk to the replacement instead.
  • Here is a straw proposal that does not work at all (do not implement this in lab 2!)
    • When the primary hasn't heard from the backup for a while, the primary can decide to replace it with a new backup.
    • Similarly, when the backup hasn't heard from the primary in a while, the backup can decide to become the new primary.
    • Why is this a bad idea? Suppose the network link between the primary and backup is down, so they cannot exchange messages.
      • Then neither of them will hear from the other, so they will both assume the other node has failed, even though actually both nodes are fine.
      • So the primary will pick a new backup and start accepting client requests again by replicating them to the new backup.
      • But the old backup will declare itself the new primary, select itself another backup, and also start accepting client requests.
    • This leads to a scenario where we have two completely separate primaries that are functioning independently and accepting client requests.
      • Their Applications will diverge, and the result is a system that does not at all behave like our old single RPC server.
      • In distributed systems, people often call this situation "split brain" because the system has essentially split into two independent systems that don't know about each other (which is bad, because the system is supposed to present a unified view of the Application to the clients).

  • Remember that it's a fundamental property of distributed systems that you cannot distinguish node failure from network failure. The symptoms of both failures are that you don't hear from the other node.
    • So it is impossible to accurately detect whether a node has crashed.
  • That's why it's a bad idea to allow different nodes to decide on their own that some other nodes are dead.
  • Instead, we need to find a way to tolerate failures that doesn't depend on accurately detecting whether those failures actually occurred or not.

The centralized view service

  • One way to avoid nodes disagreeing about whether other nodes have crashed is to have one node make all of those decisions on everyone's behalf. This is the approach we will take in lab 2.
  • The idea is to introduce a separate node that we will call the view server or view service, that monitors all the other nodes in the system and decides which ones of them have crashed.
    • Note that the view server can be wrong about whether a node is actually up or down, but it doesn't matter. Other nodes will always take the view server's opinion about which nodes are up as if it were the truth.
  • Once the view server has an idea about which nodes are up, it can select a primary and a backup and inform everyone (including clients). A rough sketch appears at the end of this section.
  • If a node goes down, the view server will detect it and orchestrate a failover to replace that node.
  • Notice that this solves the split brain issue above, because now all decisions about failover are centralized at the view server. Yay!
  • Unfortunately, this approach also has a major drawback, which is that if the view service crashes, the system will not be able to tell who is the primary and who is the backup, and no further failovers will be possible.
    • This is unsatisfying! But the protocol is already quite interesting and complex enough that it makes for excellent practice in lab 2.
    • In a few weeks, we will study Paxos, which is a fully general replication solution that can tolerate many node failures.
      • It is quite a bit more subtle and complicated than this protocol, so warming up with a view-service-based primary-backup protocol in lab 2 will prepare you well for lab 3.
    • It's ok to be a little disappointed in this protocol! We promise it gets more satisfying when we get to Paxos.
  • For lab 2 and the rest of our discussion of primary-backup, we will assume that the view server never crashes.
  • That said, notice that this design is already quite a bit better than our single RPC server in lab 1
    • With only one server, if that server crashed, no client could make progress.
    • But with primary-backup, we will be able to tolerate either the primary or the backup crashing, and clients will still be able to make progress after failover.
      • If the view server crashes, then we won't be able to failover, but we've still tolerated crashes in all but one of our nodes!
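
Finally, here is a rough, illustrative sketch of the view-service idea in Java. It is not lab 2's actual view server: the real one has additional rules (for example, about when it is allowed to move to a new view), and the timeout value and helper names here are made up.

    import java.util.HashMap;
    import java.util.Map;

    class ViewServerSketch {
        // A view: a numbered assignment of the primary and backup roles.
        record View(int viewNum, String primary, String backup) {}

        private View current = new View(0, null, null);
        private final Map<String, Long> lastPingTime = new HashMap<>();
        private static final long TIMEOUT_MILLIS = 1000;   // illustrative value

        // Servers ping periodically; the reply tells them the current view.
        View handlePing(String server, long now) {
            lastPingTime.put(server, now);
            fillEmptyRoles(server);
            return current;
        }

        // Called periodically. If the view server hasn't heard from the primary or
        // backup recently, it *decides* that node is dead and moves to a new view.
        // That decision may be wrong, but every node treats it as the truth, which
        // is what rules out split brain.
        void tick(long now) {
            if (current.primary() != null && timedOut(current.primary(), now)) {
                // Crucial: the new primary is always the old backup.
                current = new View(current.viewNum() + 1, current.backup(), pickIdleServer());
            } else if (current.backup() != null && timedOut(current.backup(), now)) {
                current = new View(current.viewNum() + 1, current.primary(), pickIdleServer());
            }
        }

        private void fillEmptyRoles(String server) {
            if (current.primary() == null) {
                current = new View(current.viewNum() + 1, server, null);
            } else if (current.backup() == null && !server.equals(current.primary())) {
                current = new View(current.viewNum() + 1, current.primary(), server);
            }
        }

        private boolean timedOut(String server, long now) {
            Long last = lastPingTime.get(server);
            return last == null || now - last > TIMEOUT_MILLIS;
        }

        private String pickIdleServer() {
            return null;   // choose some other live server, elided
        }
    }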