Week 2: Primary-backup replication

Primary-backup

Motivation for Primary-Backup

  • We've talked about RPC (and you've implemented it in lab 1)
  • Clients send commands to server, get responses
  • The server executes the requests in the order it receives them
  • We can use sequence numbers and retransmission to handle all network failures (in our failure model): drops, dups, delays, reorderings.
    • Great job tolerating all network failures in the model! Yay!
  • If the server crashes, then the system stops working. So our RPC system doesn't tolerate node failures at all.
    • Even recovering manually from such a crash would be difficult; if the data was only stored in memory, it might simply be gone.

Towards tolerating node failures

  • Goal: tolerate (some) node failures
  • Technique: replicate
    • Have multiple copies of (essentially) our RPC server, so that if some of the copies crash, the others can still make progress.

Straw proposal: Naive client-driven replication

Here is another one of James's famously bad straw proposals:

  • Clients operate similarly to lab 1 (part 3), except now there are two servers.
  • Clients submit requests to both servers
    • They wait until they hear back from both servers (retransmitting as needed), then return the results (which hopefully agree!) to the application layer
  • The servers behave exactly like lab 1 (part 3), including the at-most-once stuff.
  • If one of the servers goes down, we can manually reconfigure the system to revert to using just the other server.
    • Note that we couldn't do this if we weren't sending requests to both servers, since in that case the surviving server might be missing some of the data.

What goes wrong?

  • Is this protocol correct? Does it handle network failures? (We said it sort of manually handles node failures.)
  • No! Suppose we have two clients that both submit a request.
    • Each client sends its request to both servers
    • If those requests arrive in different orders at the two servers, each server will happily execute them in the order it received them.
    • The servers' states diverge, clients get contradictory responses, everything breaks.

It's not that surprising that this doesn't work. In this straw proposal, the servers never talk to each other!

Primary-backup basics

  • Primary-backup replication will be our first step towards tolerating node failures.
    • Important to say right at the very beginning that our primary-backup protocol has many shortcomings! We will discuss these extensively, and fix them later in the quarter.
  • Basic idea of primary-backup replication is to have two servers: the primary and the backup.
    • We will design the protocol so that if either the primary or the backup crashes, clients will still be able to get a response from the system.
  • We want the system to be "equivalent", from a client's perspective, to a single copy of the application.
    • More formally, we want the system to be linearizable. We will talk more about that soon.
    • For now, one key thing we want to avoid is two different copies of the application both responding to client requests with independent and contradictory results.
    • For example, in the key-value store, if client 1 appends "x" to key "foo", and client 2 appends "y" to key "foo", then there are several allowed responses:
      • (assume that before this point, key "foo" stored no value / the empty string)
      • client 1's request gets there first, so it gets result "x". client 2's request gets there second, so it gets result "xy".
      • or the other way around: client 2's request is first, result "y"; client 1's request is second, result "yx"
      • What would not be allowed is client 1 getting result "x" and client 2 getting result "y", which would indicate that somehow both clients' requests were executed "first".
        • The way this would happen is typically due to a "split brain" problem, where two copies of the application both somehow think they are "in charge" but they are operating independently and not staying in sync. So they can give contradictory answers like this because they are not even talking to each other.

State-Machine Replication

Before we get to the details of primary-backup, let's reflect some more on the way RPCs were set up in lab 1.

Input/Output State Machines (Applications in the labs)

  • In the labs, our RPC implementation allows Commands to be submitted to an Application, which returns Results.
  • This Application can also be thought of as a state machine (but a different kind than transition systems!)
    • Remember that in CS, the phrase "state machine" can mean lots of different things. Usually it means a finite state machine like a DFA or an NFA, but in this class, we will use it to mean something like an Application.
  • These kinds of state machines have a (potentially infinite) state space (e.g., a key-value store might have a Map<String,String> as its state)
  • And, they execute by taking a sequence of steps.
    • At each step, it:
      • consumes exactly one input (called a Command in the labs)
      • produces exactly one output (called a Result in the labs)
      • updates its state
  • Our RPC implementation doesn't care exactly what the Application does, as long as it has this interface (a rough code sketch of this interface follows this list).
  • We will use the phrase "input/output state machine" to refer to this kind of state machine, represented by Application in the labs.
    • Again, not to be confused with transition systems, which are similar, but don't have input/output at each step.
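
To make this concrete, here is a minimal sketch in Java of an input/output state machine, using a toy key-value store with an append command as the example Application. The names and signatures are illustrative only, not the labs' exact API; the point is just the shape of the interface: one Command in, one Result out, and a state update in between.

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of an input/output state machine, loosely modeled on the labs'
// Application interface (names and signatures here are illustrative, not the
// labs' exact API).
interface Command {}
interface Result {}

interface Application {
    // One step: consume exactly one Command, update state, produce one Result.
    Result execute(Command command);
}

// A toy key-value store as an example Application. Its (unbounded) state is a
// Map<String, String>; it is deterministic and has no global side effects.
class KVStore implements Application {
    record Append(String key, String value) implements Command {}
    record Get(String key) implements Command {}
    record Value(String value) implements Result {}

    private final Map<String, String> state = new HashMap<>();

    @Override
    public Result execute(Command command) {
        if (command instanceof Append a) {
            String newValue = state.getOrDefault(a.key(), "") + a.value();
            state.put(a.key(), newValue);
            return new Value(newValue);   // append returns the new value
        } else if (command instanceof Get g) {
            return new Value(state.getOrDefault(g.key(), ""));
        }
        throw new IllegalArgumentException("unknown command");
    }
}
```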

RPC as hosted state machines

  • In lab 1, you are implementing an RPC service that hosts an application over the network.
  • It accepts Commands from clients over the network and sends back Results also over the network
  • Separates concerns between network issues versus the Application
    • In lab 1, you are implementing a hosted key-value store.
    • But the key-value store is just an Application that doesn't know anything about the network.
      • Just define the state, Commands, and Results, and then implement the execute method.
    • And the RPC client and server, plus the AMOApplication wrapper, together work to host the Application, but they don't care at all about its internals!
  • This separation is really great!
    • It means that we can swap out the Application for a different one, and all the RPC stuff will continue to work just fine to host the new Application over the network.
    • And it means we can swap out the hosting mechanism, by replacing the simple RPC service with something more sophisticated.
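
As an illustration of layering generic hosting machinery around an unchanged Application, here is a rough sketch of an at-most-once wrapper in the spirit of the labs' AMOApplication, reusing the Application/Command/Result types from the sketch above. The field and type names are invented and this is not the labs' actual class; it only shows that deduplication can be added without knowing anything about the wrapped Application's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an at-most-once wrapper around an arbitrary Application (names
// invented). It deduplicates retransmitted commands using per-client sequence
// numbers, and it never looks inside the wrapped Command.
class AtMostOnceApplication {
    record AMOCommand(String clientId, int sequenceNum, Command command) {}
    record AMOResult(int sequenceNum, Result result) {}

    private final Application application;
    private final Map<String, AMOResult> lastResultPerClient = new HashMap<>();

    AtMostOnceApplication(Application application) {
        this.application = application;
    }

    AMOResult execute(AMOCommand amoCommand) {
        AMOResult last = lastResultPerClient.get(amoCommand.clientId());
        // Duplicate or old request: return the cached result, do not re-execute.
        // (This assumes each client has at most one outstanding request, so a
        // request with an old sequence number no longer matters to the client.)
        if (last != null && last.sequenceNum() >= amoCommand.sequenceNum()) {
            return last;
        }
        Result result = application.execute(amoCommand.command());
        AMOResult amoResult = new AMOResult(amoCommand.sequenceNum(), result);
        lastResultPerClient.put(amoCommand.clientId(), amoResult);
        return amoResult;
    }
}
```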

Primary-Backup State-Machine Replication

The basic idea

  • Our Primary-Backup system will host state machines much in the same way that the RPC service of lab 1 did.
  • Main difference is that there will be multiple copies of the Application, one for each replica in the system.
  • The next figure shows the idea of how the primary-backup system will work in the normal case
  • (Note that we are omitting several details here that will be expanded on below. This is just to give you the idea; a toy code sketch of the normal-case flow follows this list.)
  • There are two servers, the primary and the backup. They each have a copy of the Application.
    • Clients submit requests to the primary, who replicates them to the backup before responding.
  • This basic workflow immediately raises several challenges:
    1. We now have two copies of the Application. How do we make sure that they are in sync at all times?
    2. Each request gets executed twice—once on the primary and once on the backup. How do we know the state changes and results will be the same? Also, how do we prevent global side effects of the request from occurring twice?
  • Our goal is to design a protocol that addresses these challenges. We will ensure that:
    • the backup is at most one request ahead of the primary
      • Side note: It is absolutely fundamental to distributed systems that you cannot keep two copies of some data on two different machines perfectly in sync.
        • If the primary tells the backup to execute a request, the primary cannot know exactly when the backup will execute it, because the network has arbitrary delay.
        • Since it can't know exactly when the backup will execute it, the primary has no way to execute the request at exactly the same time.
        • So they must be a little bit out of sync, just because of the potential for network delay.
        • But, we can have the backup inform the primary when the request is finished executing, so that the primary can execute it and move on.
        • This puts a bound on how far out of sync the two replicas can get, and is a very common idea in distributed protocol design.
    • the primary and backup execute exactly the same sequence of requests in the same order
    • the Application is deterministic and has no global side-effects
      • ("Deterministic" means that if you execute the same request from the same state, you always get the same state changes and response. The key-value store Application is deterministic, and we will assume that all our Applications are deterministic in this class.)
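
Here is a toy, single-process illustration of that ordering, reusing the Application type from the earlier sketch. Plain method calls stand in for network messages, so there is no retransmission, failure handling, or view logic here; the only point is the order of execution the list above describes: the backup executes (and acknowledges) each request before the primary executes it and replies.

```java
// Toy illustration only: method calls stand in for network messages, and all
// class names are invented. A real protocol must also handle drops, dups,
// retransmission, at-most-once semantics, and view changes.
class BackupSketch {
    private final Application application;
    BackupSketch(Application application) { this.application = application; }

    // "Forward" message from the primary: the backup executes first, then acks.
    Result onForward(Command command) {
        return application.execute(command);
    }
}

class PrimarySketch {
    private final Application application;
    private final BackupSketch backup;   // stands in for the network link to the backup
    PrimarySketch(Application application, BackupSketch backup) {
        this.application = application;
        this.backup = backup;
    }

    // "Request" from a client: replicate to the backup, wait for its ack, then
    // execute locally and reply. Both copies execute the same commands in the
    // same order, and the Application is deterministic, so they stay in sync.
    Result onClientRequest(Command command) {
        backup.onForward(command);            // backup executes (and acks) first
        return application.execute(command);  // then the primary executes and replies
    }
}
```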

Node Failures

  • Remember that our goal was to improve our RPC system by tolerating (some) node failures. How are we doing so far?
  • Well, let's consider what happens if the primary fails.
    • That was the server that clients were submitting requests to. So if that server is no longer available, we need to establish a new server as primary and tell clients to go talk to that server.
      • (In our protocol, we will always select the previous backup as the new primary. This will turn out to be crucial for our protocol's correctness.)
  • If the backup fails, then the primary won't be able to replicate any more requests to the backup, so again clients stop making progress.
    • We need to either let the primary declare the backup dead and continue without replicating, or we need to replace the backup.
  • In either case, the basic idea is to failover to another server. ("Failover" basically means to choose a new server to play the role of the old server that failed.)
    • To do this, we need to first detect that the server has crashed, then establish a replacement, and inform everyone who needs to know to talk to the replacement instead.
  • Here is a straw proposal that does not work at all (do not implement this in lab 2!)
    • When the primary hasn't heard from the backup for a while, the primary can decide to replace it with a new backup.
    • Similarly, when the backup hasn't heard from the primary in a while, the backup can decide to become the new primary.
    • Why is this a bad idea? Suppose the network link between the primary and backup is down, so they cannot exchange messages.
      • Then neither of them will hear from the other, so they will both assume the other node has failed, even though actually both nodes are fine.
      • So the primary will pick a new backup and start accepting client requests again by replicating them to the new backup.
      • But the old backup will declare itself the new primary, select itself another backup, and also start accepting client requests.
    • This leads to a scenario where we have two completely separate primaries that are functioning independently and accepting client requests.
      • Their Applications will diverge, and the result is a system that does not at all behave like our old single RPC server.
      • In distributed systems, people often call this situation "split brain" because the system has essentially split into two independent systems that don't know about each other (which is bad, because the system is supposed to present a unified view of the Application to the clients).

  • Remember that it's a fundamental property of distributed systems that you cannot distinguish node failure from network failure. The symptoms of both failures are that you don't hear from the other node.
    • So it is impossible to accurately detect whether a node has crashed.
  • That's why it's a bad idea to allow different nodes to decide on their own that some other nodes are dead.
  • Instead, we need to find a way to tolerate failures that doesn't depend on accurately detecting whether those failures actually occurred or not.

The centralized view service

  • One way to avoid nodes disagreeing about whether other nodes have crashed is to have one node make all of those decisions on everyone's behalf. This is the approach we will take in lab 2.
  • The idea is to introduce a separate node that we will call the view server or view service, that monitors all the other nodes in the system and decides which ones of them have crashed.
    • Note that the view server can be wrong about whether a node is actually up or down, but it doesn't matter. Other nodes will always treat the view server's opinion about which nodes are up as if it were the truth.
  • Once the view server has an idea about which nodes are up, it can select a primary and a backup and inform everyone (including clients).
  • If a node goes down, the view server will detect it and orchestrate a failover to replace that node.
  • Notice that this solves the split brain issue above, because now all decisions about failover are centralized at the view server. Yay!
  • Unfortunately, this approach also has a major drawback, which is that if the view service crashes, the system will not be able to tell who is the primary and who is the backup, and no further failovers will be possible.
    • This is unsatisfying! But the protocol is already quite interesting and complex enough that it makes for excellent practice in lab 2.
    • In a few weeks, we will study Paxos, which is a fully general replication solution that can tolerate many node failures.
      • It is quite a bit more subtle and complicated than this protocol, so warming up with a view-service-based primary-backup protocol in lab 2 will prepare you well for lab 3.
    • It's ok to be a little disappointed in this protocol! We promise it gets more satisfying when we get to Paxos.
  • For lab 2 and the rest of our discussion of primary-backup, we will assume that the view server never crashes.
  • That said, notice that this design is already quite a bit better than our single RPC server in lab 1
    • With only one server, if that server crashed, no client could make progress.
    • But with primary-backup, we will be able to tolerate either of those nodes crashing, and clients will still be able to make progress after failover.
      • If the view server crashes, then we won't be able to failover, but we've still tolerated crashes in all but one of our nodes!

Recap so far

  • The basic setup is that there will be a primary server and a backup server working together to serve client requests. Here is the "normal case" workflow.
    • Clients send requests to the primary, who forwards them to the backup.
    • The backup executes the request and sends an acknowledgement to the primary.
    • The primary then executes the request and responds to the client.
  • This basic workflow works fine as long as no (node) failures occur.
  • When either the primary or the backup fails, we need to do "failover", meaning that we replace the failed server with another available server.
    • An easy approach to failover is manual failover: the human operator of the system stops all incoming client requests, manually figures out which server crashed, decides which server to replace it with, reconfigures the system to use the new server, makes sure the state of both servers is up to date and in sync, and then allows incoming client requests to resume.
    • The downside of manual failover is that it is manual.
  • We started to discuss an automated failover solution that used a centralized view service.
    • The view service will be a single node whose job is to decide what other servers are up/down.
    • (Accurate failure detection is impossible in our fault model, so the view service can and will be wrong about whether a server has crashed or not. That turns out not to matter too much. What's most important is that, by centralizing the fault detection in one node, we guarantee that nobody will disagree about whether some node is up or down. The answer is always just whatever the view service thinks, even if it is wrong.)

Extending the protocol with the view server

  • There are three kinds of nodes:
    • some number of clients
    • some number of servers
    • one view server
  • The clients want to submit Commands and get Results.
  • The servers are willing to host the Application and play the role of primary or backup.
  • The view service decides which servers are playing which role currently.

  • All servers send ping messages to the view server regularly
  • The view server keeps track of the set of servers who have pinged it recently and considers those servers to be up.
  • The view server selects a server from the set of "up" servers to serve as the primary and another to serve as the backup.
  • Later, if the primary or backup stops pinging the view server, then the view server will change its mind about who is primary and who is backup.
  • To support the view server changing its mind over time, we will introduce "views": a view is the view server's current opinion about who is primary and who is backup.
    • Views will have a version number called the view number that increments every time the view server wants to do a failover
    • When the view server increments the view number and selects a new primary and backup, this is called a "view change".
  • Since the primary and backup roles are played by different servers over time, a client can only learn who the current primary is by asking the view server.
  • Within a single view, there is at most one primary and at most one backup.
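
Below is a rough sketch of the view server's bookkeeping, with a View record carrying the three pieces of information described in the next list. All names are invented, and several important pieces are deliberately omitted (the transition out of the startup view, waiting for the current view to actually be started before changing views again, and adding a backup when an idle server appears), so treat it as a starting point for thinking about lab 2, not a solution.

```java
import java.util.HashSet;
import java.util.Set;

// Rough sketch only; names invented. Omits the startup-view transition, waiting
// for the current view to be "started", adding a backup when one becomes
// available, tie-breaking among idle servers, and more.
class ViewServerSketch {
    record View(int viewNum, String primary, String backup) {}

    private View currentView = new View(0, null, null);   // view 0: the "startup view"
    private final Set<String> recentlyPinged = new HashSet<>();

    void onPing(String server) {
        recentlyPinged.add(server);            // this server is considered up, for now
    }

    View getView() {
        return currentView;                    // what clients and servers ask for
    }

    // Called once per timeout period: anyone who has not pinged since the last
    // check is considered down (possibly wrongly; that is fine, as discussed above).
    void onCheckTimeout() {
        boolean primaryDown = currentView.primary() != null
                && !recentlyPinged.contains(currentView.primary());
        boolean backupDown = currentView.backup() != null
                && !recentlyPinged.contains(currentView.backup());
        if (primaryDown || backupDown) {
            // View change: if the primary died, the old backup MUST become the
            // new primary (it is the only server with an up-to-date Application).
            String newPrimary = primaryDown ? currentView.backup() : currentView.primary();
            if (newPrimary == null) {
                return;   // no server has the up-to-date state: the system is stuck
            }
            String newBackup = anyRecentlyPingedServerOtherThan(newPrimary);   // may be null
            currentView = new View(currentView.viewNum() + 1, newPrimary, newBackup);
        }
        recentlyPinged.clear();                // start a fresh ping window
    }

    private String anyRecentlyPingedServerOtherThan(String exclude) {
        return recentlyPinged.stream()
                .filter(s -> !s.equals(exclude))
                .findFirst()
                .orElse(null);
    }
}
```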

  • A view contains three pieces of information
    • The view number (new views have higher view numbers)
    • The node who will play the role of primary in this view
    • The node who will play the role of backup in this view
  • When a client wants to submit a request, they first ask the view server for the current view
    • Then they go talk to the primary of that view
    • Clients are allowed to cache the view as long as they want, and keep talking to that primary.
      • If they don't hear from the primary after several retries, then they can go ask the view server whether there is a new view.
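
And here is a correspondingly rough sketch of the client-side caching just described, again with method calls standing in for the actual messages to the view server (all names invented).

```java
// Toy illustration: the client caches the view and only goes back to the view
// server when the cached primary appears unreachable.
class ClientViewCache {
    private final ViewServerSketch viewServer;    // stands in for the view-server RPC
    private ViewServerSketch.View cachedView;      // may be arbitrarily stale; that's allowed

    ClientViewCache(ViewServerSketch viewServer) {
        this.viewServer = viewServer;
    }

    // Who should the next request go to?
    String currentPrimary() {
        if (cachedView == null) {
            cachedView = viewServer.getView();     // ask once, then keep using the answer
        }
        return cachedView.primary();
    }

    // Called after several unanswered retries to the cached primary.
    void primaryNotResponding() {
        cachedView = null;                         // forget the cache; re-ask next time
    }
}
```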

Overview of the full protocol

  • There are several different kinds of nodes in the system.
    • There is one view server that provides views to the rest of the system.
    • There are some number of clients that want to submit requests to the system.
    • There are some number of servers that are available to play the roles of primary and backup (in a particular view) when asked to do so by the view service.
  • To execute their requests, clients will follow the "normal case" workflow.
    • This requires that clients know who the primary is.
    • They will get this information by asking the view service.
  • Servers do what they are told by the view service. If a server \(S_1\) was primary of view number 1, and then later it hears that the view service moved to view number 2, then \(S_1\) stops doing what it was doing and moves into the new view, taking whatever role the view service assigns to it.
    • \(S_1\) does this even if the reason the view service moved into view 2 was because the view service thought that \(S_1\) was down even though it was not. \(S_1\) does not try to "correct" the view service, but just does what it was told.
    • (It's more important to have a consistent source of information about which servers are up or down than it is to have correct information, although the information should be right most of the time in order for the system to make progress.)
  • The view service tracks which servers are up and assigns them roles in each view.

Scenarios

Here are a few scenarios that can happen during the lifetime of the system.

  • At the beginning of time, the view service has received no pings yet, so it does not know if any servers are up. So there is no primary and no backup yet.
    • In the labs, this is view number 0, called the "startup view".
    • It is not a fully functional view: clients cannot execute requests because there is no primary yet.
    • Every view numbered larger than 0 will have a primary.
  • As soon as the view service hears one ping, it will select that server as the primary of view number 1, called the "initial view" in the labs.
    • There is no backup in view number 1.
    • In general, the view service is allowed to create views that have a primary but no backup, if it does not have enough available servers to find a backup.
    • When operating without a backup, the primary acts essentially like the RPC server from lab 1. It just executes any client requests it receives and responds to them.
    • In this scenario, if the primary fails, the system is stuck forever.
  • Later, after fully setting up view number 1, the view service will hopefully have heard pings from at least one other server. It can then select one of those servers to be backup.
    • Since the view server is updating who is backup, it must create a new view by incrementing the view number again. It should not "update" the backup for the already created view number 1.
  • In a typical view when everything is working well, there will be a primary and a backup.
    • Clients can learn who the primary is by asking the view service.
    • Clients then submit their requests to that primary, who follows the "normal case" workflow to execute those requests.
    • Things go on like this until some kind of failure occurs.
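
Putting these scenarios together, one possible start-up trace looks like this (server names S1 and S2 are hypothetical):

```
view 0 (startup): primary = none, backup = none   no pings yet; no requests can be served
view 1 (initial): primary = S1,   backup = none   S1's ping arrived first
view 2:           primary = S1,   backup = S2     another server, S2, started pinging
```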

Failure scenarios

The backup fails

  • Suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\) , and then \(S_2\) fails.
  • Clients can no longer successfully execute requests, because the primary needs to talk to the backup before it executes each request, but the backup is down.
  • The view service will detect that the backup is down because it stops pinging.
  • The view service initiates a view change by incrementing the view number to 4 and selecting a new primary and backup.
    • Since the primary did not fail, the view service leaves the primary as is, so the primary for view number 4 is also \(S_1\).
    • The view service selects a new backup server from the set of available servers. Let's say it selects \(S_3\), so \(S_3\) is the backup for view number 4.
  • We want to start executing client requests again. To do that, we need to be able to execute them both on the backup and on the primary, and we need to know that those two applications are in sync with each other.
    • It's important that they are in sync so that if another failure occurs, say the primary fails later, the backup can take over and return correct answers.
  • The problem is that \(S_3\) was neither primary nor backup in the previous view, so it does not have an up-to-date copy of the application. So it is not ready to begin executing client requests.
  • In order to prepare the new backup, the primary needs to do a state transfer, which involves sending a copy of the entire current application state to the backup.
  • The backup replaces whatever application state it had with what the primary sends it, and then acknowledges back to the primary that it has received the state.
  • Now the primary can consider the view to be really started, and start executing client requests.
    • Before the state transfer has been acknowledged, the primary cannot accept any client requests.
  • Once the state is transferred to the backup, the protocol is fault tolerant again.
    • We need to tell the view service that this has happened! Otherwise, the view service will not know that it is ok to do another failover if it detects the new primary has crashed.
    • If, on the other hand, the new primary crashed before finishing the state transfer, then the system is stuck. It cannot tolerate two node failures that happen so soon one after the other that we didn't have a chance to complete the state transfer.
  • We also need to tolerate message drops during the state transfer process.
    • The primary needs to retransmit the state transfer message until it gets an acknowledgement.
  • We also need to tolerate message dups and delays during the state transfer process.
    • The backup should not perform a state transfer for an old view, nor for the current view if it already has done a state transfer.
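
The guards in the last two bullets might look roughly like this on the backup side (a sketch with invented names; sending the acknowledgement is left as a stand-in).

```java
// Rough sketch of the backup's state-transfer handler (names invented). It
// implements the two rules above: never install state for a view other than
// the one we are currently in, and install at most once per view.
class BackupStateTransferSketch {
    private Application application;        // this replica's copy of the Application
    private int currentViewNum;             // learned from the view server
    private int lastInstalledViewNum = -1;  // view we last accepted a state transfer in

    void onStateTransfer(int viewNum, Application stateCopy) {
        if (viewNum != currentViewNum) {
            return;   // old view, or a view we haven't learned about yet: ignore;
                      // the primary will keep retransmitting until we ack
        }
        if (lastInstalledViewNum == currentViewNum) {
            ackStateTransfer(viewNum);      // duplicate: keep our state, but re-ack so
            return;                         // the primary can stop retransmitting
        }
        application = stateCopy;            // replace our entire application state
        lastInstalledViewNum = viewNum;
        ackStateTransfer(viewNum);
    }

    private void ackStateTransfer(int viewNum) {
        // stand-in for sending a StateTransferAck(viewNum) back to the primary
    }
}
```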

The primary fails

Consider a similar situation to the previous section.

  • Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\), and then \(S_1\) fails.
  • Clients can no longer successfully execute requests, because the primary is down.
  • The view service will detect that the primary is down because it stops pinging.
  • The view service initiates a view change by incrementing the view number to 4 and selecting a new primary and backup.
  • We know that the first thing the new primary is going to do is state transfer.
    • Therefore, in order not to lose committed data, the only reasonable choice for the new primary is the old backup, since the old backup is the only available server with an up-to-date copy of the application.
    • The key constraint is: if a client has received a response to a request in the past, then we must guarantee that the new primary has an application state that reflects the results of that request.
    • Again, the only way to satisfy this constraint is by selecting the old backup as the new primary.
  • So the view service will select \(S_2\) as the new primary, and any other available server, say \(S_3\), as the new backup.
  • The view service informs servers about the new view. The new primary will initiate a state transfer as before. Once the state transfer is complete, client requests can resume being executed.
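
In compact form, this scenario looks like the following (hypothetical servers):

```
view 3: primary = S1, backup = S2    S1 crashes
view 4: primary = S2, backup = S3    S2, the old backup, is the only safe choice for
                                     primary; it state-transfers to S3 before any
                                     client requests are executed in view 4
```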

The situation with no available servers

  • Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\), and that there are no other servers available. (There might be other servers, but they already crashed or for some reason they cannot ping the view service at the moment, so they are not available to be primary or backup.)
  • If either server fails, there is no server to replace it. What should we do?
    • One option would be to do nothing. Just refuse to execute client requests.
    • At least this doesn't return wrong answers! But it also doesn't make progress.
  • A slightly better approach is to move to a view where the one remaining available server acts as primary on its own.
    • In this situation, we cannot tolerate any more crashes, but at least we can make progress after the previous crash.
    • If we get lucky, and some other server that the view service thought was down actually was ok and starts pinging again, then we can view change to yet another view with a primary and a backup, restoring fault tolerance.
    • If the primary crashes during a view without a backup, then the system is stuck forever, even if some other servers later become available.
      • We cannot use those nodes because they do not satisfy the constraint that their application state reflects the results of all requests that clients have received responses for.
      • The only way out of this situation is if the most recent primary somehow comes back. The view service can then change to a view using one of the new servers as the backup.
  • If the primary does not have a backup, then it executes client requests by acting alone, similar to the RPC server from lab 1. It does not forward the requests anywhere, but instead immediately executes them and responds to the client.

Lab 2

  • We have presented an overview of the kind of primary-backup protocol we want you to design in lab 2.
  • We have intentionally left out many details! These are for you to design and fill in.
    • There are many different ways to get a correct protocol. There is not just one right answer.
  • The act of taking high level ideas and turning them into concrete distributed protocols that can be implemented, evaluated, and deployed is a core skill that we hope you develop in this course!
  • Do not be surprised if the design process is harder than the implementation process in labs 2 through 4!

Consistency Models and Linearizability

Agenda

  • Answer the question "What does it mean for a state-machine replication algorithm to be correct?"

Correctness of state-machine replication

  • It's actually sort of surprising that we haven't had to answer this question yet!
  • Our hand-wavy answer so far has been that the replicas are "equivalent" to one machine from the clients' perspective.
  • In other words, clients can pretend that there's just one copy of the Application, which just happens to be very fault tolerant.
    • There are some (mostly irrelevant) details here because clients have to send their operations to the primary, which might change over time, so they have to be aware that there are multiple servers. But this detail is hidden by the client library that we, the authors of primary-backup, write.
    • So the application-layer's API to the state machine is actually the same as in lab 1: sendCommand()/getResult().
  • If we look at state-machine replication at a high level, we have:
    • an opaque box in charge of the replication
    • clients send requests into the box and wait to get responses back
    • there can be multiple clients interacting with the system at once

The case of one client submitting one request at a time

  • If we think about the simplest case where we only have one client, and they only send one request at a time, then the system should evolve "linearly":
    • the client retransmits its current request forever until it gets a response
    • then it sends the next request, and so on
  • So in the situation with one client, the system is "equivalent" to executing the client's commands in order, starting from the initial state of the state machine/Application.
    • Since the Application is (assumed to be) deterministic, there is only one right answer for a particular sequence of commands, so this tells us whether the system was correct or not (did it return the right answers?)

The case of two clients each submitting one request

  • Now consider the case where there are two clients, and they each submit a request.
  • These requests can arrive at servers in the system in either order.
    • Do we need to try to reorder them somehow?
  • Intuitively, no. Either order is ok.
    • The reason is that the clients are prepared to handle delays in the network, so they cannot possibly be expecting one to definitely arrive first. So we are free to execute them in either order.
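
As a hypothetical trace, using the key-value append example from earlier (key "foo" starts empty), drawn as regions of time from when each request is sent until its response arrives (a notation we define more carefully below):

```
client 1:  |---- append(foo, "x") ----|
client 2:      |---- append(foo, "y") ----|
```

  • The two regions overlap in time, so the requests are concurrent; either order of execution is allowed.
  • Allowed: order [x, y], so client 1 gets "x" and client 2 gets "xy".
  • Allowed: order [y, x], so client 2 gets "y" and client 1 gets "yx".
  • Not allowed: client 1 gets "x" and client 2 gets "y", because no single order of execution explains both responses.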

Consistency models

  • The above examples show us the flavor of a consistency model.
    • Given an execution of a distributed system, a consistency model says whether or not it is "correct".
    • Different consistency models have different definitions of correct.
  • For this discussion, we are going to consider two ways of representing an execution:
    • The bird's-eye space-time diagram model
    • The request-response-execution model

Bird's-eye space-time diagrams

  • We have seen a few of these before.
  • Here we'll use a simplified version of this kind of diagram, where we only draw clients, not servers.
  • Instead of showing requests going into the state-machine replication box and responses coming out, we will abbreviate this using the "regions of time" notation.
    • A region begins when the client first sends the request to the system. Labeled with the request.
    • A region ends when the client first receives the response to the current request (clients only have one outstanding request). Labeled with the response.
  • We represent the timing information visually in the picture, but you could equivalently think of it as labeling the beginning and end of each region with a bird's-eye timestamp.

Request-response execution model

  • A list of all client requests and their responses, in the order these events occurred.

  • So a consistency model takes one of these executions (in either representation) as input and tells you either "yes, that is allowed" or "no, that is not allowed".

Sequential consistency

  • This is a consistency model with this definition:
    • there is a global order on requests;
    • responses are computed as if requests executed in that global order; and,
    • the global order agrees with each client's local order

Linearizability

  • This is a consistency model with this definition:
    • Everything from sequential consistency; and
    • If request \(r_2\) is submitted after the response to \(r_1\) is received, then \(r_1\) must appear before \(r_2\) in the global order.
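
As a worked example of the difference, consider the following history, again on the key-value store with key "foo" initially empty. Client 2's request begins only after client 1 has already received its response:

```
client 1:  |-- append(foo, "x") -> "x" --|
client 2:                                    |-- get(foo) -> "" --|
```

  • Sequential consistency allows this history: the global order [get, append] produces exactly these responses, and it respects each client's local order (each client issued only one request, so there is nothing local to violate).
  • Linearizability does not allow it: the get was submitted after the response to the append was received, so the append must appear before the get in the global order, which would force the get to return "x".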