# Lecture 5 and 6: Primary-Backup

## Agenda

- Primary-Backup replication
- lab 2
  - more about lab 2 in section tomorrow
  - lab 2 due 1/28/22
- We will return to our discussion about transition systems soon!

## Motivation for Primary-Backup

### So far

- We've talked about RPC (and you're working on lab 1)
  - Clients send commands to server, get responses
  - The server executes the requests in the order it receives them
  - We can use sequence numbers and retransmission to handle all network failures (in our failure model): drops, dups, delays, reorderings.
- Great job tolerating all network failures in the model! Yay!
- If the server crashes, then the system stops working. So our RPC system doesn't do a very good job at tolerating node failures at all.

### Towards tolerating node failures

- Goal: tolerate (some) node failures
- Technique: replicate
  - Have multiple copies of (essentially) our RPC server, so that if some of the copies crash, the others can still make progress.
- Primary-backup replication will be our first step towards tolerating node failures.
  - Important to say right at the very beginning that our primary-backup protocol has many shortcomings! We will discuss these extensively, and fix them later in the quarter.
- Basic idea of primary-backup replication is to have two servers: the primary and the backup.
  - We will design the protocol so that if either the primary or the backup crashes, clients will still be able to get a response from the system.

## State-Machine Replication

Before we get to the details of primary-backup, let's reflect some more on the way RPCs were set up in lab 1.

### Input/Output State Machines (`Application`s in the labs)

- In the labs, our RPC implementation allows `Command`s to be submitted to an `Application`, which returns `Result`s.
- This `Application` can also be thought of as a state machine (but a different kind than transition systems!)
  - Remember that in CS, the phrase "state machine" can mean lots of different things.
*Usually* it means a finite state machine like a DFA or an NFA, but in this class, we will use it to mean something like an `Application`.
- These kinds of state machines have a (potentially infinite) state space (e.g., a key-value store might have a `Map<String,String>` as its state)
- And, they execute by taking a sequence of steps.
  - At each step, it:
    - consumes exactly one input (called a `Command` in the labs)
    - produces exactly one output (called a `Result` in the labs)
    - updates its state
- Our RPC implementation doesn't care exactly what the `Application` does, as long as it has this interface.
- We will use the phrase "input/output state machine" to refer to this kind of state machine, represented by `Application` in the labs.
  - Again, not to be confused with transition systems, which are similar, but don't have input/output at each step.

### RPC as hosted state machines

- In lab 1, you are implementing an RPC service that *hosts* an application over the network.
  - It accepts `Command`s from clients over the network and sends back `Result`s also over the network
  - Separates concerns between network issues versus the `Application`
- In lab 1, you are implementing a hosted key-value store.
  - But the key-value store is just an `Application` that doesn't know anything about the network.
    - Just define the state, `Command`s, and `Result`s, and then implement the execute method.
  - And the RPC client and server, plus the `AMOApplication` wrapper together work to host the `Application`, but they don't care at all about its internals!
- This separation is really great!
  - It means that we can swap out the `Application` for a different one, and all the RPC stuff will continue to work just fine to host the new `Application` over the network.
  - And it means we can swap out the hosting mechanism, by replacing the simple RPC service with something more sophisticated.
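
To make the interface concrete, here is a minimal Python sketch of an input/output state machine and of a host that only ever calls `execute`. The labs have their own `Application` interface; the names `KVStore`, `Put`, `Get`, and `host`, and the string results, are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Put:
    key: str
    value: str


@dataclass(frozen=True)
class Get:
    key: str


Command = Union[Put, Get]  # hypothetical command types for a key-value store


class KVStore:
    """One command in, one result out, state updated along the way."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}

    def execute(self, command: Command) -> str:
        if isinstance(command, Put):
            self.state[command.key] = command.value
            return "PutOk"
        if isinstance(command, Get):
            return self.state.get(command.key, "KeyNotFound")
        raise TypeError(f"unknown command: {command!r}")


def host(app, commands: List[Command]) -> List[str]:
    """Stand-in for the hosting layer: it never looks inside the application."""
    return [app.execute(c) for c in commands]


print(host(KVStore(), [Put("a", "1"), Get("a"), Get("b")]))
# ['PutOk', '1', 'KeyNotFound']
```

Because `host` depends only on `execute`, any other object with that method could be swapped in without changing the hosting code, which is the separation described above.
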
## Primary-Backup State-Machine Replication

### The basic idea

- Our Primary-Backup system will host state machines much in the same way that the RPC service of lab 1 did.
- Main difference is that there will be *multiple* copies of the `Application`, one for each replica in the system.
- The next figure shows the idea of how the primary-backup system will work in the normal case.
  - (Note that we are omitting several details here that will be expanded on below. This is just to give you the idea.)

- There are two servers, the primary and the backup. They each have a copy of the `Application`.
- Clients submit requests to the primary, who replicates them to the backup before responding.
- This basic workflow immediately raises several challenges:
  1. We now have two copies of the `Application`. How do we make sure that they are in sync at all times?
  2. Each request gets executed twice: once on the primary and once on the backup. How do we know the state changes and results will be the same? Also, how do we prevent global side effects of the request from occurring twice?
- Our goal is to design a protocol that addresses these challenges. We will ensure that:
  - the backup is *at most one* request ahead of the primary
    - Side note: It is absolutely fundamental to distributed systems that you cannot keep two copies of some data on two different machines perfectly in sync.
      - If the primary tells the backup to execute a request, the primary cannot know exactly when the backup will execute it, because the network has arbitrary delay.
      - Since it can't know exactly when the backup will execute it, the primary has no way to execute the request at exactly the same time.
      - So they *must* be a little bit out of sync, just because of the potential for network delay.
    - But, we can have the backup inform the primary when the request is finished executing, so that the primary can execute it and move on.
    - This puts a *bound* on how far out of sync the two replicas can get, and is a very common idea in distributed protocol design.
  - the primary and backup execute exactly the same sequence of requests in the same order
  - the `Application` is deterministic and has no global side-effects
    - ("Deterministic" means that if you execute the same request from the same state, you always get the same state changes and response. The key-value store `Application` is deterministic, and we will assume that all our `Application`s are deterministic in this class.)
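
Here is a small sketch of the normal case under strong simplifying assumptions: messages are modelled as direct method calls, nothing is dropped, and no node fails. The classes `Counter`, `Backup`, and `Primary`, and the use of sequence numbers on replicated requests, are hypothetical choices for illustration and not the lab 2 protocol. The point is the ordering: the backup executes first and acknowledges, and only then does the primary execute and respond.

```python
class Counter:
    """A tiny deterministic application (hypothetical): each command adds an int."""

    def __init__(self) -> None:
        self.total = 0

    def execute(self, command: int) -> int:
        self.total += command
        return self.total


class Backup:
    def __init__(self, app: Counter) -> None:
        self.app = app
        self.executed = 0  # sequence number of the last request executed

    def replicate(self, seq: int, command: int) -> int:
        # Execute only the next request in order; a retransmitted duplicate
        # is acknowledged but not executed again.
        if seq == self.executed + 1:
            self.app.execute(command)
            self.executed = seq
        return self.executed  # acknowledgement sent back to the primary


class Primary:
    def __init__(self, app: Counter, backup: Backup) -> None:
        self.app = app
        self.backup = backup
        self.executed = 0

    def handle(self, command: int) -> int:
        seq = self.executed + 1
        acked = self.backup.replicate(seq, command)
        # The backup has executed request `seq` and the primary has not yet:
        # the backup is one request ahead, and never more than that.
        if acked != seq:
            raise RuntimeError("backup is out of step with the primary")
        result = self.app.execute(command)  # the primary catches up
        self.executed = seq
        return result  # only now does the client get a response


backup = Backup(Counter())
primary = Primary(Counter(), backup)
print([primary.handle(c) for c in (5, 7, -2)])   # [5, 12, 10]
print(primary.app.total == backup.app.total)     # True
```

Both replicas end in the same state only because `Counter` is deterministic and both execute the same commands in the same order.
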
### Node Failures

- Remember that our goal was to improve our RPC system by tolerating (some) node failures. How are we doing so far?
- Well, let's consider what happens if the primary fails.
  - That was the server that clients were submitting requests to. So if that server is no longer available, we need to establish a new server as primary and tell clients to go talk to that server.
  - (In our protocol, we will always select the previous backup as the new primary. This will turn out to be crucial for our protocol's correctness.)
- If the backup fails, then the primary won't be able to replicate any more requests to the backup, so again clients stop making progress.
  - We need to either let the primary declare the backup dead and continue without replicating, or we need to replace the backup.
- In either case, the basic idea is to *failover* to another server. ("Failover" basically means to choose a new server to play the role of the old server that failed.)
  - To do this, we need to first detect that the server has crashed, then establish a replacement, and inform everyone who needs to know to talk to the replacement instead.
- Here is a straw proposal that does not work at all (**do not implement this in lab 2!**)
  - When the primary hasn't heard from the backup for a while, the primary can decide to replace it with a new backup.
  - Similarly, when the backup hasn't heard from the primary in a while, the backup can decide to become the new primary.
- Why is this a bad idea? Suppose the network link between the primary and backup is down, so they cannot exchange messages.
  - Then neither of them will hear from the other, so they will both assume the other node has failed, even though actually both nodes are fine.
  - So the primary will pick a new backup and start accepting client requests again by replicating them to the new backup.
  - But the old backup will declare itself the new primary, select itself another backup, and also start accepting client requests.
  - This leads to a scenario where we have two completely separate primaries that are functioning independently and accepting client requests.
  - Their `Application`s will diverge, and the result is a system that does not at all behave like our old single RPC server.
  - In distributed systems, people often call this situation "split brain" because the system has essentially split into two independent systems that don't know about each other (which is bad, because the system is supposed to present a unified view of the `Application` to the clients).
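
Split brain is easy to reproduce in a toy model. The sketch below deliberately implements the broken straw proposal, so it is an example of what not to do. The `Server` class and its `peer_timed_out` rule are hypothetical, and replacement backups are not modelled at all.

```python
from typing import Dict


class Server:
    def __init__(self, name: str, role: str) -> None:
        self.name = name
        self.role = role  # "primary" or "backup"
        self.store: Dict[str, str] = {}

    def peer_timed_out(self) -> None:
        # Straw rule (broken on purpose): silence is treated as a crash.
        # A primary just carries on; a backup promotes itself.
        if self.role == "backup":
            self.role = "primary"

    def put(self, key: str, value: str) -> None:
        if self.role != "primary":
            raise RuntimeError(f"{self.name} is not a primary")
        self.store[key] = value


a = Server("a", "primary")
b = Server("b", "backup")

# The link between a and b goes down. Both nodes are actually fine,
# but each one stops hearing from the other.
a.peer_timed_out()
b.peer_timed_out()

# Two clients, one on each side of the broken link.
a.put("x", "from client 1")
b.put("x", "from client 2")

print(a.role, b.role)      # primary primary
print(a.store == b.store)  # False
```

Neither server did anything wrong by its own local rule, yet the two stores now disagree about `x`.
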

- Remember that it's a fundamental property of distributed systems that you cannot distinguish node failure from network failure. The symptoms of both failures are that you don't hear from the other node.
  - So it is impossible to accurately detect whether a node has crashed.
- That's why it's a bad idea to allow different nodes to decide on their own that some other nodes are dead.
- Instead, we need to find a way to tolerate failures that doesn't depend on accurately detecting whether those failures actually occurred or not.

### The centralized view service

- One way to avoid nodes disagreeing about whether other nodes have crashed is to have one node make all of those decisions on everyone's behalf. This is the approach we will take in lab 2.
- The idea is to introduce a separate node that we will call the view server or view service, that monitors all the other nodes in the system and decides which ones of them have crashed.
  - Note that the view server can be wrong about whether a node is actually up or down, but it doesn't matter. Other nodes will always take the view server's opinion about which nodes are up as if it were the truth.
- Once the view server has an idea about which nodes are up, it can select a primary and a backup and inform everyone (including clients).
- If a node goes down, the view server will detect it and orchestrate a failover to replace that node.
- Notice that this solves the split brain issue above, because now all decisions about failover are centralized at the view server. Yay!
- Unfortunately, this approach also has a major drawback, which is that if the view service crashes, the system will not be able to tell who is the primary and who is the backup, and no further failovers will be possible.
  - This is unsatisfying! But the protocol is already quite interesting and complex enough that it makes for excellent practice in lab 2.
- In a few weeks, we will study Paxos, which is a fully general replication solution that can tolerate many node failures.
  - It is quite a bit more subtle and complicated than this protocol, so warming up with a view-service-based primary-backup protocol in lab 2 will prepare you well for lab 3.
  - It's ok to be a little disappointed in this protocol! We promise it gets more satisfying when we get to Paxos.
- For lab 2 and the rest of our discussion of primary-backup, we will assume that the view server never crashes.
- That said, notice that this design is already quite a bit better than our single RPC server in lab 1
  - With only one server, if that server crashed, no client could make progress.
  - But with primary-backup, we will be able to tolerate either of those nodes crashing, and clients will still be able to make progress after failover.
  - If the view server crashes, then we won't be able to failover, but we've still tolerated crashes in all but one of our nodes!

### Extending the protocol with the view server

- There are three kinds of nodes:
  - some number of clients
  - some number of servers
  - one view server
- The clients want to submit `Command`s and get `Result`s.
- The servers are willing to host the `Application` and play the role of primary or backup.
- The view service decides which servers are playing which role currently.

- All servers send ping messages to the view server regularly
- The view server keeps track of the set of servers who have pinged it recently and considers those servers to be up.
- The view server selects a server from the set of "up" servers to serve as the primary and another to serve as the backup.
- Later, if the primary or backup stops pinging the view server, then the view server will change its mind about who is primary and who is backup.
- To support the view server changing its mind over time, we will introduce "views", whose name means "the view server's current *view* of who is primary and backup".
  - Views will have a version number called the view number that increments every time the view server wants to do a failover
  - When the view server increments the view number and selects a new primary and backup, this is called a "view change".
- Since the primary and backup roles are played by different servers over time, a client can only find out who the current primary is by asking the view server.
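
The ping bookkeeping and view changes can be sketched as follows. This is a simplified illustration, not the lab 2 view server: the names `View`, `ViewServer`, and `ping_timeout` are hypothetical, time is passed in explicitly as a number, a view is modelled as a (view number, primary, backup) triple, and everything else a real protocol needs is left out. It does follow the rule stated earlier that only the previous backup is ever promoted to primary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class View:
    view_num: int
    primary: Optional[str]
    backup: Optional[str]


class ViewServer:
    def __init__(self, ping_timeout: float) -> None:
        self.ping_timeout = ping_timeout
        self.last_ping: Dict[str, float] = {}
        self.view = View(0, None, None)

    def ping(self, server: str, now: float) -> View:
        self.last_ping[server] = now
        self._update(now)
        return self.view

    def _up(self, now: float) -> List[str]:
        # Servers that pinged recently are considered up (this may be wrong!).
        return [s for s, t in self.last_ping.items()
                if now - t <= self.ping_timeout]

    def _update(self, now: float) -> None:
        up = self._up(now)
        primary, backup = self.view.primary, self.view.backup
        if primary is None:
            # Very first view: any live server may become primary.
            primary = up[0] if up else None
        elif primary not in up:
            if backup not in up:
                return  # both roles lost: this sketch just keeps the old view
            primary, backup = backup, None  # promote the previous backup only
        if backup is not None and backup not in up:
            backup = None
        if backup is None and primary is not None:
            spares = [s for s in up if s != primary]
            backup = spares[0] if spares else None
        if (primary, backup) != (self.view.primary, self.view.backup):
            # A view change: new roles, higher view number.
            self.view = View(self.view.view_num + 1, primary, backup)


vs = ViewServer(ping_timeout=2)
vs.ping("a", 0)
vs.ping("b", 1)
vs.ping("c", 2)
print(vs.view)            # View(view_num=2, primary='a', backup='b')
print(vs.ping("b", 4))    # View(view_num=3, primary='b', backup='c')
```

In the last call, `a` has been silent for longer than the timeout, so the view server declares it down, promotes `b`, and picks `c` as the new backup, whether or not `a` has really crashed.
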

- A view contains three pieces of information
  - The view number (new views have higher view numbers)
  - The node who will play the role of primary in this view
  - The node who will play the role of backup in this view
- When a client wants to submit a request, they first ask the view server for the current view
  - Then they go talk to the primary of that view
  - Clients are allowed to cache the view as long as they want, and keep talking to that primary.
  - If they don't hear from the primary after several retries, then they can go ask the view server whether there is a new view.
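
The client side of this can be sketched as a small retry loop around a cached view. The names here are hypothetical: `get_view` stands in for asking the view server, and `send` stands in for one request attempt that returns `None` when no reply arrives. The retry count and the bound on view lookups are arbitrary values chosen so that the example terminates.

```python
from typing import Callable, NamedTuple, Optional


class View(NamedTuple):
    view_num: int
    primary: str
    backup: Optional[str]


class Client:
    def __init__(self,
                 get_view: Callable[[], View],
                 send: Callable[[str, str], Optional[str]],
                 retries_per_view: int = 3) -> None:
        self.get_view = get_view
        self.send = send
        self.retries_per_view = retries_per_view
        self.cached_view: Optional[View] = None

    def submit(self, command: str, max_view_lookups: int = 5) -> str:
        for _ in range(max_view_lookups):
            if self.cached_view is None:
                self.cached_view = self.get_view()  # ask the view server
            for _ in range(self.retries_per_view):
                reply = self.send(self.cached_view.primary, command)
                if reply is not None:
                    return reply  # keep the cached view for next time
            # Several retries went unanswered: drop the cache and ask again.
            self.cached_view = None
        raise TimeoutError("no primary answered")


# A fake environment: server "a" never answers, server "b" does.
views = [View(1, "a", "b"), View(2, "b", None)]
lookups = []


def get_view() -> View:
    lookups.append(1)
    return views[min(len(lookups) - 1, 1)]


def send(primary: str, command: str) -> Optional[str]:
    return f"done:{command}" if primary == "b" else None


client = Client(get_view, send)
print(client.submit("put x 1"))  # done:put x 1
print(client.submit("get x"))    # done:get x
print(len(lookups))              # 2
```

The second request reuses the cached view, so the view server is contacted only twice in total.
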