We've talked about RPC (and you've implemented it in lab 1)
Clients send commands to server, get responses
The server executes the requests in the order it receives them
We can use sequence numbers and retransmission to handle all network failures (in our failure model):
drops, dups, delays, reorderings.
Great job tolerating all network failures in the model! Yay!
If the server crashes, then the system stops working. So our RPC system doesn't tolerate node failures at all.
Even recovering manually from such a crash would be difficult; if the data was only stored in memory, it might simply be gone.
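As a warm-up, here is a minimal sketch of the server side of that sequence-number idea; the names here are invented for illustration and are not the lab's actual classes. The server remembers, per client, the last sequence number it executed and the cached result, so retransmitted duplicates get the cached answer instead of being executed again.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (invented names, not the lab's API): at-most-once request handling.
class AtMostOnceServer {
    // Per-client record of the last sequence number executed and its cached result.
    private final Map<String, Long> lastSeqNum = new HashMap<>();
    private final Map<String, String> lastResult = new HashMap<>();

    // Handle a request identified by (clientId, seqNum). Duplicates of already-executed
    // requests are answered from the cache rather than re-executed.
    String handleRequest(String clientId, long seqNum, String command) {
        Long last = lastSeqNum.get(clientId);
        if (last != null && seqNum <= last) {
            return lastResult.get(clientId);   // duplicate or stale retransmission
        }
        String result = execute(command);      // application-specific execution
        lastSeqNum.put(clientId, seqNum);
        lastResult.put(clientId, result);
        return result;
    }

    private String execute(String command) {
        return "result of " + command;         // stand-in for the real Application
    }
}
```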
Towards tolerating node failures
Goal: tolerate (some) node failures
Technique: replicate
Have multiple copies of (essentially) our RPC server, so that if some of the copies crash, the others can still make progress.
Straw proposal: Naive client-driven replication
Here is another one of James's famously bad straw proposals:
Clients operate similarly to lab 1 (part 3), except now there are two servers.
Clients submit requests to both servers
They wait until they hear back both responses (retransmitting as needed), then return the results (which hopefully agree!) to the application layer
The servers behave exactly like lab 1 (part 3), including the at-most-once stuff.
If one of the servers goes down, we can manually reconfigure the system to revert to using just the other server.
Note that we couldn't do this if we weren't sending requests to both
servers, since in that case, only one of the servers might have the data.
What goes wrong?
Is this protocol correct? Does it handle network failures? (We said it sort of manually handles node failures.)
No! Suppose we have two clients that both submit a request.
Each client sends its request to both servers
If those requests arrive in different orders at the two servers, each server will happily execute them in the order it received them.
The servers' states diverge, clients get contradictory responses, and everything breaks.
It's not that surprising that this doesn't work. In this straw proposal, the servers never talk to each other!
Primary-backup basics
Primary-backup replication will be our first step towards tolerating node failures.
Important to say right at the very beginning that our primary-backup protocol has many shortcomings!
We will discuss these extensively, and fix them later in the quarter.
Basic idea of primary-backup replication is to have two servers: the primary and the backup.
We will design the protocol so that if either the primary or the backup crashes, clients will still be able to get a response from the system.
We want the system to be "equivalent", from a client's perspective, to a single copy of the application.
More formally, we want the system to be linearizable. We will talk more about that soon.
For now, one key thing we want to avoid is two different copies of the
application both responding to client requests with independent and
contradictory results.
For example, in the key-value store, if client 1 appends "x" to key "foo",
and client 2 appends "y" to key "foo", then there are several allowed responses:
(assume that before this point, key "foo" stored no value / the empty string)
client 1's request gets there first, so it gets result "x". client 2's request gets there second, so it gets result "xy".
or the other way around: client 2's request is first, result "y"; client 1's request is second, result "yx"
What would not be allowed is client 1 getting result "x" and client
2 getting result "y", which would indicate that somehow both clients'
requests were executed "first".
The way this would happen is typically due to a "split brain"
problem, where two copies of the application both somehow think
they are "in charge" but they are operating independently and not
staying in sync. So they can give contradictory answers like this
because they are not even talking to each other.
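To make the allowed and disallowed outcomes concrete, here is a tiny, purely illustrative snippet (not lab code) that simulates the two possible execution orders of the two appends on a single copy of the store. A correct system must produce one of these two outcomes; returning "x" to client 1 and "y" to client 2 is only possible if there are two diverged copies.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: simulates appending to key "foo" in the two possible execution orders.
public class AppendOrders {
    // Append value to the key and return the key's new contents (the response the client sees).
    static String append(Map<String, String> store, String key, String value) {
        String updated = store.getOrDefault(key, "") + value;
        store.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        // Order 1: client 1's append executes first.
        Map<String, String> kv1 = new HashMap<>();
        System.out.println(append(kv1, "foo", "x") + ", " + append(kv1, "foo", "y")); // x, xy

        // Order 2: client 2's append executes first.
        Map<String, String> kv2 = new HashMap<>();
        System.out.println(append(kv2, "foo", "y") + ", " + append(kv2, "foo", "x")); // y, yx

        // A correct system never returns "x" to client 1 AND "y" to client 2: that would
        // mean both requests executed "first", on copies of the store that had diverged.
    }
}
```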
State-Machine Replication
Before we get to the details of primary-backup, let's reflect some more on the way RPCs were set up in lab 1.
Input/Output State Machines (Applications in the labs)
In the labs, our RPC implementation allows Commands to be submitted to an Application, which returns Results.
This Application can also be thought of as a state machine (but a different kind than transition systems!)
Remember that in CS, the phrase "state machine" can mean lots of different things. Usually it means a finite state machine like a DFA or an NFA,
but in this class, we will use it to mean something like an Application.
These kinds of state machines have a (potentially infinite) state space (e.g., a key-value store might have a Map<String,String> as its state)
And, they execute by taking a sequence of steps.
At each step, the machine:
consumes exactly one input (called a Command in the labs)
produces exactly one output (called a Result in the labs)
updates its state
Our RPC implementation doesn't care exactly what the Application does, as long as it has this interface.
We will use the phrase "input/output state machine" to refer to this kind of state machine, represented by Application in the labs.
Again, not to be confused with transition systems, which are similar, but don't have input/output at each step.
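As a concrete illustration, here is roughly what an input/output state machine could look like, with a key-value store as the Application. This is a simplified sketch, not the lab's real interface (which uses typed Command and Result classes rather than strings).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an input/output state machine: consume one input, produce one output, update state.
// (Simplified; the labs use richer Command/Result types instead of plain strings.)
interface SimpleApplication {
    String execute(String command);
}

// A key-value store as an Application: the state is just a Map<String, String>.
class SimpleKVStore implements SimpleApplication {
    private final Map<String, String> state = new HashMap<>();

    @Override
    public String execute(String command) {
        // Commands are "GET key", "PUT key value", or "APPEND key value".
        String[] parts = command.split(" ", 3);
        switch (parts[0]) {
            case "GET":
                return state.getOrDefault(parts[1], "");
            case "PUT":
                state.put(parts[1], parts[2]);
                return "OK";
            case "APPEND":
                String updated = state.getOrDefault(parts[1], "") + parts[2];
                state.put(parts[1], updated);
                return updated;
            default:
                return "ERROR: unknown command";
        }
    }
}
```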
RPC as hosted state machines
In lab 1, you are implementing an RPC service that hosts an application over the network.
It accepts Commands from clients over the network and sends back Results also over the network
Separates concerns between network issues versus the Application
In lab 1, you are implementing a hosted key-value store.
But the key-value store is just an Application that doesn't know anything about the network.
Just define the state, Commands, and Results, and then implement the execute method.
And the RPC client and server, plus the AMOApplication wrapper together work to host the Application, but they don't care at all about its internals!
This separation is really great!
It means that we can swap out the Application for a different
one, and all the RPC stuff will continue to work just fine to host
the new Application over the network.
And it means we can swap out the hosting mechanism, by replacing the simple RPC service with something more sophisticated.
Primary-Backup State-Machine Replication
The basic idea
Our Primary-Backup system will host state machines much in the same way that the RPC service of lab 1 did.
Main difference is that there will be multiple copies of the Application, one for each replica in the system.
The next figure shows the idea of how the primary-backup system will work in the normal case
(Note that we are omitting several details here that will be expanded on below. This is just to give you the idea.)
There are two servers, the primary and the backup. They each have a copy of the Application.
Clients submit requests to the primary, who replicates them to the backup before responding.
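To make the workflow concrete, here is a rough sketch of the primary's side of the normal case, reusing the SimpleKVStore sketch from above. All the message and method names are invented, and many details are omitted (no retransmission, no at-most-once handling, only one outstanding request at a time); this is just to convey the idea, not a prescription for lab 2.

```java
// Illustrative sketch of the primary's normal-case workflow (invented names; details omitted).
class PrimarySketch {
    private final SimpleKVStore app = new SimpleKVStore(); // the primary's copy of the Application
    private String pendingCommand;                         // request currently being replicated
    private String pendingClient;

    // A client request arrives: forward it to the backup before executing it ourselves.
    void onClientRequest(String client, String command) {
        pendingClient = client;
        pendingCommand = command;
        sendToBackup("Forward:" + command);
    }

    // The backup has executed the command and acknowledged: now execute it here and reply.
    void onBackupAck(String command) {
        if (command.equals(pendingCommand)) {
            String result = app.execute(pendingCommand);
            sendToClient(pendingClient, result);
            pendingCommand = null;
        }
    }

    private void sendToBackup(String message) { /* network send, omitted */ }
    private void sendToClient(String client, String message) { /* network send, omitted */ }
}
```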
This basic workflow immediately raises several challenges:
We now have two copies of the Application. How do we make sure that they are in sync at all times?
Each request gets executed twice—once on the primary and once on the backup. How do we know the state changes and results will be the same?
Also, how do we prevent global side effects of the request from occurring twice?
Our goal is to design a protocol that addresses these challenges. We will ensure that:
the backup is at most one request ahead of the primary
Side note: It is absolutely fundamental to distributed systems that you cannot keep two copies of some data on two different machines perfectly in sync.
If the primary tells the backup to execute a request, the primary cannot know exactly when the backup will execute it, because the network has arbitrary delay.
Since it can't know exactly when the backup will execute it, the primary has no way to execute the request at exactly the same time.
So they must be a little bit out of sync, just because of the potential for network delay.
But, we can have the backup inform the primary when the request is finished executing, so that the primary can execute it and move on.
This puts a bound on how far out of sync the two replicas can get, and is a very common idea in distributed protocol design.
the primary and backup execute exactly the same sequence of requests in the same order
the Application is deterministic and has no global side-effects
("Deterministic" means that if you execute the same request from the same state, you always get the same state changes and response.
The key-value store Application is deterministic, and we will assume that all our Applications are deterministic in this class.)
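To see why determinism matters, here is a hypothetical, made-up Application that would break replication: its results depend on the local clock, so the primary and backup would diverge even though they execute the same commands in the same order. Our determinism assumption rules this kind of thing out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical NON-deterministic application: do not replicate something like this.
class BadTimestampApp {
    private final Map<String, String> state = new HashMap<>();

    String execute(String command) {
        if (command.startsWith("PUT_WITH_TIME ")) {
            String key = command.substring("PUT_WITH_TIME ".length());
            // System.currentTimeMillis() differs between the primary and the backup, so their
            // states diverge even though they execute the same command in the same order.
            String value = Long.toString(System.currentTimeMillis());
            state.put(key, value);
            return value;
        }
        return state.getOrDefault(command, "");
    }
}
```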
Node Failures
Remember that our goal was to improve our RPC system by tolerating (some) node failures. How are we doing so far?
Well, let's consider what happens if the primary fails.
That was the server that clients were submitting requests to. So
if that server is no longer available, we need to establish a new
server as primary and tell clients to go talk to that server.
(In our protocol, we will always select the previous backup as the new primary. This will turn out to be crucial for our protocol's correctness.)
If the backup fails, then the primary won't be able to replicate any more requests to the backup, so again clients stop making progress.
We need to either let the primary declare the backup dead and continue without replicating, or we need to replace the backup.
In either case, the basic idea is to failover to another server. ("Failover" basically means to choose a new server to play the role of the old server that failed.)
To do this, we need to first detect that the server has crashed, then establish a replacement, and inform everyone who needs to know to talk to the replacement instead.
Here is a straw proposal that does not work at all (do not implement this in lab 2!)
When the primary hasn't heard from the backup for a while, the primary can decide to replace it with a new backup.
Similarly, when the backup hasn't heard from the primary in a while, the backup can decide to become the new primary.
Why is this a bad idea? Suppose the network link between the primary and backup is down, so they cannot exchange messages.
Then neither of them will hear from the other, so they will both assume the other node has failed, even though actually both nodes are fine.
So the primary will pick a new backup and start accepting client requests again by replicating them to the new backup.
But the old backup will declare itself the new primary, select itself another backup, and also start accepting client requests.
This leads to a scenario where we have two completely separate primaries that are functioning independently and accepting client requests.
Their Applications will diverge, and the result is a system that does not at all behave like our old single RPC server.
In distributed systems, people often call this situation "split brain" because the system has essentially split into two independent systems
that don't know about each other (which is bad, because the system is supposed to present a unified view of the Application to the clients).
Remember that it's a fundamental property of distributed systems that you cannot distinguish node failure from network failure.
The symptoms of both failures are that you don't hear from the other node.
So it is impossible to accurately detect whether a node has crashed.
That's why it's a bad idea to allow different nodes to decide on their own that some other nodes are dead.
Instead, we need to find a way to tolerate failures that doesn't depend on accurately detecting whether those failures actually occurred or not.
The centralized view service
One way to avoid nodes disagreeing about whether other nodes have crashed is to have one node make all of those decisions on everyone's behalf.
This is the approach we will take in lab 2.
The idea is to introduce a separate node that we will call the view server or view service, that monitors all the other nodes in the system and decides
which ones of them have crashed.
Note that the view server can be wrong about whether a node is actually up or down, but it doesn't matter. Other nodes will always take the view server's opinion
about which nodes are up as if it was the truth.
Once the view server has an idea about which nodes are up, it can select a primary and a backup and inform everyone (including clients).
If a node goes down, the view server will detect it and orchestrate a failover to replace that node.
Notice that this solves the split brain issue above, because now all decisions about failover are centralized at the view server. Yay!
Unfortunately, this approach also has a major drawback, which is that if the view service crashes, the system will not be able to tell who is the primary and who is the backup,
and no further failovers will be possible.
This is unsatisfying! But the protocol is already quite interesting and complex enough that it makes for excellent practice in lab 2.
In a few weeks, we will study Paxos, which is a fully general replication solution that can tolerate many node failures.
It is quite a bit more subtle and complicated than this protocol,
so warming up with a view-service-based primary-backup protocol in lab 2 will prepare you well for lab 3.
It's ok to be a little disappointed in this protocol! We promise it gets more satisfying when we get to Paxos.
For lab 2 and the rest of our discussion of primary-backup, we will assume that the view server never crashes.
That said, notice that this design is already quite a bit better than our single RPC server in lab 1
With only one server, if that server crashed, no client could make progress.
But with primary-backup, we will be able to tolerate either of those nodes crashing, and clients will still be able to make progress after failover.
If the view server crashes, then we won't be able to failover, but we've still tolerated crashes in all but one of our nodes!
Recap so far
The basic setup is that there will be a primary server and a backup
server working together to serve client requests. Here is the "normal case" workflow.
Clients send requests to the primary, who forwards them to the backup.
The backup executes the request and sends an acknowledgement to the primary.
The primary then executes the request and responds to the client.
This basic workflow works fine as long as no (node) failures occur.
When either the primary or the backup fails, we need to do
"failover", meaning that we replace the failed server with another
available server.
An easy approach to failover is manual failover. The human operator of the
system can stop all incoming client requests, manually figure out which
server crashed, decide which server to replace it with, reconfigure the
system to use the new server, make sure the state of both servers is up to
date and in sync, and then allow incoming client requests to resume.
The downside of manual failover is that it is manual.
We started to discuss an automated failover solution that used a centralized
view service.
The view service will be a single node whose job is to decide what other
servers are up/down.
(Accurate failure detection is impossible in our fault model, so the view
service can and will be wrong about whether a server has crashed or not.
That turns out not to matter too much. What's most important is that, by
centralizing the fault detection in one node, we guarantee that nobody
will disagree about whether some node is up or down. The answer is
always just whatever the view service thinks, even if it is wrong.)
Extending the protocol with the view server
There are three kinds of nodes:
some number of clients
some number of servers
one view server
The clients want to submit Commands and get Results.
The servers are willing to host the Application and play the role of primary or backup.
The view service decides which servers are playing which role currently.
All servers send ping messages to the view server regularly
The view server keeps track of the set of servers who have pinged it recently and considers those servers to be up.
The view server selects a server from the set of "up" servers to serve as the primary and another to serve as the backup.
Later, if the primary or backup stops pinging the view server, then the view server will change its mind about who is primary and who is backup.
To support the view server changing its mind over time, we will introduce "views", whose name means "the view server's current view of who is primary and backup".
Views will have a version number called the view number that increments every time the view server wants to do a failover
When the view server increments the view number and selects a new primary and backup, this is called a "view change".
Since the primary and backup roles are played by different servers over time, a client can only learn who the current primary is by asking the view server.
Within a single view, there is at most one primary and at most one backup.
A view contains three pieces of information
The view number (new views have higher view numbers)
The node who will play the role of primary in this view
The node who will play the role of backup in this view
When a client wants to submit a request, they first ask the view server for the current view
Then they go talk to the primary of that view
Clients are allowed to cache the view as long as they want, and keep talking to that primary.
If they don't hear from the primary after several retries, then they can go ask the view server whether there is a new view.
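To make views concrete, here is a sketch of a view and a drastically simplified view server. All names and the timeout are invented, and the real lab 2 view service has additional rules (for example, about when it is safe to move on to yet another view) that are omitted here. Notice that a new primary is only ever chosen from the servers of the previous view: either the old primary is kept, or the old backup is promoted.

```java
import java.util.HashMap;
import java.util.Map;

// A view: a view number plus the servers playing primary and backup (backup may be null).
class View {
    final int viewNum;
    final String primary;
    final String backup;
    View(int viewNum, String primary, String backup) {
        this.viewNum = viewNum;
        this.primary = primary;
        this.backup = backup;
    }
}

// Drastically simplified view server sketch (invented names; omits several lab 2 rules).
class ViewServerSketch {
    static final long PING_TIMEOUT_MILLIS = 1000;        // purely illustrative timeout

    private final Map<String, Long> lastPing = new HashMap<>();
    private View currentView = new View(0, null, null);  // view 0: the "startup view"

    void onPing(String server, long now) {
        lastPing.put(server, now);
        if (currentView.primary == null) {
            // First server we ever hear from becomes primary of view 1, the "initial view".
            currentView = new View(1, server, null);
        }
    }

    boolean isUp(String server, long now) {
        Long t = lastPing.get(server);
        return t != null && now - t <= PING_TIMEOUT_MILLIS;
    }

    // Called periodically: if the primary or backup looks dead, move to a new view.
    void maybeViewChange(long now) {
        String primary = currentView.primary;
        String backup = currentView.backup;
        if (primary == null) return;
        if (!isUp(primary, now)) {
            // Primary looks dead: only the old backup (which has up-to-date state) may take over.
            if (backup != null && isUp(backup, now)) {
                currentView = new View(currentView.viewNum + 1, backup, pickIdleServer(now, backup));
            }
            // If there is no live backup, we are stuck: no server is allowed to become primary.
        } else if (backup == null || !isUp(backup, now)) {
            // Backup missing or dead: keep the primary and pick a new backup if one is available.
            String newBackup = pickIdleServer(now, primary);
            if (newBackup != null || backup != null) {
                currentView = new View(currentView.viewNum + 1, primary, newBackup);
            }
        }
    }

    // Any live server other than the one we want to exclude (the current/new primary).
    private String pickIdleServer(long now, String exclude) {
        for (String s : lastPing.keySet()) {
            if (!s.equals(exclude) && isUp(s, now)) return s;
        }
        return null;
    }

    View getView() { return currentView; }
}
```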
Overview of the full protocol
There are several different kinds of nodes in the system.
There is one view server that provides views to the rest of the system.
There are some number of clients that want to submit requests to the system.
There are some number of servers that are available to play the roles of
primary and backup (in a particular view) when asked to do so by the view
service.
To execute their requests, clients will follow the "normal case" workflow.
This requires that clients know who the primary is.
They will get this information by asking the view service.
Servers do what they are told by the view service. If a server \(S_1\) was
primary of view number 1, and then later it hears that the view service moved
to view number 2, then \(S_1\) stops doing what it was doing and moves into
the new view, taking whatever role is assigned to it by the view service.
\(S_1\) does this even if the reason the view service moved into view 2
was because the view service thought that \(S_1\) was down even though it
was not. \(S_1\) does not try to "correct" the view service, but just does
what it was told.
(It's more important to have a consistent source of information about what
servers are up or down than it is to have correct information about
that. (Although the information should be right most of the time in order
for the system to make progress.))
The view service tracks which servers are up and assigns them roles in each view.
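Here is a sketch (again with invented names, reusing the View class from the sketch above) of how a server might react when it learns about a new view: if the view number is higher than the one it currently knows, it adopts the new view and takes whatever role it was assigned, without arguing.

```java
// Illustrative sketch of a server obeying the view service (invented names and messages).
class ReplicaSketch {
    private final String myAddress;
    private View view = new View(0, null, null);

    ReplicaSketch(String myAddress) { this.myAddress = myAddress; }

    // Called whenever the view service tells us about a view (e.g., in a reply to our ping).
    void onViewAnnouncement(View newView) {
        if (newView.viewNum <= view.viewNum) {
            return;  // old news: ignore views we have already moved past
        }
        view = newView;  // adopt the new view unconditionally, even if we think we are fine
        if (myAddress.equals(view.primary)) {
            // We are primary in this view: if the backup is new, start a state transfer to it.
        } else if (myAddress.equals(view.backup)) {
            // We are backup: wait for state transfer / forwarded requests from the primary.
        } else {
            // We have no role in this view: idle until the view service assigns us one.
        }
    }
}
```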
Scenarios
Here are a few scenarios that can happen during the lifetime of the system.
At the beginning of time, the view service has received no pings yet, so it
does not know if any servers are up. So there is no primary and no backup yet.
In the labs, this is view number 0, called the "startup view".
It is not a fully functional view: clients cannot execute requests because
there is no primary yet.
Every view numbered larger than 0 will have a primary.
As soon as the view service hears one ping, it will select that server as the
primary of view number 1, called the "initial view" in the labs.
There is no backup in view number 1.
In general, the view service is allowed to create views that have a
primary but no backup, if it does not have enough available servers to
find a backup.
When operating without a backup, the primary acts essentially like the RPC
server from lab 1. It just executes any client requests it receives and
responds to them.
In this scenario, if the primary fails, the system is stuck forever.
Later, after fully setting up view number 1, the view service will hopefully
have heard pings from at least one other server. It can then select one of
those servers to be backup.
Since the view server is updating who is backup, it must create a new view
by incrementing the view number again. It should not "update" the backup
for the already created view number 1.
In a typical view when everything is working well, there will be a primary and
a backup.
Clients can learn who the primary is by asking the view service.
Clients then submit their requests to that primary, who follows the
"normal case" workflow to execute those requests.
Things go on like this until some kind of failure occurs.
Failure scenarios
The backup fails
Suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\) ,
and then \(S_2\) fails.
Clients can no longer successfully execute requests, because the primary needs
to talk to the backup before it executes each request, but the backup is down.
The view service will detect that the backup is down because it stops pinging.
The view service initiates a view change by incrementing the view number to
4 and selecting a new primary and backup.
Since the primary did not fail, the view service leaves the primary as is,
so the primary for view number 4 is also \(S_1\).
The view service selects a new backup server from the set of available
servers. Let's say it selects \(S_3\), so \(S_3\) is the backup for view
number 4.
We want to start executing client requests again. To do that, we need to be
able to execute them both on the backup and on the primary, and we need to
know that those two applications are in sync with each other.
It's important that they are in sync so that if another failure occurs,
say the primary fails later, the backup can take over and return correct
answers.
The problem is that \(S_3\) was neither primary nor backup in the previous
view, so it does not have an up-to-date copy of the application. So it is
not ready to begin executing client requests.
In order to prepare the new backup, the primary needs to do a state
transfer, which involves sending a copy of the entire current application
state to the backup.
The backup replaces whatever application state it had with what the primary
sends it, and then acknowledges back to the primary that it has received the
state.
Now the primary can consider the view to be really started, and start
executing client requests.
Before the state transfer has been acknowledged, the primary cannot accept
any client requests.
Once the state is transferred to the backup, the protocol is fault tolerant again.
We need to tell the view service that this has happened! Otherwise, the
view service will not know that it is ok to do another failover if it
detects the new primary has crashed.
If, on the other hand, the new primary crashed before finishing the state
transfer, then the system is stuck. It cannot tolerate two node failures
that happen so soon one after the other that we didn't have a chance to
complete the state transfer.
We also need to tolerate message drops during the state transfer process.
The primary needs to retransmit the state transfer message until it gets
an acknowledgement.
We also need to tolerate message dups and delays during the state transfer process.
The backup should not perform a state transfer for an old view, nor for
the current view if it already has done a state transfer.
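Here is a sketch of the state-transfer exchange, with invented message and field names, reusing the View sketch from earlier. The points to notice: the primary keeps retransmitting the state until it is acknowledged and refuses client requests until then, and the backup uses view numbers to ignore transfers for old views and to re-acknowledge duplicates in case its earlier acknowledgement was dropped.

```java
// Illustrative state-transfer sketch (invented names; assumes the Application state can be
// copied wholesale, e.g., the key-value store's Map).
class StateTransferSketch {

    // --- Primary side ---
    static class Primary {
        View view;                    // current view (we are primary, view.backup is the new backup)
        Object applicationState;      // the full Application state to ship to the backup
        boolean backupAcked = false;  // until true, client requests must be rejected or buffered

        // Called on a timer: retransmit the state until the backup acknowledges it.
        void maybeResendStateTransfer() {
            if (view.backup != null && !backupAcked) {
                sendToBackup(view.viewNum, applicationState);
            }
        }

        void onStateTransferAck(int ackedViewNum) {
            if (ackedViewNum == view.viewNum) {
                backupAcked = true;   // safe to resume executing client requests
            }
        }

        boolean canServeClients() { return view.backup == null || backupAcked; }

        void sendToBackup(int viewNum, Object state) { /* network send, omitted */ }
    }

    // --- Backup side ---
    static class Backup {
        int installedForView = -1;    // highest view number for which we installed transferred state
        Object applicationState;

        void onStateTransfer(int viewNum, Object state, int currentViewNum) {
            if (viewNum < currentViewNum) {
                return;                    // transfer for an old view: ignore it
            }
            if (viewNum == installedForView) {
                ackToPrimary(viewNum);     // duplicate: our earlier ack may have been dropped
                return;
            }
            applicationState = state;      // replace whatever state we had, wholesale
            installedForView = viewNum;
            ackToPrimary(viewNum);
        }

        void ackToPrimary(int viewNum) { /* network send, omitted */ }
    }
}
```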
The primary fails
Consider a similar situation to the previous section.
Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\),
and then \(S_1\) fails.
Clients can no longer successfully execute requests, because the primary is down.
The view service will detect that the primary is down because it stops pinging.
The view service initiates a view change by incrementing the view number to
4 and selecting a new primary and backup.
We know that the first thing the new primary is going to do is state transfer.
Therefore, in order not to lose committed data, the only reasonable choice
for the new primary is the old backup, since the old backup is the only
available server with an up-to-date copy of the application.
The key constraint is: if a client has received a response to a request in
the past, then we must guarantee that the new primary has an application
state that reflects the results of that request.
Again, the only way to satisfy this constraint is by selecting the old
backup as the new primary.
So the view service will select \(S_2\) as the new primary, and any other
available server, say \(S_3\) as the new backup.
The view service informs servers about the new view. The new primary will
initiate a state transfer as before. Once the state transfer is complete,
client requests can resume being executed.
The situation with no available servers
Now suppose we are in view number 3 with a primary \(S_1\) and a backup
\(S_2\), and that there are no other servers available. (There might be other
servers, but they already crashed or for some reason they cannot ping the view
service at the moment, so they are not available to be primary or backup.)
If either server fails, there is no server to replace it. What should we do?
One option would be to do nothing. Just refuse to execute client requests.
At least this doesn't return wrong answers! But it also doesn't make progress.
A slightly better approach is to move to a view where the one remaining
available server acts as primary on its own.
In this situation, we cannot tolerate any more crashes, but at least we
can make progress after the previous crash.
If we get lucky, and some other server that the view service thought was
down actually was ok and starts pinging again, then we can view change to
yet another view with a primary and a backup, restoring fault tolerance.
If the primary crashes during a view without a backup, then the system is
stuck, even if some other servers later become available.
We cannot use those nodes because they do not satisfy the constraint
that their application state reflects the results of all requests that
clients have received responses for.
The only way out of this situation is if the most recent primary
somehow comes back. The view service can then change to a view using
one of the new servers as the backup.
If the primary does not have a backup, then it executes client requests by
acting alone, similar to the RPC server from lab 1. It does not forward the
requests anywhere, but instead immediately executes them and responds to the
client.
Lab 2
We have presented an overview of the kind of primary-backup protocol we want
you to design in lab 2.
We have intentionally left out many details! These are for you to design and
fill in.
There are many different ways to get a correct protocol. There is not just
one right answer.
The act of taking high level ideas and turning them into concrete distributed
protocols that can be implemented, evaluated, and deployed is a core skill
that we hope you develop in this course!
Do not be surprised if the design process is harder than the implementation
process in labs 2 through 4!
Consistency Models and Linearizability
Agenda
Answer the question "What does it mean for a state-machine replication algorithm to be correct?"
Correctness of state-machine replication
It's actually sort of surprising that we haven't had to answer this question
yet!
Our hand-wavy answer so far has been that the replicas are "equivalent" to one
machine from the clients' perspective.
In other words, clients can pretend that there's just one copy of the
Application, that just happens to be very fault tolerant.
- There are some (mostly irrelevant) details here because clients have
to send their operations to the primary, which might change over time,
so they have to be aware that there are multiple servers. But this detail
is hidden by the client library that we, the authors of the primary-backup system, write.
- So the application-layer's API to the state machine is actually the same as
in lab 1: sendCommand()/getResult().
If we look at state-machine replication at a high level, we have:
an opaque box in charge of the replication
clients send requests into the box and wait to get responses back
there can be multiple clients interacting with the system at once
The case of one client submitting one request at a time
If we think about the simplest case where we only have one client, and they
only send one request at a time, then the system should evolve "linearly":
the client keeps retransmitting its current request until it gets a response
then it sends the next request, and so on
So in the situation with one client, the system is "equivalent" to executing
the client's commands in order, starting from the initial state of the state
machine/Application.
Since the Application is (assumed to be) deterministic, there is only one
right answer for a particular sequence of commands, so this tells us whether
the system was correct or not (did it return the right answers?)
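In fact, for the single-client case, correctness can be checked mechanically: replay the commands, in order, against a fresh copy of the (deterministic) Application and compare with the results the system actually returned. Here is a rough sketch, reusing the SimpleKVStore from earlier.

```java
import java.util.List;

// Sketch of a correctness check for one client submitting requests one at a time:
// the observed results must match executing the commands in order on a fresh Application.
class SingleClientChecker {
    static boolean isCorrect(List<String> commands, List<String> observedResults) {
        SimpleKVStore reference = new SimpleKVStore();   // fresh copy of the deterministic app
        for (int i = 0; i < commands.size(); i++) {
            String expected = reference.execute(commands.get(i));
            if (!expected.equals(observedResults.get(i))) {
                return false;   // the system returned something a single copy never would
            }
        }
        return true;
    }
}
```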
The case of two clients each submitting one request
Now consider the case where there are two clients, and they each submit a request.
These requests can arrive at servers in the system in either order.
Do we need to try to reorder them somehow?
Intuitively, no. Either order is ok.
The reason is that the clients are prepared to handle delays in the
network, so they cannot possibly be expecting one to definitely arrive
first. So we are free to execute them in either order.
Consistency models
The above examples show us the flavor of a consistency model.
Given an execution of a distributed system, a consistency model says
whether or not it is "correct".
Different consistency models have different definitions of correct.
For this discussion, we are going to consider two forms of execution
The bird's-eye space-time diagram model
The request-response-execution model
Bird's-eye space-time diagrams
We have seen a few of these before.
Here we'll use a simplified version of this kind of diagram, where we only
draw clients, not servers.
Instead of showing requests going into the state-machine replication box and
responses coming out, we will abbreviate this using the "regions of time" notation.
A region begins when the client first sends the request to the system. Labeled with the request.
A region ends when the client first receives the response to the current
request (clients only have one outstanding request). Labeled with the response.
We represent the timing information visually in the picture, but you could
equivalently think of it as labeling the beginning and end of each region with
a bird's-eye timestamp.
Request-response execution model
A list of all client requests and their responses, in the order the events occurred.
So a consistency model takes one of these executions (a space-time diagram or a
request-response list) as input and tells you either "yes, that is allowed" or "no, that is not allowed".
Sequential consistency
This is a consistency model with this definition:
there is a global order on requests;
responses are computed as if requests executed in that global order; and,
the global order agrees with each client's local order
Linearizability
This is a consistency model with this definition:
Everything from sequential consistency; and
If request r2 is submitted after response to r1, then r1 appears
before r2 in the global order.
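To make these two definitions concrete, here is a sketch that checks a single candidate global order against both of them. Each operation records its client, its position in that client's local order, its bird's-eye invocation and response times, and the response the client observed. (A real checker has to search over all candidate global orders, which is expensive; this only checks one given order, and it reuses the SimpleKVStore sketch from earlier as the reference Application.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One completed client operation, as read off a space-time diagram.
class Operation {
    final String clientId;
    final int clientSeq;        // this client's local order (0, 1, 2, ...)
    final long invokeTime;      // bird's-eye time the request was first sent
    final long responseTime;    // bird's-eye time the response was received
    final String command;
    final String observedResult;

    Operation(String clientId, int clientSeq, long invokeTime, long responseTime,
              String command, String observedResult) {
        this.clientId = clientId; this.clientSeq = clientSeq;
        this.invokeTime = invokeTime; this.responseTime = responseTime;
        this.command = command; this.observedResult = observedResult;
    }
}

// Checks one candidate global order against the two consistency models above.
class ConsistencyCheckSketch {
    // Sequential consistency: the observed results match executing the requests in this
    // global order on one copy of the app, and the order respects each client's local order.
    static boolean isSequentiallyConsistent(List<Operation> globalOrder) {
        SimpleKVStore reference = new SimpleKVStore();   // fresh deterministic Application
        Map<String, Integer> nextSeqPerClient = new HashMap<>();
        for (Operation op : globalOrder) {
            int expectedSeq = nextSeqPerClient.getOrDefault(op.clientId, 0);
            if (op.clientSeq != expectedSeq) return false;   // violates the client's local order
            nextSeqPerClient.put(op.clientId, expectedSeq + 1);
            if (!reference.execute(op.command).equals(op.observedResult)) return false;
        }
        return true;
    }

    // Linearizability: sequential consistency plus the real-time constraint that if r1's
    // response preceded r2's invocation, then r1 comes before r2 in the global order.
    static boolean isLinearizable(List<Operation> globalOrder) {
        if (!isSequentiallyConsistent(globalOrder)) return false;
        for (int i = 0; i < globalOrder.size(); i++) {
            for (int j = i + 1; j < globalOrder.size(); j++) {
                // The candidate order puts operation i before operation j; that is illegal
                // if operation j finished before operation i was even invoked.
                if (globalOrder.get(j).responseTime < globalOrder.get(i).invokeTime) return false;
            }
        }
        return true;
    }
}
```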