Lecture 7: More Primary-backup
- Finish our discussion of primary-backup replication
Recap from last time
- We started discussing primary-backup replication.
- The basic setup is that there will be a primary server and a backup
server working together to serve client requests. Here is the "normal case" workflow.
- Clients send requests to the primary, who forwards them to the backup.
- The backup executes the request and sends an acknowledgement to the primary.
- The primary then executes the request and responds to the client.
- This basic workflow works fine as long as no (node) failures occur.
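The normal-case workflow above can be sketched as follows. This is a minimal, single-process sketch, not the lab code: the class names, the tuple-based request format, and the toy key/value application are all made up for illustration.

```python
class KVApp:
    """A toy key/value application that both servers host a copy of."""
    def __init__(self):
        self.store = {}

    def execute(self, request):
        op, key, *rest = request
        if op == "put":
            self.store[key] = rest[0]
            return "ok"
        return self.store.get(key)

class Backup:
    def __init__(self, app):
        self.app = app

    def handle_forward(self, request):
        # Execute the forwarded request, then acknowledge to the primary.
        self.app.execute(request)
        return "ack"

class Primary:
    def __init__(self, app, backup):
        self.app = app
        self.backup = backup

    def handle_client_request(self, request):
        # Forward to the backup first; only execute locally and reply to
        # the client once the backup has acknowledged.
        ack = self.backup.handle_forward(request)
        assert ack == "ack"
        return self.app.execute(request)
```

Note the ordering: the backup executes (and acks) before the primary executes and replies, exactly as in the bullets above. In the real protocol the forward and the ack are messages over the network, so they can be dropped or delayed; this sketch ignores that.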
- When either the primary or the backup fails, we need to do
"failover", meaning that we replace the failed server with another
- An easy approach to failover is manual failover. The human operator of the
system can stop all incoming client requests. Manually figure out which
server crashed, decide what server to replace it with, reconfigure the
system to use the new server, make sure the state of both servers is up to
date and in sync, and then allow incoming client requests to resume
- The downside of manual failover is that it is manual.
- We started to discuss an automated failover solution that uses a centralized view service.
- The view service will be a single node whose job is to decide what other
servers are up/down.
- (Accurate failure detection is impossible in our fault model, so the view
service can and will be wrong about whether a server has crashed or not.
That turns out not to matter too much. What's most important is that, by
centralizing the fault detection in one node, we guarantee that nobody
will disagree about whether some node is up or down. The answer is
always just whatever the view service thinks, even if it is wrong.)
Extending the protocol with the view server
- There are three kinds of nodes:
- some number of clients
- some number of servers
- one view server
- The clients want to submit Commands and get results back.
- The servers are willing to host the
Application and play the role of primary or backup.
- The view service decides which servers are playing which role currently.
- All servers send ping messages to the view server regularly
- The view server keeps track of the set of servers who have pinged it recently and considers those servers to be up.
- The view server selects a server from the set of "up" servers to serve as the primary and another to serve as the backup.
- Later, if the primary or backup stops pinging the view server, then the view server will change its mind about who is primary and who is backup.
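The view server's ping bookkeeping can be sketched like this. The timeout value and names are illustrative assumptions, not from the labs:

```python
PING_TIMEOUT = 2.0  # seconds without a ping before a server is considered down (made-up value)

class PingTracker:
    def __init__(self):
        self.last_ping = {}  # server id -> time of the most recent ping received

    def ping(self, server, now):
        self.last_ping[server] = now

    def up_servers(self, now):
        # A server is "up" if it pinged within the last PING_TIMEOUT seconds.
        # This can be wrong (a slow-but-alive server looks down), which is
        # exactly why the view server's opinion, not the truth, is what counts.
        return {s for s, t in self.last_ping.items() if now - t <= PING_TIMEOUT}
```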
- To support the view server changing its mind over time, we will introduce "views", whose name means "the view server's current view of who is primary and backup".
- Views will have a version number called the view number that increments every time the view server wants to do a failover
- When the view server increments the view number and selects a new primary and backup, this is called a "view change".
- Since the primary and backup roles are played by different servers over time, clients can only learn who the current primary is by asking the view server.
- Within a single view, there is at most one primary and at most one backup.
- A view contains three pieces of information
- The view number (new views have higher view numbers)
- The node who will play the role of primary in this view
- The node who will play the role of backup in this view
- When a client wants to submit a request, they first ask the view server for the current view
- Then they go talk to the primary of that view
- Clients are allowed to cache the view as long as they want, and keep talking to that primary.
- If they don't hear from the primary after several retries, then they can go ask the view server whether there is a new view.
Overview of the full protocol
- There are several different kinds of nodes in the system.
- There is one view server that provides views to the rest of the system.
- There are some number of clients that want to submit requests to the system.
- There are some number of servers that are available to play the roles of
  primary and backup (in a particular view) when asked to do so by the view
  service.
- To execute their requests, clients will follow the "normal case" workflow.
- This requires that clients know who the primary is.
- They will get this information by asking the view service.
- Servers do what they are told by the view service. If a server \(S_1\) was
primary of view number 1, and then later it hears that the view service moved
to view number 2, then \(S_1\) stops doing what it was doing and moves into
the new view, taking whatever role is assigned to it by the view service.
- \(S_1\) does this even if the reason the view service moved into view 2
was because the view service thought that \(S_1\) was down even though it
was not. \(S_1\) does not try to "correct" the view service, but just does
what it was told.
- (It's more important to have a consistent source of information about what
servers are up or down than it is to have correct information about
that. (Although the information should be right most of the time in order
for the system to make progress.))
- The view service tracks which servers are up and assigns them roles in each view.
Here are a few scenarios that can happen during the lifetime of the system.
- At the beginning of time, the view service has received no pings yet, so it
does not know if any servers are up. So there is no primary and no backup yet.
- In the labs, this is view number 0, called the "startup view".
- It is not a fully functional view: clients cannot execute requests because
there is no primary yet.
- Every view numbered larger than 0 will have a primary.
- As soon as the view service hears one ping, it will select that server as the
primary of view number 1, called the "initial view" in the labs.
- There is no backup in view number 1.
- In general, the view service is allowed to create views that have a
primary but no backup, if it does not have enough available servers to
find a backup.
- When operating without a backup, the primary acts essentially like the RPC
server from lab 1. It just executes any client requests it receives and
responds to them.
- In this scenario, if the primary fails, the system is stuck forever.
- Later, after fully setting up view number 1, the view service will hopefully
have heard pings from at least one other server. It can then select one of
those servers to be backup.
- Since the view server is updating who is backup, it must create a new view
by incrementing the view number again. It should not "update" the backup
for the already created view number 1.
- In a typical view, when everything is working well, there will be a primary and a backup.
- Clients can learn who the primary is by asking the view service.
- Clients then submit their requests to that primary, who follows the
"normal case" workflow to execute those requests.
- Things go on like this until some kind of failure occurs.
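The startup scenarios above (view 0 with nobody, view 1 with only a primary, then a later view that adds a backup) can be sketched with a minimal view server. Views are `(view_num, primary, backup)` tuples here; ping timeouts are omitted, and tie-breaking by `min()` is an arbitrary choice of this sketch:

```python
STARTUP = (0, None, None)  # view number 0: no pings heard yet, so no primary

class ViewServer:
    def __init__(self):
        self.view = STARTUP
        self.up = set()  # servers we have heard pings from (liveness timeouts omitted)

    def ping(self, server):
        self.up.add(server)
        self.maybe_change_view()

    def maybe_change_view(self):
        num, primary, backup = self.view
        if primary is None and self.up:
            # Initial view: the first server to ping becomes primary, no backup yet.
            self.view = (num + 1, min(self.up), None)
        elif backup is None:
            candidates = self.up - {primary}
            if candidates:
                # Adding a backup means changing who plays which role, so it
                # requires a new view with a higher view number -- we never
                # mutate an already-created view in place.
                self.view = (num + 1, primary, min(candidates))
```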
The backup fails
- Suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\),
and then \(S_2\) fails.
- Clients can no longer successfully execute requests, because the primary needs
to talk to the backup before it executes each request, but the backup is down.
- The view service will detect that the backup is down because it stops pinging.
- The view service initiates a view change by incrementing the view number to
4 and selecting a new primary and backup.
- Since the primary did not fail, the view service leaves the primary as is,
so the primary for view number 4 is also \(S_1\).
- The view service selects a new backup server from the set of available
servers. Let's say it selects \(S_3\), so \(S_3\) is the backup for view number 4.
- We want to start executing client requests again. To do that, we need to be
able to execute them both on the backup and on the primary, and we need to
know that those two applications are in sync with each other.
- It's important that they are in sync so that if another failure occurs,
say the primary fails later, the backup can take over and return correct results.
- The problem is that \(S_3\) was neither primary nor backup in the previous
view, so it does not have an up-to-date copy of the application. So it is
not ready to begin executing client requests.
- In order to prepare the new backup, the primary needs to do a state
transfer, which involves sending a copy of the entire current application
state to the backup.
- The backup replaces whatever application state it had with what the primary
sends it, and then acknowledges back to the primary that it has received the state.
- Now the primary can consider the view to be really started, and start
executing client requests.
- Before the state transfer has been acknowledged, the primary cannot accept
any client requests.
- Once the state is transferred to the backup, the protocol is fault tolerant again.
- We need to tell the view service that this has happened! Otherwise, the
view service will not know that it is ok to do another failover if it
detects the new primary has crashed.
- If, on the other hand, the new primary crashed before finishing the state
transfer, then the system is stuck. It cannot tolerate two node failures
that happen so soon one after the other that we didn't have a chance to
complete the state transfer.
- We also need to tolerate message drops during the state transfer process.
- The primary needs to retransmit the state transfer message until it gets an acknowledgement from the backup.
- We also need to tolerate message dups and delays during the state transfer process.
- The backup should not perform a state transfer for an old view, nor for
the current view if it already has done a state transfer.
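The backup's handling of a (possibly dropped, duplicated, or delayed) state-transfer message can be sketched as below. The method names and the ack format are assumptions of this sketch; the important part is the view-number checks:

```python
class BackupServer:
    def __init__(self):
        self.view_num = 0
        self.app_state = None
        self.transferred = False  # has a state transfer completed for the current view?

    def enter_view(self, view_num):
        if view_num > self.view_num:
            self.view_num = view_num
            self.transferred = False

    def handle_state_transfer(self, view_num, app_state):
        if view_num < self.view_num:
            return None  # stale message from an old view: drop it
        if view_num == self.view_num and self.transferred:
            # Duplicate for the current view: re-ack (our earlier ack may have
            # been dropped), but do not overwrite state we may have modified since.
            return ("ack", view_num)
        # First transfer for this view: replace the local state wholesale.
        self.enter_view(view_num)
        self.app_state = app_state
        self.transferred = True
        return ("ack", view_num)
```

Re-acking duplicates (rather than silently ignoring them) is what lets the primary stop retransmitting even when its first ack was lost.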
The primary fails
Consider a similar situation to the previous section.
- Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\),
and then \(S_1\) fails.
- Clients can no longer successfully execute requests, because the primary is down.
- The view service will detect that the primary is down because it stops pinging.
- The view service initiates a view change by incrementing the view number to
4 and selecting a new primary and backup.
- We know that the first thing the new primary is going to do is state transfer.
- Therefore, in order not to lose committed data, the only reasonable choice
for the new primary is the old backup, since the old backup is the only
available server with an up-to-date copy of the application.
- The key constraint is: if a client has received a response to a request in
the past, then we must guarantee that the new primary has an application
state that reflects the results of that request.
- Again, the only way to satisfy this constraint is by selecting the old
backup as the new primary.
- So the view service will select \(S_2\) as the new primary, and any other
available server, say \(S_3\), as the new backup.
- The view service informs servers about the new view. The new primary will
initiate a state transfer as before. Once the state transfer is complete,
client requests can resume being executed.
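The constraint on choosing the new primary can be stated as a small function. Views are `(view_num, primary, backup)` tuples, and the tie-breaking for the spare backup is an arbitrary choice of this sketch:

```python
def next_view(old_view, up_servers):
    """Pick the next view after the primary of old_view fails.

    The key constraint from lecture: only the old backup has an application
    state reflecting every request a client has gotten a response to, so the
    old backup MUST become the new primary.
    """
    num, primary, backup = old_view
    assert backup is not None and backup in up_servers, \
        "no up-to-date server available: the system is stuck"
    spare = sorted(up_servers - {backup})
    new_backup = spare[0] if spare else None  # a view may have no backup
    return (num + 1, backup, new_backup)
```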
The situation with no available servers
- Now suppose we are in view number 3 with a primary \(S_1\) and a backup
\(S_2\), and that there are no other servers available. (There might be other
servers, but they already crashed or for some reason they cannot ping the view
service at the moment, so they are not available to be primary or backup.)
- If either server fails, there is no server to replace it. What should we do?
- One option would be to do nothing. Just refuse to execute client requests.
- At least this doesn't return wrong answers! But it also doesn't make progress.
- A slightly better approach is to move to a view where the one remaining
available server acts as primary on its own.
- In this situation, we cannot tolerate any more crashes, but at least we
can make progress after the previous crash.
- If we get lucky, and some other server that the view service thought was
down actually was ok and starts pinging again, then we can view change to
yet another view with a primary and a backup, restoring fault tolerance.
- If the primary crashes during a view without a backup, then the system is
stuck forever, even if some other servers later become available.
- We cannot use those nodes because they do not satisfy the constraint
that their application state reflects the results of all requests that
clients have received responses for.
- The only way out of this situation is if the most recent primary
somehow comes back. The view service can then change to a view using
one of the new servers as the backup.
- If the primary does not have a backup, then it executes client requests by
acting alone, similar to the RPC server from lab 1. It does not forward the
requests anywhere, but instead immediately executes them and responds to the client.
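A primary with no backup degenerates to a plain request/response server, which a short sketch makes concrete (the toy counter application and names are made up for illustration):

```python
class CounterApp:
    """A toy application: a single counter."""
    def __init__(self):
        self.count = 0

    def execute(self, request):
        if request == "inc":
            self.count += 1
        return self.count

class SoloPrimary:
    """Primary in a view with no backup: like the RPC server from lab 1.

    There is nobody to forward to, so it executes each request immediately
    and replies -- but a crash now loses the application state forever.
    """
    def __init__(self, app):
        self.app = app

    def handle_client_request(self, request):
        return self.app.execute(request)
```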
- We have presented an overview of the kind of primary-backup protocol we want
you to design in lab 2.
- We have intentionally left out many details! These are for you to design and implement.
- There are many different ways to get a correct protocol. There is not just
one right answer.
- The act of taking high level ideas and turning them into concrete distributed
protocols that can be implemented, evaluated, and deployed is a core skill
that we hope you develop in this course!
- Do not be surprised if the design process is harder than the implementation
process in labs 2 through 4!
Design doc for lab 2
- You will write a design doc (with your partner, if applicable) for lab 2 based
on the high-level description from lecture.
- Follow the template.
- We have made a few minor changes to the template to help with common
errors based on the lab 1 design docs. Be sure to check out the new version.
- You can submit your design doc in any readable format as a PDF on gradescope.
- Handwritten is fine, as is plain-text ASCII, a Word doc, \(\LaTeX\), or anything else.
- Be sure the template's sections appear explicitly, with exactly those titles in large font, in your doc.
- Besides those section titles, you do not have to follow the rest of
the template exactly, but you must include the spirit of all the
information requested in the template in some format.
- Feel free to include diagrams or informal discussion wherever you see fit.
- Your design doc should fill in all missing details that are "distributed
  systems" details.
- You should be specific about state, messages, and timers. We were not specific
  about these in lecture.
- Try to avoid low-level Java-specific details where possible. Your document
should make sense in any programming language.
- New! You should document the invariants and stable properties of your protocol
in the analysis section.
- Be sure to incorporate any relevant feedback you got on your lab 1 design doc.
- For lab 2, you can omit the subsection on Performance from the Analysis section.