Lecture 7: More Primary-backup

Agenda

  • Finish our discussion of primary-backup replication

Recap from last time

  • We started discussing primary-backup replication.
  • The basic setup is that there will be a primary server and a backup server working together to serve client requests. Here is the "normal case" workflow.
    • Clients send requests to the primary, who forwards them to the backup.
    • The backup executes the request and sends an acknowledgement to the primary.
    • The primary then executes the request and responds to the client. (This flow is sketched in code at the end of this recap.)
  • This basic protocol works fine as long as no (node) failures occur.
  • When either the primary or the backup fails, we need to do "failover", meaning that we replace the failed server with another available server.
    • An easy approach to failover is manual failover: a human operator stops all incoming client requests, figures out which server crashed, decides which server to replace it with, reconfigures the system to use the new server, makes sure the state of both servers is up to date and in sync, and then allows incoming client requests to resume.
    • The downside of manual failover is that it is manual: recovery is slow, error prone, and requires a human to be on call.
  • We started to discuss an automated failover solution that used a centralized view service.
    • The view service will be a single node whose job is to decide what other servers are up/down.
    • (Accurate failure detection is impossible in our fault model, so the view service can and will be wrong about whether a server has crashed or not. That turns out not to matter too much. What's most important is that, by centralizing the fault detection in one node, we guarantee that nobody will disagree about whether some node is up or down. The answer is always just whatever the view service thinks, even if it is wrong.)
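
To make this concrete, here is a minimal sketch in Java of the normal-case flow on the primary. All of the names here (Application, onClientRequest, send, and so on) are hypothetical stand-ins rather than the lab framework's actual API, and a real implementation would also need to handle retries and duplicate requests.

    // Hypothetical sketch of the normal-case workflow on the primary.
    interface Application { String execute(String command); }

    class Primary {
        private final Application app;       // the replicated application
        private final String backupAddress;  // current backup's address

        Primary(Application app, String backupAddress) {
            this.app = app;
            this.backupAddress = backupAddress;
        }

        // Step 1: a client request arrives; forward it to the backup first.
        void onClientRequest(String command, String clientAddress) {
            send(backupAddress, "FORWARD " + command + " " + clientAddress);
        }

        // Step 2: the backup executed the command and acknowledged it;
        // only now does the primary execute and reply to the client.
        void onBackupAck(String command, String clientAddress) {
            String result = app.execute(command);
            send(clientAddress, "REPLY " + result);
        }

        private void send(String to, String message) {
            // network send, elided in this sketch
        }
    }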

Extending the protocol with the view server

  • There are three kinds of nodes:
    • some number of clients
    • some number of servers
    • one view server
  • The clients want to submit Commands and get Results.
  • The servers are willing to host the Application and play the role of primary or backup.
  • The view service decides which servers are playing which role currently.

  • All servers send ping messages to the view server regularly
  • The view server keeps track of the set of servers who have pinged it recently and considers those servers to be up.
  • The view server selects a server from the set of "up" servers to serve as the primary and another to serve as the backup.
  • Later, if the primary or backup stops pinging the view server, then the view server will change its mind about who is primary and who is backup.
  • To support the view server changing its mind over time, we will introduce "views", so named because a view records the view server's current view of who is primary and who is backup.
    • Views will have a version number called the view number that increments every time the view server wants to do a failover.
    • When the view server increments the view number and selects a new primary and backup, this is called a "view change".
  • Since the primary and backup roles are played by different servers over time, a client can only learn who the current primary is by asking the view server.
  • Within a single view, there is at most one primary and at most one backup.

  • A view contains three pieces of information (one possible representation is sketched in code after this list):
    • The view number (new views have higher view numbers)
    • The node who will play the role of primary in this view
    • The node who will play the role of backup in this view
  • When a client wants to submit a request, they first ask the view server for the current view
    • Then they go talk to the primary of that view
    • Clients are allowed to cache the view as long as they want, and keep talking to that primary.
      • If they don't hear from the primary after several retries, then they can go ask the view server whether there is a new view.
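
One natural way to represent a view is as a small immutable record, with each client caching the most recent view it has seen. This sketch is only illustrative; the field names and the client's bookkeeping are assumptions, and your design may differ.

    // One possible representation of a view (field names are illustrative).
    // Higher view numbers are newer.
    record View(int viewNum, String primary, String backup) {
        boolean hasBackup() { return backup != null; }
    }

    // A client caches the last view it heard about and keeps using its primary.
    class Client {
        private View cached;  // may be stale; that is allowed

        void submit(String command) {
            if (cached == null) {
                cached = askViewServer();  // get-view request, elided
            }
            send(cached.primary(), command);
            // If the primary stops answering after several retries, clear the
            // cache and ask the view server for the current view.
        }

        private View askViewServer() { /* elided */ return null; }
        private void send(String primary, String command) { /* elided */ }
    }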

Overview of the full protocol

  • There are several different kinds of nodes in the system.
    • There is one view server that provides views to the rest of the system.
    • There are some number of clients that want to submit requests to the system.
    • There are some number of servers that are available to play the roles of primary and backup (in a particular view) when asked to do so by the view service.
  • To execute their requests, clients will follow the "normal case" workflow.
    • This requires that clients know who the primary is.
    • They will get this information by asking the view service.
  • Servers do what they are told by the view service. If a server \(S_1\) was primary of view number 1, and then later it hears that the view service moved to view number 2, then \(S_1\) stops doing what it was doing and moves into the new view, taking whatever role is assigned to it by the view service. (This behavior is sketched in code after this list.)
    • \(S_1\) does this even if the reason the view service moved into view 2 was because the view service thought that \(S_1\) was down even though it was not. \(S_1\) does not try to "correct" the view service, but just does what it was told.
    • (It's more important to have a consistent source of information about which servers are up or down than it is to have correct information about that. That said, the information should be right most of the time in order for the system to make progress.)
  • The view service tracks which servers are up and assigns them roles in each view.
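
Using the hypothetical View record from the earlier sketch, a server's reaction to hearing about a newer view might look like the following. The point is that the server adopts the new view unconditionally.

    class Server {
        private View current = new View(0, null, null);  // startup view

        void onViewAnnouncement(View v) {
            if (v.viewNum() <= current.viewNum()) {
                return;  // stale or duplicate announcement: ignore it
            }
            // Adopt the new view without arguing, even if the view service
            // wrongly believed this server was down...
            current = v;
            // ...then take whatever role the new view assigns this server:
            // primary, backup, or neither.
        }
    }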

Scenarios

Here are a few scenarios that can happen during the lifetime of the system.

  • At the beginning of time, the view service has received no pings yet, so it does not know if any servers are up. So there is no primary and no backup yet.
    • In the labs, this is view number 0, called the "startup view".
    • It is not a fully functional view: clients cannot execute requests because there is no primary yet.
    • Every view numbered larger than 0 will have a primary.
  • As soon as the view service hears one ping, it will select that server as the primary of view number 1, called the "initial view" in the labs.
    • There is no backup in view number 1.
    • In general, the view service is allowed to create views that have a primary but no backup, if it does not have enough available servers to find a backup.
    • When operating without a backup, the primary acts essentially like the RPC server from lab 1. It just executes any client requests it receives and responds to them.
    • In this scenario, if the primary fails, the system is stuck forever.
  • Later, after fully setting up view number 1, the view service will hopefully have heard pings from at least one other server. It can then select one of those servers to be backup.
    • Since the view server is updating who is backup, it must create a new view by incrementing the view number again. It should not "update" the backup for the already created view number 1. (The view server's bookkeeping for these scenarios is sketched in code after this list.)
  • In a typical view when everything is working well, there will be a primary and a backup.
    • Clients can learn who the primary is by asking the view service.
    • Clients then submit their requests to that primary, who follows the "normal case" workflow to execute those requests.
    • Things go on like this until some kind of failure occurs.
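
The view server's bookkeeping for these scenarios might be sketched as follows, again with hypothetical names and the View record from before. A real design also needs timers to expire servers that stop pinging, and (as discussed under "Failure scenarios" below) should wait for the current view to be acknowledged before changing views again.

    import java.util.LinkedHashSet;
    import java.util.Set;

    class ViewServer {
        private final Set<String> alive = new LinkedHashSet<>();  // recent pingers
        private View view = new View(0, null, null);               // startup view

        void onPing(String server) {
            alive.add(server);
            maybeChangeView();
        }

        private void maybeChangeView() {
            if (view.primary() == null && !alive.isEmpty()) {
                // First ping ever: that server becomes primary of view 1.
                view = new View(1, alive.iterator().next(), null);
            } else if (view.primary() != null && !view.hasBackup()) {
                // We have a primary but no backup; promote a spare server by
                // creating a *new* view (never mutate an existing view).
                for (String s : alive) {
                    if (!s.equals(view.primary())) {
                        view = new View(view.viewNum() + 1, view.primary(), s);
                        break;
                    }
                }
            }
        }
    }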

Failure scenarios

The backup fails

  • Suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\) , and then \(S_2\) fails.
  • Clients can no longer successfully execute requests, because the primary needs to talk to the backup before it executes each request, but the backup is down.
  • The view service will detect that the backup is down because it stops pinging.
  • The view service initiates a view change by incrementing the view number to 4 and selecting a new primary and backup.
    • Since the primary did not fail, the view service leaves the primary as is, so the primary for view number 4 is also \(S_1\).
    • The view service selects a new backup server from the set of available servers. Let's say it selects \(S_3\), so \(S_3\) is the backup for view number 4.
  • We want to start executing client requests again. To do that, we need to be able to execute them both on the backup and on the primary, and we need to know that the two copies of the application are in sync with each other.
    • It's important that they are in sync so that if another failure occurs, say the primary fails later, the backup can take over and return correct answers.
  • The problem is that \(S_3\) was neither primary nor backup in the previous view, so it does not have an up-to-date copy of the application. So it is not ready to begin executing client requests.
  • In order to prepare the new backup, the primary needs to do a state transfer, which involves sending a copy of the entire current application state to the backup. (State transfer is sketched in code after this list.)
  • The backup replaces whatever application state it had with what the primary sends it, and then acknowledges back to the primary that it has received the state.
  • Now the primary can consider the view to be really started, and start executing client requests.
    • Before the state transfer has been acknowledged, the primary cannot accept any client requests.
  • Once the state is transferred to the backup, the protocol is fault tolerant again.
    • We need to tell the view service that this has happened! Otherwise, the view service will not know that it is ok to do another failover if it detects the new primary has crashed.
    • If, on the other hand, the new primary crashed before finishing the state transfer, then the system is stuck. It cannot tolerate two node failures that happen so soon one after the other that we didn't have a chance to complete the state transfer.
  • We also need to tolerate message drops during the state transfer process.
    • The primary needs to retransmit the state transfer message until it gets an acknowledgement.
  • We also need to tolerate message dups and delays during the state transfer process.
    • The backup should not perform a state transfer for an old view, nor for the current view if it already has done a state transfer.
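
Here is one way the new backup might handle state transfer, including dropped, duplicated, and delayed messages. The names are again hypothetical, and the primary is assumed to retransmit the transfer message until it sees the acknowledgement.

    class Backup {
        private int currentViewNum;   // the view this server believes is current
        private boolean transferred;  // already applied a transfer for this view?
        private Object appState;      // the entire application state

        void onStateTransfer(int viewNum, Object state, String primary) {
            if (viewNum < currentViewNum) {
                return;  // delayed transfer for an old view: drop it
            }
            if (viewNum == currentViewNum && transferred) {
                // Duplicate of a transfer we already applied: do not overwrite
                // state we may have advanced past; just re-send the ack.
                send(primary, "STATE-ACK " + viewNum);
                return;
            }
            currentViewNum = viewNum;
            appState = state;         // replace our state wholesale
            transferred = true;
            send(primary, "STATE-ACK " + viewNum);
        }

        private void send(String to, String message) { /* elided */ }
    }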

The primary fails

Consider a similar situation to the previous section.

  • Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\), and then \(S_1\) fails.
  • Clients can no longer successfully execute requests, because the primary is down.
  • The view service will detect that the primary is down because it stops pinging.
  • The view service initiates a view change by incrementing the view number to 4 and selecting a new primary and backup.
  • We know that the first thing the new primary is going to do is state transfer.
    • Therefore, in order not to lose committed data, the only reasonable choice for the new primary is the old backup, since the old backup is the only available server with an up-to-date copy of the application.
    • The key constraint is: if a client has received a response to a request in the past, then we must guarantee that the new primary has an application state that reflects the results of that request.
    • Again, the only way to satisfy this constraint is by selecting the old backup as the new primary (see the sketch after this list).
  • So the view service will select \(S_2\) as the new primary, and any other available server, say \(S_3\), as the new backup.
  • The view service informs servers about the new view. The new primary will initiate a state transfer as before. Once the state transfer is complete, client requests can resume being executed.
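
Continuing the hypothetical ViewServer sketch, the failover rule when the primary fails might look like this. The essential point is that the old backup, and only the old backup, can safely become the new primary.

    import java.util.Set;

    class FailoverRule {
        // Compute the next view after the primary of `old` is detected down.
        static View onPrimaryFailure(View old, Set<String> alive) {
            if (old.backup() == null || !alive.contains(old.backup())) {
                // No up-to-date server remains: the system is stuck.
                throw new IllegalStateException("no safe new primary");
            }
            String newPrimary = old.backup();  // reflects all committed requests
            String newBackup = alive.stream()
                    .filter(s -> !s.equals(newPrimary))
                    .findFirst()
                    .orElse(null);             // may be null: a view with no backup
            return new View(old.viewNum() + 1, newPrimary, newBackup);
        }
    }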

The situation with no available servers

  • Now suppose we are in view number 3 with a primary \(S_1\) and a backup \(S_2\), and that there are no other servers available. (There might be other servers, but they already crashed or for some reason they cannot ping the view service at the moment, so they are not available to be primary or backup.)
  • If either server fails, there is no server to replace it. What should we do?
    • One option would be to do nothing. Just refuse to execute client requests.
    • At least this doesn't return wrong answers! But it also doesn't make progress.
  • A slightly better approach is to move to a view where the one remaining available server acts as primary on its own.
    • In this situation, we cannot tolerate any more crashes, but at least we can make progress after the previous crash.
    • If we get lucky, and some other server that the view service thought was down actually was ok and starts pinging again, then we can view change to yet another view with a primary and a backup, restoring fault tolerance.
    • If the primary crashes during a view without a backup, then the system is stuck forever, even if some other servers later become available.
      • We cannot use those nodes because they do not satisfy the constraint that their application state reflects the results of all requests that clients have received responses for.
      • The only way out of this situation is if the most recent primary somehow comes back. The view service can then change to a view using one of the new servers as the backup.
  • If the primary does not have a backup, then it executes client requests by acting alone, similar to the RPC server from lab 1. It does not forward the requests anywhere, but instead immediately executes them and responds to the client, as sketched below.
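
Extending the earlier Primary sketch (same hypothetical names), the request handler might simply branch on whether the current view has a backup:

    class PrimaryWithOptionalBackup {
        private View view;        // current view, learned from the view server
        private Application app;  // the replicated application

        void onClientRequest(String command, String clientAddress) {
            if (!view.hasBackup()) {
                // No backup: behave like the lab 1 RPC server and execute
                // the request immediately.
                send(clientAddress, "REPLY " + app.execute(command));
            } else {
                // Backup present: forward first, execute only after the ack.
                send(view.backup(), "FORWARD " + command + " " + clientAddress);
            }
        }

        private void send(String to, String message) { /* elided */ }
    }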

Lab 2

  • We have presented an overview of the kind of primary-backup protocol we want you to design in lab 2.
  • We have intentionally left out many details! These are for you to design and fill in.
    • There are many different ways to get a correct protocol. There is not just one right answer.
  • The act of taking high level ideas and turning them into concrete distributed protocols that can be implemented, evaluated, and deployed is a core skill that we hope you develop in this course!
  • Do not be surprised if the design process is harder than the implementation process in labs 2 through 4!

Design doc for lab 2

  • You will write a design doc (with your partner, if applicable) for lab 2 based on the high-level description from lecture.
  • Follow the template.
    • We have made a few minor changes to the template to help with common errors based on the lab 1 design docs. Be sure to check out the new version.
    • You can submit your design doc in any readable format as a PDF on Gradescope.
      • Handwritten is fine; so is plain ASCII text, a Word doc, \(\LaTeX\), or anything else.
    • Be sure these sections appear explicitly, with exactly these titles in large font, in your doc:
      • Preface
      • Protocol
      • Analysis
      • Conclusion
    • Besides those section titles, you do not have to follow the rest of the template exactly, but you must include the spirit of all the information requested in the template in some format.
    • Feel free to include diagrams or informal discussion wherever you see fit.
  • Your design doc should fill in all missing details that are "distributed design decisions".
    • You should be specific about state, messages, and timers. We were not specific in lecture.
    • Try to avoid low-level Java-specific details where possible. Your document should make sense in any programming language.
  • New! You should document the invariants and stable properties of your protocol in the analysis section.
  • Be sure to incorporate any relevant feedback you got on your lab 1 design doc.
  • For lab 2, you can omit the subsection on Performance from the Analysis section.