
Lecture 4: More Primary-Backup — Notes

These are notes from the lecture on April 6, 2026. See also the whiteboard descriptions and the whiteboard PDF.

These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.

Why Not TCP?

First, a quick aside about TCP.

Some of you may be familiar with the networking protocol TCP. TCP provides ordered, reliable message delivery. Underneath, TCP handles retransmission, sequence numbers, and tolerates reordering, drops, and duplicates. These are exactly the problems we solved in lab 1! So a natural question is: can we just use TCP to implement RPC?

No, not really. The problem is reconnections. If for any reason the TCP connection gets closed (a router loses power, your connection drops or times out, the client switches from cell data to wifi, etc.), then you lose the state of that TCP connection. When you reconnect, you have a fresh TCP connection with no memory of the previous one. At that point, the client doesn't know whether its last request was executed or not. This is exactly the same problem as with naive RPC!

So even with TCP, you need application-level sequence numbers and all the duplicate detection and retransmission logic from lab 1. TCP solves the problem on a per-connection basis, but not across connections. One way to view this is that we can use TCP as a performance optimization. In practice, we can and should use TCP and implement our own application-level logic to handle reconnections. TCP will likely perform better than relying on application-level logic to handle all network blips, and we get the benefits of congestion control.
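To make the application-level logic concrete, here is a minimal sketch of the kind of duplicate detection lab 1 requires, layered on top of any transport. All names here are illustrative, not the labs' actual API; it assumes each client tags requests with a client id and a sequence number and has at most one outstanding request.

```python
# Sketch of application-level at-most-once request handling. The server
# remembers the last sequence number seen from each client, plus the cached
# response, so a retried request (e.g. after a TCP reconnect) is replayed
# rather than re-executed.

class AtMostOnceServer:
    def __init__(self):
        self.store = {}          # the key-value state
        self.last_seen = {}      # client_id -> (seq, cached_response)

    def handle(self, client_id, seq, op):
        seen = self.last_seen.get(client_id)
        if seen is not None and seen[0] == seq:
            # Duplicate of the last request: replay the cached response
            # instead of executing the operation a second time.
            return seen[1]
        response = self.execute(op)
        self.last_seen[client_id] = (seq, response)
        return response

    def execute(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.store[key] = rest[0]
            return "ok"
        return self.store.get(key)
```

Because each client has at most one outstanding request, remembering only the most recent sequence number per client is enough.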

Recap: Primary-Backup So Far

Ok, so back to the main narrative.

Last lecture established the basic primary-backup protocol:

  1. Client sends request to the primary
  2. Primary assigns a sequence number, forwards to the backup
  3. Backup executes the request and acknowledges
  4. Primary executes the request and responds to the client

This protocol keeps two replicas in sync using the deterministic state machine property. But it doesn't tolerate any machine failures. If either the primary or the backup crashes, the system is stuck. In some sense, it's worse than single-server RPC: slower (every operation goes through two machines) but with no additional fault tolerance.
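The four steps above can be sketched in a few lines. This is a deliberately synchronous toy (real implementations use asynchronous messages), and the class and method names are illustrative:

```python
# Minimal sketch of the basic primary-backup flow. Both replicas run the
# same deterministic state machine, so applying the same operations in the
# same order yields the same state.

class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.store[key] = rest[0]
            return "ok"
        return self.store.get(key)

class Primary(Replica):
    def __init__(self, backup):
        super().__init__()
        self.backup = backup
        self.next_seq = 0

    def handle_client(self, op):
        seq = self.next_seq          # 2. assign a sequence number
        self.next_seq += 1
        self.backup.apply(op)        # 2-3. forward; backup executes and acks
        return self.apply(op)        # 4. primary executes, replies to client
```

Note the ordering: the primary only executes and replies after the backup has acknowledged, so the backup is never behind the responses clients have seen.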

Approximate Failure Detection

We established last time that perfect failure detection is impossible in an unreliable network. You can't distinguish a crashed node from one whose messages are all being dropped. The solution: accept that failure detection will be approximate, and design the protocol to work correctly even when the detector is wrong.

Approximate failure detection can be wrong in both directions:

  • False positive: think a node has failed when it hasn't (because messages were delayed or dropped)
  • False negative: think a node is alive when it has actually crashed (because the network delivered a duplicate of an old "yes I'm alive" response)

A Terrible Idea: Mutual Monitoring

Here is a first attempt at handling failures. If we want to handle the case where the backup fails, we can have the primary monitor the backup. If the primary decides the backup has failed, the primary can unilaterally revert to single-server mode and continue executing client requests.

Similarly, if we want to handle the case where the primary fails, we can have the backup monitor the primary. If the backup decides the primary has failed, then the backup can unilaterally revert to single-server mode, notify clients (somehow), and continue executing client requests.

Either of these on its own might be somewhat reasonable. Together, they're a really bad idea. Both nodes can simultaneously conclude that the other has failed, for example if the network fails between them while both machines are fine. Both declare themselves the single server, clients split between them, and the two copies of the key-value store diverge. This is the split-brain problem: two subsystems independently claiming to act on behalf of the entire system, making contradictory decisions.

The fundamental tension: failure detection of a node cannot run on that node. But if you run failure detection on multiple nodes, they can reach contradictory conclusions. The correct solution (lab 3, Paxos) uses majority voting among three or more servers to reach agreement.

The View Server

For lab 2, we use a simpler approach instead: a single dedicated failure detector node. Obviously if the failure detector node itself fails, then the system gets stuck. But if any other node fails, then the failure detector node can try to observe this and help the system recover. This is strictly better than lab 1, though still not perfectly fault tolerant.

In the lab 2 protocol, the failure detector does more than just detect failures. It decides who the current primary and backup are (called the current "view"), and broadcasts this information to all nodes. So we rename it the view server.

View server architecture: failure detector manages primary, backup, and spare servers

Views

A view consists of three pieces of information:

  • View number - a monotonically increasing version number
  • Primary - which server is currently the primary
  • Backup - which server is currently the backup

When the view server changes its mind about who should be primary or backup, it increments the view number and publishes a new view. Nodes track the highest view number they've seen and ignore any view with a smaller number. This prevents stale, redelivered messages from overwriting current information.
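The stale-view filtering above is simple to state in code. Here is a sketch of what a node does when a view announcement arrives; field names are illustrative:

```python
# Sketch of rejecting stale views via the monotonically increasing view
# number. A redelivered old view message must never overwrite newer
# information, regardless of arrival order.

from dataclasses import dataclass

@dataclass
class View:
    num: int
    primary: str
    backup: str

class Node:
    def __init__(self):
        self.view = View(0, None, None)

    def on_view(self, new_view):
        # Ignore any view numbered at or below what we already have.
        if new_view.num <= self.view.num:
            return False
        self.view = new_view
        return True
```

The same version-number pattern applies any time changing information travels over an unreliable network.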

View transitions: View 1 has P=S1, B=S2; after S2 fails, View 2 has P=S1, B=S3 with state transfer

This is a general pattern in distributed systems: any time information changes over time and is communicated over an unreliable network, attach a monotonically increasing version number. You can't rely on message arrival order to determine which information is most recent.

Pings

The view server performs approximate failure detection by collecting ping messages from the servers. If a server stops pinging for long enough, the view server considers it failed. The view server also responds to queries about the current view.
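A ping-timeout detector is only a few lines. In this sketch the timeout value and class name are illustrative assumptions, not the labs' actual parameters:

```python
# Sketch of approximate failure detection at the view server: a server is
# considered alive only if it has pinged within the last `timeout` time
# units. This is approximate by design: a slow or partitioned network can
# make a live server look dead (a false positive).

class PingTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ping = {}   # server -> time of most recent ping

    def on_ping(self, server, now):
        self.last_ping[server] = now

    def alive(self, now):
        return {s for s, t in self.last_ping.items()
                if now - t <= self.timeout}
```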

Idle Servers and Replacement

Rather than reverting to single-server mode when a node fails, we keep a pool of idle servers that can replace failed nodes. These don't participate in normal operation. They are spare capacity (or, in the cloud, machines you can provision on demand).

When the backup fails: the view server selects an idle server as the new backup, increments the view number, and publishes the new view. The primary remains the same.

When the primary fails: the view server promotes the backup to primary and selects an idle server as the new backup. The primary is not replaced with an idle server directly. This is so that state transfer (see below) is always from the new primary to the new backup.

If both the primary and backup fail simultaneously, all data is lost (assuming in-memory storage). This is an inherent limitation of this protocol.
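The replacement rules above can be summarized as a single transition function. This is a sketch with illustrative names, ignoring the state-transfer handshake that must follow each transition:

```python
# Sketch of the view server's replacement rules. `idle` is the pool of
# spare servers; `failed` is the server the detector just declared dead.

def next_view(view_num, primary, backup, idle, failed):
    """Return (view_num, primary, backup, idle) after `failed` is declared dead."""
    if failed == backup:
        # Backup failed: keep the primary, draw a new backup from the pool.
        new_backup = idle.pop() if idle else None
        return view_num + 1, primary, new_backup, idle
    if failed == primary:
        # Primary failed: promote the backup (it has the data) and draw a
        # new backup from the pool; never promote an idle server directly.
        new_backup = idle.pop() if idle else None
        return view_num + 1, backup, new_backup, idle
    return view_num, primary, backup, idle
```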

Limitations

This protocol has several failure scenarios it cannot handle:

  • The view server itself fails. Then everything stops.
  • Both the primary and backup fail at the same time. Data is lost and not recoverable.
  • All idle servers are exhausted. Then we can fall back to single-server mode to keep executing requests, but we won't be able to tolerate additional failures until more servers come online.

Some of these are addressed by lab 3 (Paxos). Others are fundamental to distributed systems. If every machine in your system crashes simultaneously, then there's nothing you can do.

State Transfer

When a new backup is selected, it has no data. The whole point of a backup is to have a second copy of the data, so the new backup first needs to receive the current state.

State transfer is the process of sending the entire key-value store from the primary to the new backup. In the labs, this is done in a single message. (In practice, you'd probably want to send the data in chunks.)

The state transfer protocol

  1. The view server publishes a new view with the new backup
  2. The primary (which is the same in both the old and new views, if the backup was the one that failed) receives the new view
  3. The primary sends a state transfer message containing its entire database to the new backup
  4. The new backup overwrites its state with the received data and sends an acknowledgment
  5. Only after receiving the acknowledgment does the primary resume executing client requests

Why the primary must finish state transfer before executing client requests

The primary cannot execute client requests during state transfer. If it forwarded a request to the new backup before state transfer completed, message reordering could cause the request to arrive first. The backup would execute it against an empty database, diverging from the primary. The primary must wait for the state transfer acknowledgment before resuming normal operation.
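The handshake and the blocking behavior can be sketched as follows. The classes and the synchronous method calls are illustrative; in a real system the transfer and the acknowledgment are messages over the network:

```python
# Sketch of the state-transfer handshake: the primary refuses client
# requests from the moment it learns of a new backup until that backup
# acknowledges receipt of the full state.

class Primary:
    def __init__(self, store):
        self.store = store
        self.transfer_acked = True

    def start_transfer(self, backup):
        self.transfer_acked = False
        backup.install(dict(self.store))   # 3. send the entire database

    def on_transfer_ack(self):
        self.transfer_acked = True          # 5. resume client requests

    def handle_client(self, op):
        if not self.transfer_acked:
            raise RuntimeError("state transfer in progress; retry later")
        # ... normal primary-backup processing would go here ...
        return "ok"

class Backup:
    def install(self, state):
        self.store = state                  # 4. overwrite local state
        # (the acknowledgment is a separate message back to the primary)
```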

Why the backup is promoted on primary failure

When the primary fails, why promote the old backup rather than picking an idle server as the new primary? Because after a primary failure, the only surviving copy of the data is on the old backup. That node must send its state to the new backup, and state transfer always goes from primary to backup. Promoting the old backup to primary keeps the protocol simple: state transfer always goes from the new primary to the new backup, and the node doing the transfer is always the one that has the data.

Failure during state transfer

If a node fails during state transfer (before the acknowledgment is received by the view server), the protocol gets stuck. It would be dangerous for the view server to switch to yet another new backup before confirming that the state transfer completed. The transfer may in fact have completed without the view server knowing it yet, in which case the primary and backup are potentially already executing client requests. On the other hand, the transfer may not have completed, in which case it is unsafe to promote that backup in a future view, since it doesn't have the state. So while state transfer is happening, the protocol is "vulnerable": it cannot tolerate additional failures. (We will fix this in lab 3.)