Lecture 3: Primary-Backup — Notes
These are notes from the lecture on April 3, 2026. See also the whiteboard descriptions and the whiteboard PDF.
These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.
RPC Protocol Variants
Before diving into the next topic, a note on vocabulary used in the problem set. There are four named variants of the RPC protocol:
- Naive RPC — send request, wait for response. No fault tolerance at all. The client can easily get confused if a duplicate old response is re-delivered, because it will assume the response answers its most recent request. (No ID to tell them apart.)
- Naive RPC with request IDs (called "naive RPC + sequence numbers" in lecture) — add unique identifiers to messages so the client can match responses to requests.
- At-least-once RPC — naive RPC with request IDs plus retransmission. The client retransmits on timeout. Guarantees the request executes at least once (assuming not every message is dropped forever), but may execute more than once.
- At-most-once RPC — naive RPC with request IDs plus server-side deduplication (but no retransmission). Guarantees each request executes at most once.
- Exactly-once RPC — retransmission plus deduplication. This is the protocol from the end of lecture 2.
And note that "at-least-once" and "exactly-once" are slight misnomers: if the network drops all messages forever, these protocols just retry forever and never actually execute any requests. This is the best we can do in such a situation.
Tangent: One way to work around this issue theoretically is to assume what is called network fairness: you can't drop the same message forever. You can drop arbitrarily many messages, but if a message is retransmitted indefinitely, eventually one copy gets through. Then we can say: in any fair execution, the request is executed at least once (or exactly once depending on the protocol).
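To make the combination concrete, here is an illustrative sketch (not the labs' code; all names are made up) of the server side of exactly-once RPC: the client attaches a (client ID, request ID) pair and retransmits on timeout, and the server deduplicates by caching responses.

```python
# Illustrative sketch of exactly-once RPC's server side:
# retransmission (client) plus deduplication (server).
class Server:
    def __init__(self):
        self.store = {}   # the key-value state
        self.seen = {}    # (client_id, request_id) -> cached response

    def handle(self, client_id, request_id, op, key, value=None):
        # Deduplication: a retransmitted request must not execute twice,
        # so duplicates get the cached response instead.
        if (client_id, request_id) in self.seen:
            return self.seen[(client_id, request_id)]
        if op == "append":
            self.store[key] = self.store.get(key, "") + value
            result = self.store[key]
        else:  # "get"
            result = self.store.get(key)
        self.seen[(client_id, request_id)] = result
        return result

server = Server()
first = server.handle("c1", 1, "append", "k", "x")  # executes: "x"
dup = server.handle("c1", 1, "append", "k", "x")    # duplicate: still "x", not "xx"
assert first == dup == "x"
```

Note that append (unlike put) is not idempotent, which is why the cached response, not re-execution, is what makes the duplicate harmless.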
The Problem: Node Failures
Our exactly-once RPC protocol handles network failures but not node failures. If the server crashes, the entire system stops — the server was the only machine with the key-value store, so it's simply gone.
Replication
To tolerate node failures, every critical task must be doable by more than one machine. If your protocol depends on something happening and only one machine can do it, you don't tolerate that machine's failure.
The naive approach (and why it fails)
A first attempt: just have two key-value stores and let the client use whichever one is available. This doesn't work because the two stores have different data. When the first one crashes and the client switches to the second, it starts from scratch — all data is lost.
Having a second key-value store is only useful if it has the same data. This is the core challenge of replication.
Keeping replicas in sync
We can't update two machines at exactly the same time — we're communicating over an unreliable network with no synchronized clocks. So we need a protocol that ensures both replicas execute the same operations in the same order. If they start in the same state and execute the same deterministic operations in the same order, they'll always be in the same state (or one will be slightly ahead of the other).
Deterministic State Machines
A deterministic state machine has:
- States (possibly infinitely many — e.g., all possible hash maps)
- Inputs (called Command in the labs)
- Outputs (called Result in the labs)
The key property: determinism — if two state machines are in the same state and you execute the same command, they produce the same output and transition to the same next state.
Key-value stores are deterministic state machines. The commands are get, put, and append with their arguments. The state is the hash map. This is what makes replication work — doing the same operations in the same order in two places yields the same result.
Most real programs are not deterministic (random number generators, reading the current time, etc.). If you want to use replication, the replicated component must be deterministic.
In the labs, the deterministic state machine is represented by the Application class.
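The following sketch (in Python, with assumed names; the labs' Application class is the Java analogue) shows a key-value store as a deterministic state machine and checks the determinism property directly.

```python
# Illustrative sketch: a key-value store as a deterministic state machine.
class KVStore:
    def __init__(self):
        self.state = {}  # the hash map is the machine's state

    def execute(self, command):
        op, *args = command
        if op == "put":
            key, value = args
            self.state[key] = value
            return "ok"
        if op == "append":
            key, value = args
            self.state[key] = self.state.get(key, "") + value
            return self.state[key]
        key, = args  # "get"
        return self.state.get(key)

# Determinism: two copies fed the same commands in the same order
# produce the same outputs and end in the same state.
a, b = KVStore(), KVStore()
commands = [("put", "x", "1"), ("append", "x", "2"), ("get", "x")]
assert [a.execute(c) for c in commands] == [b.execute(c) for c in commands]
assert a.state == b.state == {"x": "12"}
```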
Primary-Backup
Why not let clients decide the request order?
With one client, the client could send each request (with a sequence number) to both replicas and they'd stay in sync. But with multiple clients, the server needs to resolve ordering across different clients' sequence number spaces. So the ordering responsibility cannot easily be delegated to clients.
The primary-backup idea
Designate one replica as the primary and the other as the backup. The primary is in charge of:
- Deciding the order of requests (when multiple clients send requests concurrently)
- Telling the backup what to do (forwarding requests with sequence numbers)
Normal operation (executing a request)
- Client sends request to the primary
- Primary assigns a sequence number and forwards the request to the backup
- Backup executes the request on its copy of the state machine (throws away the output), then sends an acknowledgment back to the primary with the sequence number
- Primary executes the request on its copy, takes the output, and sends the response back to the client
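The steps above can be sketched as follows (assumed names, not the labs' API; direct method calls stand in for a reliable network, and only put/get are shown).

```python
# Illustrative sketch of the normal-case primary-backup flow.
class Backup:
    def __init__(self):
        self.state = {}

    def replicate(self, seq, command):
        op, key, value = command
        if op == "put":
            self.state[key] = value      # execute, discard the output
        return seq                       # acknowledge with the sequence number

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup
        self.seq = 0

    def handle_client(self, command):
        self.seq += 1                                   # assign a sequence number
        ack = self.backup.replicate(self.seq, command)  # forward to the backup
        assert ack == self.seq                          # wait for the backup's ack...
        op, key, value = command                        # ...then execute locally
        if op == "put":
            self.state[key] = value
            return "ok"                                 # respond to the client
        return self.state.get(key)

backup = Backup()
primary = Primary(backup)
primary.handle_client(("put", "k", "v"))
assert primary.state == backup.state == {"k": "v"}
```

Note that the backup executes before the primary does; the reason is discussed below.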
The primary-backup communication is essentially RPC — all the same mechanisms are needed (retransmission, deduplication, response retransmission for duplicate requests).
Two levels of sequence numbers
There are two sequence number spaces: the client sequence numbers (between client and primary, for the client's RPC) and the replication sequence numbers (between primary and backup, for ordering requests across all clients). Both are necessary.
Why the backup executes first
The primary must wait for the backup's acknowledgment before responding to the client. If the primary executed and responded immediately, it would have no confirmation that the backup even received the request — the backup might have failed, or the primary might no longer be the primary (a possibility once we introduce failover, where the backup replaces it).
The AMO application
In the labs, the AMOApplication class wraps an Application and handles duplicate detection and response storage (the at-most-once logic). The backup stores responses in its AMOApplication, so if the primary fails and the client retransmits to the backup, it can return the stored response.
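A minimal Python sketch of this wrapper (the labs' AMOApplication is the Java analogue; the wrapped Application here is a made-up non-idempotent counter):

```python
# Illustrative sketch: wrapping an application with at-most-once logic.
class Application:
    def __init__(self):
        self.counter = 0

    def execute(self, command):
        self.counter += command   # some non-idempotent operation
        return self.counter

class AMOApplication:
    def __init__(self, app):
        self.app = app
        self.responses = {}   # (client_id, request_id) -> stored response

    def execute(self, client_id, request_id, command):
        key = (client_id, request_id)
        if key not in self.responses:      # new request: execute exactly once
            self.responses[key] = self.app.execute(command)
        return self.responses[key]         # duplicates get the stored response

amo = AMOApplication(Application())
assert amo.execute("c1", 1, 5) == 5
assert amo.execute("c1", 1, 5) == 5    # duplicate: not re-executed
assert amo.execute("c1", 2, 5) == 10   # a new request does execute
```

Because both the primary and the backup hold this wrapper, a retransmitted request lands on stored state at either replica rather than being executed a second time.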
Replication overhead
Yes, every operation is computed twice — that's the fundamental cost of replication. If half your machines are replicas, you lose at least a factor of two in performance. This is the inherent tension between fault tolerance (replication) and performance (horizontal scaling).
Performance from the client's perspective
The client doesn't receive a response until: the primary contacts the backup → the backup executes the request → the backup replies → the primary executes the request → the primary sends the response. This is slower than single-server RPC.
Failover
What happens when a node fails? We need failover: replacing one node with another in some role (e.g., the primary fails and the backup takes over).
Failure detection is impossible
Ideally, we'd first detect that a node has failed, then fail over. But in the standard fault model, failure detection is impossible. The reason: if you send a node a thousand messages and get no response, that could be because the node crashed or because the network is dropping/delaying all your messages. You can't distinguish these two cases.
This is a fundamental limitation. The solution (covered in lecture 4) requires a different approach than "detect then failover."