Lecture 12: MultiPaxos — Notes

These are notes from the lecture on April 24, 2026. See also the whiteboard descriptions and the whiteboard PDF.

These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.

From Single-Decree Paxos to a Replicated Log

Single-decree Paxos chooses one value. To build a fault-tolerant key-value store, we need to agree on a whole sequence of operations. MultiPaxos does this by running Paxos once per slot, where a slot is an index into a replicated log. All servers keep a local copy of the log, and all servers maintain a copy of the key-value store that they update by executing the operations in order.

A slot is chosen once Paxos reaches consensus on what operation goes there. Visually, a green checkmark in a slot means consensus has been reached; an empty slot means it has not. The log is conceptually infinite — new slots are appended as clients submit requests.

Diagram of state machine replication via logs. Three replica boxes labeled S1, S2, S3 sit inside a system box. A client C to the left has arrows to all three replicas, indicating that the client broadcasts each request to every replica. Below the system, a single conceptual log strip runs left to right with slots numbered 0 through 4 followed by an ellipsis. Slots 0, 2, and 3 contain green checkmarks indicating chosen entries; slots 1 and 4 are empty. The label "slot = index into log" appears beneath the strip.

Execution order vs. decision order

Slots can be decided out of order. Paxos running in slot 3 has no dependency on slot 2 being decided first. But execution must be in order. A server can only execute the operation in slot n once all slots 0 through n-1 are also decided. If slot 4 gets decided before slot 3, the server waits for slot 3 before executing either.

This is why the log matters: slot numbers define execution order even if consensus is reached in a different sequence.
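The in-order execution rule can be sketched in a few lines. This is a minimal illustration (the `Replica` class and its field names are assumptions, not from the lecture): slots may be decided in any order, but execution only advances over a contiguous decided prefix.

```python
class Replica:
    """Sketch: slots may be decided out of order, but a replica executes
    only a contiguous prefix of decided slots."""

    def __init__(self):
        self.log = {}        # slot -> decided operation
        self.executed = []   # operations applied to the state machine, in order
        self.next_slot = 0   # lowest slot not yet executed

    def decide(self, slot, op):
        self.log[slot] = op
        # Execute every contiguous decided slot starting at next_slot.
        while self.next_slot in self.log:
            self.executed.append(self.log[self.next_slot])
            self.next_slot += 1

r = Replica()
r.decide(0, "put(x,1)")
r.decide(2, "put(y,2)")   # slot 2 decided early: must wait for slot 1
assert r.executed == ["put(x,1)"]
r.decide(1, "get(x)")     # gap filled: slots 1 and 2 now execute
assert r.executed == ["put(x,1)", "get(x)", "put(y,2)"]
```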

Client workflow

Clients broadcast each request to all servers. Any server that receives a request:

  1. Looks at its local log and picks the lowest slot that appears empty.
  2. Runs single-decree Paxos in that slot, proposing the client's request as the value.
  3. Waits until a prefix of the log is decided (all slots from 0 up to and including the chosen slot).
  4. Executes those operations on its local key-value store.
  5. Responds to the client.

If Paxos for that slot was won by a different value (because another server also tried to fill it), the server must try again in a later slot.
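The retry loop can be sketched as follows. This is a hypothetical server-side handler, not the lab's API: `run_paxos` stands in for a single-decree Paxos instance that returns whatever value was chosen in the given slot.

```python
def handle_request(log, request, run_paxos):
    """Sketch (names assumed): propose the client's request in the lowest
    empty slot; if another value wins that slot, record the winner and
    retry in a later slot."""
    while True:
        slot = 0
        while slot in log:                 # lowest slot that appears empty
            slot += 1
        chosen = run_paxos(slot, request)  # single-decree Paxos for this slot
        log[slot] = chosen
        if chosen == request:
            return slot                    # our request occupies this slot

# Toy run_paxos: slot 0 was already won by another server's value.
outcomes = {0: "other-op"}
fake_paxos = lambda slot, value: outcomes.get(slot, value)
log = {}
assert handle_request(log, "our-op", fake_paxos) == 1
assert log == {0: "other-op", 1: "our-op"}
```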

Differences from Single-Decree Paxos

All nodes play all roles

In single-decree Paxos, proposers, acceptors, and learners were conceptually separate. In MultiPaxos, every server plays all three roles. This makes the algorithm fully symmetric: there are no special nodes from the servers' perspective (clients are still separate).

Because every server is an acceptor, a server acting as proposer must send messages to itself as well as to the other servers. In implementation, the self-message doesn't go over the network; instead, the proposer performs the acceptor's logic locally. But conceptually, whenever a proposer sends 1a or 2a, it includes itself in the broadcast.

One practical consequence: with three servers, a proposer only needs one other server to form a majority, because it counts itself. This is different from the three-acceptor single-decree setting where you needed two out of three separate acceptors.
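The quorum arithmetic is worth spelling out (illustrative only):

```python
# With n servers, a majority is n // 2 + 1 votes. A proposer that is also
# an acceptor counts its own vote, so it needs one fewer reply from others.
n = 3
majority = n // 2 + 1         # 2 of 3
others_needed = majority - 1  # only 1 other server besides the proposer
assert (majority, others_needed) == (2, 1)
```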

Spacetime sketch with three vertical process lines labeled S1, S2, S3. Time flows downward. From a single send event on S1, an arrow loops back to a slightly later receive event on S1 itself, and arrows also extend rightward to receive events on S2 and S3. The three arrows together are labeled 1a(r = (1, S1)), illustrating that when S1 sends a 1a message it broadcasts to all servers including itself.

Ballot numbers are pairs

In single-decree Paxos, ballot numbers had to be unique across proposers. In MultiPaxos, this is enforced structurally: every ballot is a pair (seq_num, server_id), ordered lexicographically. The sequence number is compared first; server ID breaks ties.

Examples:

  • (1, S_2) < (4, S_1) because 1 < 4
  • (1, S_2) > (1, S_1) because S_2 > S_1 (assuming S_2 is ranked higher)

Any server can pick a ballot number higher than any it has seen by incrementing its sequence number. Globally unique ballots are guaranteed without coordination.
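A convenient fact for implementations: representing ballots as tuples gives exactly this ordering for free, since Python tuples compare lexicographically. A small sketch (server IDs shown as integers, and `next_ballot` is an illustrative helper, not part of the lecture's protocol):

```python
# Ballots as (seq_num, server_id) pairs; Python tuples already compare
# lexicographically, matching the ordering described above.
assert (1, 2) < (4, 1)   # (1, S2) < (4, S1): sequence number compared first
assert (1, 2) > (1, 1)   # (1, S2) > (1, S1): server ID breaks ties

def next_ballot(highest_seen, my_id):
    """Pick a ballot higher than any seen, with no coordination needed."""
    seq, _ = highest_seen
    return (seq + 1, my_id)

assert next_ballot((4, 1), 2) > (4, 1)
```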

Slot numbers are added to all messages

In unoptimized MultiPaxos, all four message types carry a slot field to identify which independent Paxos instance they belong to:

  • 1a(slot, ballot) — prepare for a specific slot
  • 1b(slot, ballot, summary) — promise in a specific slot, with highest prior vote
  • 2a(slot, ballot, value) — accept request for a specific slot
  • 2b(slot, ballot) — accept acknowledgment (value omitted since the proposer, being also a learner, already knows what it proposed)

The 2b message omits the value because every server plays the learner role and the proposer knows the value it proposed. Including the slot and ballot number is sufficient.
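The four message types can be written down as simple records. This is one possible encoding (a sketch using dataclasses; the field names follow the lecture's notation, but the class names are assumptions):

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (seq_num, server_id)

@dataclass(frozen=True)
class P1a:              # prepare for a specific slot
    slot: int
    ballot: Ballot

@dataclass(frozen=True)
class P1b:              # promise, with highest prior vote in that slot
    slot: int
    ballot: Ballot
    summary: Optional[Tuple[Ballot, Any]]  # None if never voted in this slot

@dataclass(frozen=True)
class P2a:              # accept request
    slot: int
    ballot: Ballot
    value: Any

@dataclass(frozen=True)
class P2b:              # accept acknowledgment: no value field needed
    slot: int
    ballot: Ballot
```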

Unoptimized MultiPaxos

The simplest correct version of MultiPaxos just runs an independent copy of single-decree Paxos per slot:

  • Clients broadcast requests to all servers.
  • A server that receives a request finds an empty slot and runs two-phase Paxos in that slot.
  • Once the request's slot and all preceding slots are decided, the server executes the operations and responds to the client.

This works, but has two inefficiencies:

  1. Two round trips per request. Every request requires a phase-1 exchange (1a/1b) and then a phase-2 exchange (2a/2b) before the server can respond to the client.
  2. Contention from multiple proposers. Since every server receives the client broadcast and every server is a proposer, multiple servers will simultaneously try to fill the same slot. Only one will win (the one with the highest ballot), and the others have wasted work.

Spacetime diagram with five vertical process lines labeled C1, S1, S2, S3, C2. Time flows downward. C1 sends req1 to S1. S1 runs single-decree Paxos in slot 0: 1a(s=0, r=(1, S1)) is broadcast (with a self-loop on S1) to S2 and S3; S2 replies 1b(s=0, r=(1, S1), null) back to S1; S1 broadcasts 2a(s=0, r=(1, S1), req1) with a self-loop and to S2 and S3; S2 replies 2b(s=0, r=(1, S1)). S1 sends resp1 to C1. Then C2 sends req2 to S3. A similar two-phase exchange on slot 1 (drawn schematically) occurs among the servers, after which S3 sends resp2 to C2.

Distinguished Proposer Optimization (Leader Election)

The key optimization is to designate a single leader who does all the proposing, and to run phase 1 once across all slots rather than once per slot.

Phase 1 runs once, across all slots

In unoptimized MultiPaxos, each slot has its own independent phase 1. The optimization merges them: the 1a message drops the slot field entirely. A single 1a(ballot) now represents, conceptually, a phase-1 prepare sent simultaneously in every slot.

When an acceptor receives this slot-free 1a, it responds with a 1b that summarizes its voting history across all slots, as a list mapping each slot to the highest (ballot, value) it has voted for there. In the common case of a fresh cluster, this list is empty.

Once a server collects a majority of 1b responses to this combined 1a, it has completed phase 1 for all slots at once. It is now the leader for this ballot.

Because the 1a covers all slots, the ballot number is now unified across all slots. In the unoptimized protocol, slots were independent and each could start fresh with ballot 1. In the optimized protocol, round numbers are shared: electing a new leader requires a strictly higher ballot than any previously used.
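A new leader must combine the 1b summaries it receives: for each slot that any acceptor has voted in, it is bound to re-propose the value from the highest-ballot vote. A minimal sketch of that merge (function and variable names are assumptions; each summary is modeled as a dict mapping slot to the acceptor's highest (ballot, value) vote):

```python
def merge_1b_summaries(summaries):
    """Sketch: given a majority of 1b summaries, return the value the new
    leader must re-propose in each constrained slot. Slots absent from the
    result are unconstrained and free for new client requests."""
    highest = {}
    for summary in summaries:
        for slot, (ballot, value) in summary.items():
            if slot not in highest or ballot > highest[slot][0]:
                highest[slot] = (ballot, value)
    return {slot: value for slot, (ballot, value) in highest.items()}

# Fresh cluster: all summaries empty, so the leader is unconstrained.
assert merge_1b_summaries([{}, {}]) == {}
# Slot 0 carries a prior vote at ballot (1, 1); the new leader must keep it.
assert merge_1b_summaries([{0: ((1, 1), "op-A")}, {}]) == {0: "op-A"}
```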

Phase 2 is unchanged

Phase 2 (2a/2b) still operates per-slot. When a client request arrives, the leader just runs phase 2 in the appropriate slot, without any phase-1 overhead. This reduces the common-case latency from two round trips to one.

The leader handles client requests until it fails or the network partitions. Followers (non-leader servers) receive client broadcasts and can safely ignore them while they believe the leader is alive.
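The leader's fast path is then very short. A toy sketch under stated assumptions (the `leader` record and its fields are illustrative; the actual 2a/2b network exchange is elided into a comment):

```python
def leader_handle_request(leader, request):
    """Sketch of the leader's common case: no phase 1, just one phase-2
    exchange in the next free slot."""
    slot = leader["next_slot"]
    leader["next_slot"] += 1
    # Broadcast 2a(slot, leader's ballot, request) to all servers, including
    # itself; after a majority of 2b replies, the slot is chosen.
    leader["log"][slot] = request
    return slot

leader = {"next_slot": 0, "log": {}}
assert leader_handle_request(leader, "req1") == 0
assert leader_handle_request(leader, "req2") == 1
assert leader["log"] == {0: "req1", 1: "req2"}
```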

Spacetime diagram with five vertical process lines C1, S1, S2, S3, C2. Time flows downward. At the top, S1 broadcasts 1a(r = (1, S1)) (no slot number) to itself, S2, and S3. S2 replies 1b(r = (1, S1), summ = []) back to S1 — empty list of prior votes across all slots. After leader election, C1 sends req1 to S1 (the leader). S1 broadcasts 2a(s=0, r=(1, S1), req1) to itself, S2, and S3. S2 replies 2b(s=0, r=(1, S1)) back to S1. S1 sends resp1 to C1. Later, C2 sends req2 to S1, and a single phase-2 exchange (2a/2b for slot 1) completes before S1 sends resp2 back to C2.

Leader change with a higher ballot

When a follower suspects the leader has crashed, it picks a ballot number higher than any it has seen, broadcasts 1a(new_ballot) to all servers, and attempts to become the new leader. If it collects a majority of 1b responses, it succeeds and the old leader — if still running — will find itself unable to get votes for its lower ballot and must step down.

A server that detects a higher ballot in any incoming message knows its own leadership is over. It can attempt to reclaim leadership by choosing an even higher ballot and re-running phase 1.

Heartbeats

Leaders send periodic heartbeat messages to followers to signal they are still alive. When the system is busy, normal 2a messages serve as implicit heartbeats. When the system is idle, the leader sends explicit heartbeats.

Each follower runs a timer. If the timer expires without hearing from the leader, the follower assumes the leader has crashed and initiates a leader election by sending a new 1a with a higher ballot. The timer interval is a tunable parameter that trades off detection latency against spurious leader changes.
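The follower's timer check can be sketched as a single predicate (a hypothetical helper; real implementations would run this on a timer thread or event loop, and `start_election` stands in for sending a new 1a with a higher ballot):

```python
def check_leader_timeout(timeout, last_heard, now, start_election):
    """Sketch: if no heartbeat (or 2a) from the leader within `timeout`
    seconds, initiate a leader election. Returns True if one was started."""
    if now - last_heard > timeout:
        start_election()   # e.g., broadcast 1a with a higher ballot
        return True
    return False

fired = []
# Heard from the leader 0.3s ago with a 0.5s timeout: no election.
assert check_leader_timeout(0.5, 10.0, 10.3, lambda: fired.append(1)) is False
# 0.8s of silence exceeds the timeout: election begins.
assert check_leader_timeout(0.5, 10.0, 10.8, lambda: fired.append(1)) is True
assert fired == [1]
```

A shorter timeout detects crashes faster but risks electing a new leader during transient slowness, which is exactly the trade-off described above.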

What's Next

Monday's lecture will cover the remaining details needed for Lab 3: catching up lagging servers, log compaction, and other practical considerations.