Lecture 12: MultiPaxos — Notes
These are notes from the lecture on April 24, 2026. See also the whiteboard descriptions and the whiteboard PDF.
These materials were drafted by AI based on the live whiteboard PDF and audio transcript from the corresponding lecture and then reviewed and edited by course staff. They may contain errors. Please let us know if you spot any.
From Single-Decree Paxos to a Replicated Log
Single-decree Paxos chooses one value. To build a fault-tolerant key-value store, we need to agree on a whole sequence of operations. MultiPaxos does this by running Paxos once per slot, where a slot is an index into a replicated log. All servers keep a local copy of the log, and all servers maintain a copy of the key-value store that they update by executing the operations in order.
A slot is chosen once Paxos reaches consensus on what operation goes there. Visually, a green checkmark in a slot means consensus has been reached; an empty slot means it has not. The log is conceptually infinite — new slots are appended as clients submit requests.
Execution order vs. decision order
Slots can be decided out of order. Paxos running in slot 3 has no dependency on slot 2 being
decided first. But execution must be in order. A server can only execute the operation in
slot n once all slots 0 through n-1 are also decided. If slot 4 gets decided before
slot 3, the server waits for slot 3 before executing either.
This is why the log matters: slot numbers define execution order even if consensus is reached in a different sequence.
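As a concrete sketch of this rule, the following Python fragment (purely illustrative; the names execute_ready_ops, decided, and apply are not from any lab codebase) applies decided operations in slot order and stops at the first undecided slot:

```python
# Sketch: execute only the contiguous decided prefix of the log.
# `decided` maps slot number -> decided operation; names are illustrative.

def execute_ready_ops(decided, next_to_execute, apply):
    """Apply decided operations in slot order, stopping at the first gap.

    decided:          dict mapping slot -> decided operation
    next_to_execute:  lowest slot whose operation has not yet been applied
    apply:            callback that runs an operation on the local KV store
    Returns the new value of next_to_execute.
    """
    while next_to_execute in decided:
        apply(decided[next_to_execute])
        next_to_execute += 1
    return next_to_execute

# Example: slot 4 is decided before slot 3, so execution stops at slot 3
# until slot 3 is also decided.
ops = {0: "put(x,1)", 1: "get(x)", 2: "put(y,2)", 4: "put(z,3)"}
cursor = execute_ready_ops(ops, 0, print)        # executes slots 0..2, cursor == 3
ops[3] = "append(x,9)"
cursor = execute_ready_ops(ops, cursor, print)   # executes slots 3 and 4
```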
Client workflow
Clients broadcast each request to all servers. Any server that receives a request:
- Looks at its local log and picks the lowest slot that appears empty.
- Runs single-decree Paxos in that slot, proposing the client's request as the value.
- Waits until a prefix of the log is decided (all slots from 0 up to and including the chosen slot).
- Executes those operations on its local key-value store.
- Responds to the client.
If Paxos for that slot was won by a different value (because another server also tried to fill it), the server must try again in a later slot.
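A rough sketch of this retry loop, assuming a hypothetical run_paxos(slot, value) helper that runs single-decree Paxos for one slot and returns whatever value was actually chosen there (the wait-for-prefix and execute steps from the list above are omitted):

```python
# Sketch of the server-side handling of one client request in unoptimized
# MultiPaxos. `log` maps slot -> chosen value; all names are illustrative.

def lowest_empty_slot(log):
    slot = 0
    while slot in log:
        slot += 1
    return slot

def handle_client_request(request, log, run_paxos):
    """Keep proposing `request` in successively later slots until it wins one."""
    while True:
        slot = lowest_empty_slot(log)        # lowest slot with no chosen value
        chosen = run_paxos(slot, request)    # two-phase Paxos for this slot
        log[slot] = chosen
        if chosen == request:
            return slot                      # this slot now holds our request
        # Another proposer's value won this slot; try again in a later one.

# Toy example: a stand-in "Paxos" that lets a competing value win slot 0.
fake_outcomes = {0: "other-op"}
def fake_paxos(slot, value):
    return fake_outcomes.get(slot, value)

log = {}
assert handle_client_request("put(x,1)", log, fake_paxos) == 1   # lost slot 0, won slot 1
```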
Differences from Single-Decree Paxos
All nodes play all roles
In single-decree Paxos, proposers, acceptors, and learners were conceptually separate. In MultiPaxos, every server plays all three roles. This makes the algorithm fully symmetric: there are no special nodes from the servers' perspective (clients are still separate).
Because every server is an acceptor, a server acting as proposer must send messages to
itself as well as to the other servers. In implementation, the self-message doesn't go
over the network; instead, the proposer performs the acceptor's logic locally. But
conceptually, whenever a proposer sends 1a or 2a, it includes itself in the broadcast.
One practical consequence: with three servers, a proposer needs a response from only one other server to form a majority of two, because it counts its own acceptance. This is different from the three-acceptor single-decree setting with a separate proposer, where you needed responses from two out of three acceptors.
Ballot numbers are pairs
In single-decree Paxos, ballot numbers had to be unique across proposers. In MultiPaxos,
this is enforced structurally: every ballot is a pair (seq_num, server_id), ordered
lexicographically. The sequence number is compared first; server ID breaks ties.
Examples:
- (1, S_2) < (4, S_1) because 1 < 4
- (1, S_2) > (1, S_1) because S_2 > S_1 (assuming S_2 is ranked higher)
Any server can pick a ballot number higher than any it has seen by incrementing its sequence number. Globally unique ballots are guaranteed without coordination.
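Representing ballots as pairs is easy to sketch; for instance, Python tuples already compare lexicographically, which matches the ordering above (server IDs shown as plain integers, S_1 = 1, S_2 = 2, and so on, purely for illustration):

```python
# Ballots as (seq_num, server_id) pairs. Tuple comparison is lexicographic:
# sequence number first, server ID as the tie-breaker.

b1 = (1, 2)   # (1, S_2)
b2 = (4, 1)   # (4, S_1)
b3 = (1, 1)   # (1, S_1)

assert b1 < b2      # 1 < 4, so the sequence number decides
assert b1 > b3      # equal sequence numbers, S_2 > S_1 breaks the tie

def next_ballot(highest_seen, my_id):
    """Pick a ballot strictly higher than any seen so far, with no coordination."""
    seq, _ = highest_seen
    return (seq + 1, my_id)

assert next_ballot(b2, 2) > b2   # (5, 2) > (4, 1)
```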
Slot numbers are added to all messages
In unoptimized MultiPaxos, all four message types carry a slot field to identify which independent Paxos instance they belong to:
- 1a(slot, ballot) — prepare for a specific slot
- 1b(slot, ballot, summary) — promise in a specific slot, with the highest prior vote
- 2a(slot, ballot, value) — accept request for a specific slot
- 2b(slot, ballot) — accept acknowledgment (value omitted since the proposer, being also a learner, already knows what it proposed)
The 2b message omits the value because every server plays the learner role and the
proposer knows the value it proposed. Including the slot and ballot number is sufficient.
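As an illustration only (the class and field names here are made up, not taken from any particular implementation), the four per-slot message types might be represented like this:

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]   # (seq_num, server_id)

@dataclass
class Prepare:      # 1a: prepare for one specific slot
    slot: int
    ballot: Ballot

@dataclass
class Promise:      # 1b: promise for that slot, plus the highest prior vote there
    slot: int
    ballot: Ballot
    voted: Optional[Tuple[Ballot, Any]]   # None if the acceptor never voted in this slot

@dataclass
class Accept:       # 2a: accept request carrying the proposed value
    slot: int
    ballot: Ballot
    value: Any

@dataclass
class Accepted:     # 2b: acknowledgment; the value is omitted, the proposer already knows it
    slot: int
    ballot: Ballot
```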
Unoptimized MultiPaxos
The simplest correct version of MultiPaxos just runs an independent copy of single-decree Paxos per slot:
- Clients broadcast requests to all servers.
- A server that receives a request finds an empty slot and runs two-phase Paxos in that slot.
- Once the request's slot and all preceding slots are decided, the server executes the operations and responds to the client.
This works, but has two inefficiencies:
- Two round trips per request. Every request requires a phase-1 exchange (1a/1b) and then a phase-2 exchange (2a/2b) before the server can respond to the client.
- Contention from multiple proposers. Since every server receives the client broadcast and every server is a proposer, multiple servers will simultaneously try to fill the same slot. Only one will win (the one with the highest ballot), and the others have wasted work.
Distinguished Proposer Optimization (Leader Election)
The key optimization is to designate a single leader who does all the proposing, and to run phase 1 once across all slots rather than once per slot.
Phase 1 runs once, across all slots
In unoptimized MultiPaxos, each slot has its own independent phase 1. The optimization
merges them: the 1a message drops the slot field entirely. A single 1a(ballot) now
represents, conceptually, a phase-1 prepare sent simultaneously in every slot.
When an acceptor receives this slot-free 1a, it responds with a 1b that summarizes its
voting history across all slots, as a list mapping each slot to the highest (ballot, value)
it has voted for there. In the common case of a fresh cluster, this list is empty.
Once a server collects a majority of 1b responses to this combined 1a, it has completed
phase 1 for all slots at once. It is now the leader for this ballot.
Because the 1a covers all slots, the ballot number is now unified across all slots. In
the unoptimized protocol, slots were independent and each could start fresh with ballot 1.
In the optimized protocol, ballot numbers are shared across slots: electing a new leader requires a strictly higher ballot than any previously used.
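A small sketch of the acceptor side of this combined phase 1, assuming a hypothetical on_prepare handler and a votes map from slot to the highest (ballot, value) this acceptor has voted for there:

```python
# Sketch of an acceptor answering a slot-free 1a in optimized MultiPaxos.
# All names are illustrative; ballots are (seq_num, server_id) pairs as above.

def on_prepare(ballot, promised_ballot, votes):
    """Return (new_promised_ballot, 1b reply or None)."""
    if ballot <= promised_ballot:
        return promised_ballot, None            # ignore stale prepares
    # Promise this ballot for *every* slot at once and summarize past votes.
    reply = {"ballot": ballot, "summary": dict(votes)}
    return ballot, reply

# Fresh acceptor: the summary is empty, the common case on a new cluster.
promised, reply = on_prepare((1, 2), (0, 0), {})
assert reply["summary"] == {}

# Acceptor that already voted in slot 3: the 1b reports that vote.
votes = {3: ((1, 1), "put(x,1)")}
promised, reply = on_prepare((2, 3), (1, 1), votes)
assert reply["summary"] == votes
```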
Phase 2 is unchanged
Phase 2 (2a/2b) still operates per-slot. When a client request arrives, the leader
just runs phase 2 in the appropriate slot, without any phase-1 overhead. This reduces the
common-case latency from two round trips to one.
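A sketch of the leader's common-case path, with a hypothetical send_accept callback standing in for broadcasting the 2a to all servers:

```python
# Sketch of the leader's fast path after winning phase 1 for all slots.
# Names are illustrative, not from any lab codebase.

class Leader:
    def __init__(self, ballot, send_accept):
        self.ballot = ballot          # ballot won in the one-time phase 1
        self.next_slot = 0            # lowest slot the leader has not yet used
        self.send_accept = send_accept

    def on_client_request(self, op):
        """Assign the request a slot and run only phase 2: one round trip."""
        slot = self.next_slot
        self.next_slot += 1
        self.send_accept(slot, self.ballot, op)   # 2a(slot, ballot, op) to all servers
        return slot

# Usage: each request gets the next slot under the same ballot.
sent = []
leader = Leader(ballot=(1, 2), send_accept=lambda s, b, op: sent.append((s, b, op)))
leader.on_client_request("put(x,1)")    # goes to slot 0
leader.on_client_request("get(x)")      # goes to slot 1
assert sent == [(0, (1, 2), "put(x,1)"), (1, (1, 2), "get(x)")]
```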
The leader handles client requests until it fails or the network partitions. Followers (non-leader servers) receive client broadcasts and can safely ignore them while they believe the leader is alive.
Leader change with a higher ballot
When a follower suspects the leader has crashed, it picks a ballot number higher than any it
has seen, broadcasts 1a(new_ballot) to all servers, and attempts to become the new leader.
If it collects a majority of 1b responses, it succeeds and the old leader — if still
running — will find itself unable to get votes for its lower ballot and must step down.
A server that detects a higher ballot in any incoming message knows its own leadership is over. It can attempt to reclaim leadership by choosing an even higher ballot and re-running phase 1.
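A sketch of this takeover logic, reusing the (seq_num, server_id) ballot representation from above; broadcast_prepare is a hypothetical stand-in for sending the slot-free 1a to every server:

```python
# Sketch of leader-change logic; names are illustrative.

def suspect_leader_failed(highest_seen, my_id, broadcast_prepare):
    """Pick a strictly higher ballot and try to become leader."""
    new_ballot = (highest_seen[0] + 1, my_id)
    broadcast_prepare(new_ballot)          # 1a(new_ballot) to all servers (including self)
    return new_ballot

def on_message_ballot(msg_ballot, my_ballot, am_leader):
    """A higher ballot in any incoming message means this server's leadership is over."""
    if msg_ballot > my_ballot:
        return False                       # step down; may re-run phase 1 later
    return am_leader

# Usage: follower S_2 saw ballot (3, S_1), so it tries (4, S_2); the old leader steps down.
sent = []
new = suspect_leader_failed((3, 1), 2, sent.append)
assert new == (4, 2) and sent == [(4, 2)]
assert on_message_ballot((4, 2), (3, 1), am_leader=True) is False
```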
Heartbeats
Leaders send periodic heartbeat messages to followers to signal they are still alive.
When the system is busy, normal 2a messages serve as implicit heartbeats. When the system
is idle, the leader sends explicit heartbeats.
Each follower runs a timer. If the timer expires without hearing from the leader, the
follower assumes the leader has crashed and initiates a leader election by sending a new
1a with a higher ballot. The timer interval is a tunable parameter that trades off
detection latency against spurious leader changes.
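A minimal sketch of such a failure-detector timer (the 500 ms default is an arbitrary illustration, not a recommended setting):

```python
# Sketch of a follower's leader-failure detector; names and values are illustrative.

class FailureDetector:
    def __init__(self, timeout_ms=500):
        self.timeout_ms = timeout_ms
        self.remaining_ms = timeout_ms

    def on_heartbeat_or_2a(self):
        """Any message from the leader (explicit heartbeat or 2a) resets the timer."""
        self.remaining_ms = self.timeout_ms

    def on_tick(self, elapsed_ms):
        """Called periodically; returns True when the follower should start an election."""
        self.remaining_ms -= elapsed_ms
        return self.remaining_ms <= 0

# Usage: heartbeats keep resetting the timer; silence past the timeout triggers an election.
fd = FailureDetector(timeout_ms=500)
assert fd.on_tick(300) is False   # no message yet, but still within the timeout
fd.on_heartbeat_or_2a()           # heartbeat arrives, timer resets
assert fd.on_tick(300) is False
assert fd.on_tick(300) is True    # timeout expired: start a leader election
```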
What's Next
Monday's lecture will cover the remaining details needed for Lab 3: catching up lagging servers, log compaction, and other practical considerations.