CSEP551 Fall 2009 -- Programming Assignment #3

Implementing parts of Paxos

Out: Thursday November 24th, 2009
Due: Monday December 14th, 2009, before noon (PST)

Overview

In this assignment, you will implement a simple version of the Paxos algorithm. Paxos allows you to decide upon the value of named variables in spite of fail-stop failures of participants.

At a high level, you should implement this assignment in two steps. Only the first step is required; the second is an optional bonus part for extra credit:

(required) implement the Paxos synod algorithm (i.e., consensus over a single variable).
(optional, bonus) implement the Paxos state machine algorithm (i.e., consensus over an ordered series of variables) using the Paxos synod algorithm as a building block.

You should feel free to implement this assignment in any language you like, on any operating system you like. Feel free to take advantage of libraries that help you read and write to stable storage, or communicate over the network, but you cannot use an existing Paxos implementation: you must build your own.

Step 1 (required).

Implement the Paxos synod algorithm (consensus over a single variable).

At its heart, the Paxos synod algorithm is extremely simple; its goal is to come to consensus on the value of a single variable. The algorithm is succinctly described at the bottom of page 5 and top of page 6 in "Paxos Made Simple," which we read in class. For this assignment, each server replica will contain each of the three agents described in the paper (proposer, acceptor, and learner). We will not bother with distinguished learners or proposers, and we'll leave it up to clients to decide which replica to connect to.

The acceptor agent in each replica must durably store and atomically update two pieces of state:

the highest numbered proposal that it has ever accepted (where a proposal contains both a proposal number n and a value v), and
the number of the highest-numbered prepare request to which it has responded.
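
To make the two pieces of state concrete, here is a minimal in-memory sketch of an acceptor in Python. The class name, the method names (on_prepare, on_accept), and the use of -1 to mean "nothing yet" are assumptions made for illustration, not part of the assignment. Persistence is left out on purpose; a real acceptor must write its state to stable storage before replying.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    """In-memory acceptor state; durable storage is omitted in this sketch."""
    promised_n: int = -1              # highest prepare number responded to
    accepted_n: int = -1              # number of the highest accepted proposal
    accepted_v: Optional[str] = None  # value of the highest accepted proposal

    def on_prepare(self, n: int) -> Tuple[bool, int, Optional[str]]:
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n       # must be made durable before replying
            return True, self.accepted_n, self.accepted_v
        return False, self.accepted_n, self.accepted_v

    def on_accept(self, n: int, v: str) -> bool:
        """Phase 2: accept unless a higher-numbered prepare was answered."""
        if n >= self.promised_n:
            self.accepted_n, self.accepted_v = n, v  # also must be durable
            return True
        return False
```

A reply of (True, -1, None) from on_prepare means "promise granted, nothing accepted so far."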

A proposer agent on a replica interacts with the set of acceptor agents on all replicas using the two-phase algorithm described in the paper.
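
One round of the two-phase algorithm might look roughly like the sketch below. It assumes each acceptor is reachable through two callables (a hypothetical stand-in for your internal RPCs), that an unreachable replica raises ConnectionError, and that -1 means "no proposal accepted yet." The helper next_proposal_number shows one common way to keep proposal numbers distinct across replicas; you are free to choose another.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical transport: one prepare and one accept callable per acceptor.
PrepareFn = Callable[[int], Tuple[bool, int, Optional[str]]]
AcceptFn = Callable[[int, str], bool]


def next_proposal_number(round_no: int, replica_id: int, num_replicas: int) -> int:
    """Replica i only ever uses numbers congruent to i modulo num_replicas."""
    return round_no * num_replicas + replica_id


def run_round(n: int, value: str,
              prepares: List[PrepareFn],
              accepts: List[AcceptFn]) -> Optional[str]:
    """Run one synod round; return the value a majority accepted, else None."""
    majority = len(prepares) // 2 + 1

    # Phase 1: collect promises from the acceptors that are reachable.
    promises = []
    for prepare in prepares:
        try:
            ok, acc_n, acc_v = prepare(n)
        except ConnectionError:
            continue                  # treat a crashed replica as silent
        if ok:
            promises.append((acc_n, acc_v))
    if len(promises) < majority:
        return None

    # Adopt the value of the highest-numbered proposal reported, if any.
    best_n, best_v = max(promises, key=lambda p: p[0])
    to_send = best_v if best_n >= 0 else value

    # Phase 2: ask the acceptors to accept (n, to_send).
    acks = 0
    for accept in accepts:
        try:
            if accept(n, to_send):
                acks += 1
        except ConnectionError:
            continue
    return to_send if acks >= majority else None
```

Note that the round may end up proposing a value other than the caller's: once some value may already have been chosen, the proposer has to carry it forward.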

A learner agent on a replica must scavenge the state of acceptor agents on all replicas in order to decide whether a value has been chosen. A simple strategy we suggest you follow is to have acceptor agents send a message to all learner agents whenever the acceptor accepts a proposal; this way, all learners will learn about chosen values soon after they are chosen. However, since replicas might crash, you also need to implement functionality in which a learner queries acceptors on demand to discover whether or not a value has been chosen.
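
The on-demand query can be reduced to a small counting rule, sketched below. It assumes the learner has collected, from each reachable acceptor, the highest-numbered proposal (n, v) that acceptor has accepted, or None if it has accepted nothing; the function name decide is hypothetical. The sketch counts identical proposals rather than identical values, because the same value under different proposal numbers does not by itself show that a majority accepted one proposal. The rule is conservative: it may answer DONT_KNOW even when a value has in fact been chosen.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

DONT_KNOW = "DONT_KNOW"


def decide(reports: Iterable[Optional[Tuple[int, str]]], num_replicas: int) -> str:
    """Return the chosen value if a majority report the same (n, v)."""
    majority = num_replicas // 2 + 1
    # Count how many acceptors reported each exact proposal.
    counts = Counter(r for r in reports if r is not None)
    for (n, v), count in counts.items():
        if count >= majority:
            return v
    return DONT_KNOW
```

With three replicas, two matching reports are enough, so the learner can still answer when one replica is down.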

Each replica must export an RPC interface to the external world containing two methods:

propose(value);

If a replica receives this RPC message, its proposer agent initiates a round of the synod algorithm to propose the value specified as an argument. This round might succeed, in which case the value is chosen, or it might fail, in which case either no value is chosen or some other value is chosen. The caller doesn't learn the outcome from this call.

outcome = learn();

If a replica receives this RPC message, it attempts to learn whether a value has been chosen yet, and if so, what that value is. The replica will return one of two responses to the client: (a) DONT_KNOW if the learner cannot determine that a value has been chosen, either because of failures or because no value has been chosen yet, or (b) the value that has been chosen.

Any client can connect to any replica to issue either of these methods. It's up to clients to decide which replica to connect to, but you should build your implementation of the synod algorithm to handle these method calls correctly on any replica.

To implement these methods, you will need to build some internal RPC messages between the replicas to implement the synod algorithm itself. As well, you will need to implement some form of durable storage in the acceptor agent to store the two pieces of state mentioned above. (Strictly speaking, you need to update these two pieces of state atomically, requiring something like write-ahead logging. However, for the purpose of this assignment, if you so choose, you can update the state non-atomically using file system updates, and ignore the fact that a crash at the wrong time can break the system.) Finally, after a crash, while restarting a replica you'll need to read from this state in order to recover the in-memory state of your acceptor.
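
If you want something between full write-ahead logging and plain non-atomic file updates, one common alternative is to write the whole state to a temporary file and then rename it over the old one. The sketch below takes that approach; it is not write-ahead logging. The file layout (a small JSON object), the function names, and the -1 defaults are assumptions for illustration. It also skips syncing the containing directory, which a stricter implementation would do.

```python
import json
import os
import tempfile


def save_state(path, promised_n, accepted_n, accepted_v):
    """Write both pieces of acceptor state so they are replaced together."""
    state = {"promised_n": promised_n,
             "accepted_n": accepted_n,
             "accepted_v": accepted_v}
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)  # same directory as the target
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())              # force the bytes to disk
        os.replace(tmp, path)                 # swap in the new file in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path):
    """Recover state after a restart; a missing file means a fresh acceptor."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"promised_n": -1, "accepted_n": -1, "accepted_v": None}
```

On restart, a replica would call load_state before answering any prepare or accept request.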

You should implement the system assuming three replicas; the set of replicas is decided upon at the beginning of time, and is known to all replicas in the system. We won't attempt to implement any form of group membership change -- i.e., the set of replicas is static. In practice, we recommend that you pass the list of replicas as a command-line argument when starting up a replica.
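
As one possible shape for that command line, the sketch below parses a replica's own index and a comma-separated list of host:port pairs. The flag names --id and --replicas are made up for this example; use whatever convention suits your implementation.

```python
import argparse


def parse_args(argv=None):
    """Return (my_id, peers) where peers is a list of (host, port) tuples."""
    parser = argparse.ArgumentParser(description="Paxos replica (sketch)")
    parser.add_argument("--id", type=int, required=True,
                        help="index of this replica in the --replicas list")
    parser.add_argument("--replicas", required=True,
                        help="comma-separated host:port list, same on every replica")
    args = parser.parse_args(argv)

    peers = []
    for item in args.replicas.split(","):
        host, port = item.rsplit(":", 1)
        peers.append((host, int(port)))
    if not 0 <= args.id < len(peers):
        parser.error("--id must index into --replicas")
    return args.id, peers
```

Every replica would then be started with the same --replicas list and a different --id.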

Feel free to make your replica implementation effectively single threaded. In other words, even though many clients might connect to a replica to issue external RPCs concurrently, you can use a single lock to only allow a single externally exposed method to be processed at a time. This will simplify your logic, but obviously will slow down the implementation. Note, of course, that different clients can connect to different replicas.
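
A minimal sketch of the single-lock approach follows, assuming your RPC library dispatches each incoming call on its own thread. The decorator name serialized is hypothetical.

```python
import functools
import threading

_rpc_lock = threading.Lock()  # one lock shared by every external method


def serialized(fn):
    """Allow only one externally exposed method to run at a time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with _rpc_lock:
            return fn(*args, **kwargs)
    return wrapper
```

You would apply @serialized to the handlers for propose() and learn(). Be careful not to call one decorated handler from inside another, since a plain Lock is not reentrant.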

Deliverables for part 1:

the source code to your implementation
a (short) design document describing your design decisions and challenging/interesting aspects of your implementation, including the internal RPC interface that you decided on between your replicas
a client program that you can use to connect to any replica and issue either a propose() or learn() RPC
a test script that uses the client program to drive your implementation, and a terminal log or printout showing you running the script to exercise your implementation.

Step 2 (bonus).

Implement the Paxos state machine algorithm (consensus over an ordered series of variables).

Now that you have the synod algorithm working, you can use it as a building block to implement the paxos state machine algorithm. The paxos state machine algorithm uses the synod algorithm to build the abstraction of an ordered set of commands chosen by each replica; a command is just a value in the synod sense, and each command (command number 0, command number 1, command number 2, ...) is chosen by executing an instance of the synod protocol for that command number. The abstraction exposed by the state machine algorithm to clients is a method that allows a client to add a command to the sequence of chosen commands, and to receive back the sequence number of the command once it has been chosen.

To do this, you need to modify your replica so that it can run many concurrent instances of the synod algorithm, with each instance named by its instance number (number 0, number 1, number 2, ...). This, in turn, implies you'll need to modify your RPCs and the way you store state so that the instance number is passed along with the RPCs and stored along with the state.
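
One straightforward way to do this is to key the acceptor state by instance number, as in the sketch below. The class and method names are assumptions, and durable storage is again omitted; on disk you would store the instance number alongside each record.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class InstanceState:
    promised_n: int = -1
    accepted_n: int = -1
    accepted_v: Optional[str] = None


class MultiAcceptor:
    """Acceptor state for many synod instances, keyed by instance number."""

    def __init__(self) -> None:
        self.instances: Dict[int, InstanceState] = defaultdict(InstanceState)

    def on_prepare(self, instance: int, n: int) -> Tuple[bool, int, Optional[str]]:
        st = self.instances[instance]      # created on first use
        if n > st.promised_n:
            st.promised_n = n
            return True, st.accepted_n, st.accepted_v
        return False, st.accepted_n, st.accepted_v

    def on_accept(self, instance: int, n: int, v: str) -> bool:
        st = self.instances[instance]
        if n >= st.promised_n:
            st.accepted_n, st.accepted_v = n, v
            return True
        return False
```

Each instance behaves exactly like the single-variable acceptor from step 1; the only change is the extra instance argument on every message.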

You should add two more external RPCs exposed by replicas to clients:

command_num = add_command(value);

The goal of this method is to add a new command with value "value" to the sequence of commands that the replicas have chosen. Once the command is successfully added, the command number is returned to the client. Note that this RPC will block until the command is chosen; if the client, or the replica that the client is talking to, fails, then the command might have been chosen or it might not have been chosen. You need to implement some form of retry if the client stays up and the replica it is talking to fails, but for the purposes of this assignment, don't worry about the corner case of the command being added more than once in certain failure/retry sequences.

{ (num, value) } = list_commands();

This method returns a list of commands that have been chosen by the replica set, along with each command's sequence number.

Section 3 of "Paxos Made Simple" describes how to implement these methods. In a nutshell, when a replica receives an add_command() RPC from a client, it must (a) learn what commands have already been chosen, and close any gaps in the sequence of commands chosen so far, and (b) propose a new command, and get it accepted, with a higher sequence number than all commands that have already been chosen. The only tricky part of this algorithm is figuring out how to implement the "execute phase 1 for infinitely many instances of the consensus algorithm" logic efficiently. It's not hard, but it will require you to return a set of responses from an acceptor in some cases, instead of a single response.
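
One possible way to approach the "infinitely many instances" logic is sketched below. It is an illustration under stated assumptions, not the required design. The acceptor keeps a single promise number that covers every instance, and answers a prepare with the set of proposals it has accepted at or after a starting instance. The names LogAcceptor, on_prepare_from, and gaps_and_next are hypothetical. The helper gaps_and_next shows how a replica might work out which command numbers still need to be closed and which number a new command should take.

```python
from typing import Dict, Iterable, List, Tuple


class LogAcceptor:
    """Acceptor with one promise number shared by all instances."""

    def __init__(self) -> None:
        self.promised_n = -1
        self.accepted: Dict[int, Tuple[int, str]] = {}  # instance -> (n, v)

    def on_prepare_from(self, first_instance: int, n: int):
        """Phase 1 for every instance >= first_instance in one message."""
        if n <= self.promised_n:
            return False, {}
        self.promised_n = n
        tail = {i: nv for i, nv in self.accepted.items() if i >= first_instance}
        return True, tail  # a set of responses, not a single one

    def on_accept(self, instance: int, n: int, v: str) -> bool:
        if n < self.promised_n:
            return False
        self.accepted[instance] = (n, v)
        return True


def gaps_and_next(known: Iterable[int]) -> Tuple[List[int], int]:
    """Given instance numbers already seen, return (gaps, next free number)."""
    seen = set(known)
    next_slot = max(seen) + 1 if seen else 0
    gaps = [i for i in range(next_slot) if i not in seen]
    return gaps, next_slot
```

For example, if commands 0, 1, and 3 are known, command 2 is a gap to close before a new command is proposed as number 4.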

Deliverables for part 2:

your source code
a short description of how you implemented add_command, including modifications you had to make to part 1, and how you handled the "execute phase 1 for infinitely many instances..." logic.
a client program that you can use to connect to any replica to issue an add_command() RPC or a list_commands() RPC.
a script that uses the client program to drive your implementation, and a log or printout showing you using the script.

Partnering up

You have the choice of doing this assignment solo, or in teams of two. You're responsible for self-forming teams of two if you'd like to do this with a partner. I'd recommend re-using your partner from assignment #2.

What to turn in

You should submit your assignment using the following "dropbox" URL:

https://catalysttools.washington.edu/collectit/dropbox/gribble/7604

Your submission should be a single .tar.gz or .zip file, containing the deliverables for part 1, and if you decided to attempt it, part 2.