CSE551 Winter 2012 -- Assignment #3

CSE551 Winter 2012 -- Assignment#3

Implement parts of Paxos

Out: Tuesday February 28th, 2012
Due: Friday March 16th, 2012, before 11:45pm PST

[ overview | part 1: paxos synod (required) | part 2: paxos state machine (optional, bonus) | what to turn in ]

Parts 1 and 2: In this assignment, you will implement a simple version of the Paxos algorithm. Paxos allows you to decide upon the value of named variables in spite of fail-stop failures of participants.

At a high-level, you should implement this assignment in two steps; only the first step is required, and the second is an optional, bonus part for extra credit if you choose:

(required) implement the paxos synod algorithm (i.e., consensus over a single variable).
(optional, bonus) implement the paxos state machine algorithm (i.e., consensus over an ordered series of variables) using the paxos synod algorithm as a building block.

You should feel free to implement this assignment in any language you like, as long as your assignment runs on Linux (in particular, on attu4.cs.washington.edu, as before). Feel free to take advantage of libraries that help you read and write to stable storage, or communicate over the network, but you cannot use an existing Paxos implementation: you must build your own.

Part 1 (required).

Implement the Paxos synod algorithm (consensus over a single variable).

At its heart, the Paxos synod algorithm is extremely simple; its goal is to come to consensus on the value of a single variable. The algorithm is succinctly described at the bottom of page 5 and top of page 6 in "Paxos Made Simple," which we read in class. For this assignment, each server replica will contain each of the three agents described in the paper (proposer, accepter, and learner). We will not bother with distinguished learners or proposers, and we'll leave it up to clients to decide which replica to connect to.

The acceptor agent in each replica must durably store and atomically update two pieces of state:

the highest numbered proposal that it has ever accepted (where a proposal contains both a proposal number n and a value v), and
the number of the highest-numbered prepare request to which it has responded.

A proposer agent on a replica interacts with the set of acceptor agents on all replicas using the two-phase algorithm described in the paper.

A learner agent on a replica must scavenge the state of acceptor agents on all replicas in order to decide whether a value has been chosen. A simple strategy we suggest you follow is to have acceptor agents send a message to all learner agents whenever the acceptor accepts a proposal; this way, all learners will learn about chosen values soon after they happen. However, since replicas might crash, you also need to implement functionality in which a learner queries acceptors on-demand to discover whether or not a variable has been chosen.

Each replica must export an RPC interface to the external world containing two methods:

propose(value);

if a replica receives this RPC message, then its proposer agent initiates a round of the synod algorithm to propose the value specified as an argument. This round might complete successfully and the value is chosen, or it might fail, in which case no value is chosen or some other value is chosen. The caller doesn't learn the outcome from this call. This is RPC must be non-blocking, in the sense that it should not block forever.
outcome = learn();

if a replica receives this RPC message, it attempts to learn whether a value has been chosen yet, and it so, what that value is. The replica will return one of two responses to the client: (a) DONT_KNOW if the learner cannot determine that a value has been chosen, either because of failures or because no value has been chosen yet, or (b) the value that has been chosen. This RPC is also non-blocking; it's fair for you to return DONT_KNOW after some timeout period, for example.

Any client can connect to any replica to issue either of these methods. It's up to clients to decide which replica to connect to, but you should build your implementation of the synod algorithm to handle these method calls correctly on any replica.

To implement these methods, you will need to build some internal RPC messages between the replicas to implement the synod algorithm itself. Feel free to use an existing RPC library for your language, rather than implementing one of your own.

As well, you will need to implement some form of durable storage in the acceptor agent to store the two pieces of state mentioned above. (Strictly speaking, you need to update these two pieces of state atomically, requiring something like write-ahead logging. However, for the purpose of this assignment, if you so choose, you can update the state non-atomically using file system updates, and ignore the fact that a crash at the wrong time can break the system. Finally, after a crash, while restarting a replica you'll need to read from this state in order to recover the in-memory state of your acceptor.

You should implement the system assuming three replicas; the set of replicas is decided upon at the beginning of time, and is known to all replicas in the system. We won't attempt to implement any form of group membership change -- i.e., the set of replicas is static. In practice, we recommend that you pass the list of replicas as a command-line argument when starting up a replica.

Feel free to make your replica implementation effectively single threaded. In other words, even though many clients might connect to a replica to issue external RPCs concurrently, you can use a single lock to only allow a single externally exposed method to be processed at a time. This will simplify your logic, but obviously will slow down the implementation. Note, of course, that different clients can connect to different replicas.

Deliverables for part 1:

the source code to your implementation; if you used external libraries, you must supply them to us, including simple instructions for how to compile and/or run your implementation on attu4.
a (short) design document describing your design decisions and challenging/interesting aspects of your implementation, including the internal RPC interface that you decided on between your replicas, and any additional information you think we need to know while grading
a client program that you can use to connect to any replica and issue either a propose() or learn() RPC
a test script (called part1_test; add whatever suffix you need for your test script's language of implementation) that uses the client program to drive your implementation, and a terminal log or printout showing you running the script to exercise your implementation. Since in the common case, Paxos completes in a single (uninteresting) round, you must introduce some uncommon cases into your test. The test may deterministically plan failures and interleavings of proposals or nondeterministically produce odd behavior. If you choose to do it nondeterministically, please ensure that it will produce interesting results. Keep in mind that your test will be a jumping off point for grading and the confidence it inspires will directly affect your grade!

Part 2 (bonus, optional).

Implement that paxos state machine algorithm (consensus over an ordered set of variables)

Now that you have the synod algorithm working, you can use it as a building block to implement the paxos state machine algorithm. The paxos state machine algorithm uses the synod algorithm to build the abstraction of an ordered set of commands chosen by each replica; a command is just a value in the synod sense, and each command (command number 0, command number 1, command number 2, ...) is chosen by executing an instance of the synod protocol for that command number. The abstraction exposed by the state machine algorithm to clients is a method that allows a client to add a command to the sequence of chosen commands, and to receive back the sequence number of the command once it has been chosen.

To do this, you need to modify your replica so that it can run many concurrent instances of the synod algorithm, with each instance named by its instance number (number 0, number 1, number 2, ...). This, in turn, implies you'll need to modify your RPCs and the way you store state so that the instance number is passed along with the RPCs and stored along with the state.

You should add two more external RPCs exposed by replicas to clients:

command_num = add_command(value);

the goal of this method is to add a new command with value "value" to the sequence of commands that the replicas have chosen. Once the command is successfully added, the command number is returned to the client. Note that this RPC will block until the command is chosen; if the client, or the replica that the client is talking to, fails, then the command might have been chosen or it might not have been chosen. You need to implement some form of retry if the client stays up and the replica it is talking to fails, but for the purposes of this assignment, don't worry about the corner case of the command being added more than once in certain failure/retry sequences.
{ (num, value) } = list_commands();

this method returns a list of commands that have been chosen by the replica set, along with each commands sequence number.

Section 3 of paxos made simple describes how to implement these methods. In a nutshell, when a replica receives an add_command() RPC from a client, it must (a) learn what commands have already been chosen, and close any gaps in the sequence of commands chosen so far, and (b) propose and get accepted a new command with higher sequence number than all commands that have already been chosen. The only tricky part of this algorithm is figuring out how to implement the "execute phase 1 for infinitely many instances of the consensus algorithm" logic efficiently. It's not hard, but it will require you to return a set of responses from an acceptor in some cases, instead of a single response.

Deliverables for part 2:

your source code, including libraries, as in part 1
a short description of how you implemented add_command, including modifications you had to make to part 1, and how you handled the "execute phase 1 for infinitely many instances..." logic.
a client program that you can use to connect to any replica to issue an add_command() RPC or a list_commands() RPC.
a script that uses the client program to drive your implementation, and a log or printout showing you using the script.

What to turn in

When you're ready to turn in your assignment, do the following:

Create a directory called "problemset3/". In it, you should have three things: a subdirectory called "part1/", an optional subdirectory called "part/2", and a README.TXT file. The subdirectories should, of course, contain the deliverables for part1, and if you did it the optional part2.
The README.TXT file should contain your name, student number, and UW email address, as well as instructions on how to launch your server. If you did this assignment in a team of two, submit the assignment once, but include both teammates names/etc. in the README.TXT file.
Create a submission tarball by running the following command, but replacing "UWEMAIL" with your email account name:
tar -cvzf problemset3_submission_UWEMAIL.tar.gz problemset3
For example, since my email account is "gribble", I would run the command:
tar -cvzf problemset3_submission_gribble.tar.gz problemset3
Use the course dropbox to submit that tarball.

As usual, we will be basing your grade on several elements:

Whether your code works!

How well structured your code is: you should have clean module interfaces, a nice decomposition, good comments, and so on.