CSE550 Fall 2011 -- Problem Set #3

CSE550 -- Problem Set #3

Implementing parts of Paxos, and answering a couple of questions

Out: Monday November 7th, 2011
Due: Monday November 28th, 2011, before 5pm (PST)

[ overview | part 1: paxos synod (required) | part 2: paxos state machine (optional, bonus) | part 3: questions (required) | what to turn in ]

Overview

Parts 1 and 2: In this assignment, you will implement a simple version of the Paxos algorithm. Paxos allows you to decide upon the value of named variables in spite of fail-stop failures of participants.

At a high-level, you should implement this assignment in two steps; only the first step is required, and the second is an optional, bonus part for extra credit if you choose:

(required) implement the paxos synod algorithm (i.e., consensus over a single variable).
(optional, bonus) implement the paxos state machine algorithm (i.e., consensus over an ordered series of variables) using the paxos synod algorithm as a building block.

You should feel free to implement this assignment in any language you like, on any operating system you like. Feel free to take advantage of libraries that help you read and write to stable storage, or communicate over the network, but you cannot use an existing Paxos implementation: you must build your own.

Part 3: As well, you will answer two short questions; no more than one or two paragraphs is required for each.

Part 1 (required).

Implement the Paxos synod algorithm (consensus over a single variable).

At its heart, the Paxos synod algorithm is extremely simple; its goal is to come to consensus on the value of a single variable. The algorithm is succinctly described at the bottom of page 5 and top of page 6 in "Paxos Made Simple," which we read in class. For this assignment, each server replica will contain each of the three agents described in the paper (proposer, accepter, and learner). We will not bother with distinguished learners or proposers, and we'll leave it up to clients to decide which replica to connect to.

The acceptor agent in each replica must durably store and atomically update two pieces of state:

the highest numbered proposal that it has ever accepted (where a proposal contains both a proposal number n and a value v), and
the number of the highest-numbered prepare request to which it has responded.

A proposer agent on a replica interacts with the set of acceptor agents on all replicas using the two-phase algorithm described in the paper.

A learner agent on a replica must scavenge the state of acceptor agents on all replicas in order to decide whether a value has been chosen. A simple strategy we suggest you follow is to have acceptor agents send a message to all learner agents whenever the acceptor accepts a proposal; this way, all learners will learn about chosen values soon after they happen. However, since replicas might crash, you also need to implement functionality in which a learner queries acceptors on-demand to discover whether or not a variable has been chosen.

Each replica must export an RPC interface to the external world containing two methods:

propose(value);

if a replica receives this RPC message, then its proposer agent initiates a round of the synod algorithm to propose the value specified as an argument. This round might complete successfully and the value is chosen, or it might fail, in which case no value is chosen or some other value is chosen. The caller doesn't learn the outcome from this call.
outcome = learn();

if a replica receives this RPC message, it attempts to learn whether a value has been chosen yet, and it so, what that value is. The replica will return one of two responses to the client: (a) DONT_KNOW if the learner cannot determine that a value has been chosen, either because of failures or because no value has been chosen yet, or (b) the value that has been chosen.

Any client can connect to any replica to issue either of these methods. It's up to clients to decide which replica to connect to, but you should build your implementation of the synod algorithm to handle these method calls correctly on any replica.

To implement these methods, you will need to build some internal RPC messages between the replicas to implement the synod algorithm itself. Feel free to use an existing RPC library for your language, rather than implementing one of your own.

As well, you will need to implement some form of durable storage in the acceptor agent to store the two pieces of state mentioned above. (Strictly speaking, you need to update these two pieces of state atomically, requiring something like write-ahead logging. However, for the purpose of this assignment, if you so choose, you can update the state non-atomically using file system updates, and ignore the fact that a crash at the wrong time can break the system. Or, you can use your WAL implementation from assignment 2!) Finally, after a crash, while restarting a replica you'll need to read from this state in order to recover the in-memory state of your acceptor.

You should implement the system assuming three replicas; the set of replicas is decided upon at the beginning of time, and is known to all replicas in the system. We won't attempt to implement any form of group membership change -- i.e., the set of replicas is static. In practice, we recommend that you pass the list of replicas as a command-line argument when starting up a replica.

Feel free to make your replica implementation effectively single threaded. In other words, even though many clients might connect to a replica to issue external RPCs concurrently, you can use a single lock to only allow a single externally exposed method to be processed at a time. This will simplify your logic, but obviously will slow down the implementation. Note, of course, that different clients can connect to different replicas.

Deliverables for part 1:

the source code to your implementation
a (short) design document describing your design decisions and challenging/interesting aspects of your implementation, including the internal RPC interface that you decided on between your replicas, and any additional information you think we need to know while grading
a client program that you can use to connect to any replica and issue either a propose() or learn() RPC
a test script (called part1_test; add whatever suffix you need for your test script's language of implementation) that uses the client program to drive your implementation, and a terminal log or printout showing you running the script to exercise your implementation.

Part 2 (bonus, optional).

Implement that paxos state machine algorithm (consensus over an ordered set of variables)

Now that you have the synod algorithm working, you can use it as a building block to implement the paxos state machine algorithm. The paxos state machine algorithm uses the synod algorithm to build the abstraction of an ordered set of commands chosen by each replica; a command is just a value in the synod sense, and each command (command number 0, command number 1, command number 2, ...) is chosen by executing an instance of the synod protocol for that command number. The abstraction exposed by the state machine algorithm to clients is a method that allows a client to add a command to the sequence of chosen commands, and to receive back the sequence number of the command once it has been chosen.

To do this, you need to modify your replica so that it can run many concurrent instances of the synod algorithm, with each instance named by its instance number (number 0, number 1, number 2, ...). This, in turn, implies you'll need to modify your RPCs and the way you store state so that the instance number is passed along with the RPCs and stored along with the state.

You should add two more external RPCs exposed by replicas to clients:

command_num = add_command(value);

the goal of this method is to add a new command with value "value" to the sequence of commands that the replicas have chosen. Once the command is successfully added, the command number is returned to the client. Note that this RPC will block until the command is chosen; if the client, or the replica that the client is talking to, fails, then the command might have been chosen or it might not have been chosen. You need to implement some form of retry if the client stays up and the replica it is talking to fails, but for the purposes of this assignment, don't worry about the corner case of the command being added more than once in certain failure/retry sequences.
{ (num, value) } = list_commands();

this method returns a list of commands that have been chosen by the replica set, along with each commands sequence number.

Section 3 of paxos made simple describes how to implement these methods. In a nutshell, when a replica receives an add_command() RPC from a client, it must (a) learn what commands have already been chosen, and close any gaps in the sequence of commands chosen so far, and (b) propose and get accepted a new command with higher sequence number than all commands that have already been chosen. The only tricky part of this algorithm is figuring out how to implement the "execute phase 1 for infinitely many instances of the consensus algorithm" logic efficiently. It's not hard, but it will require you to return a set of responses from an acceptor in some cases, instead of a single response.

Deliverables for part 2:

your source code
a short description of how you implemented add_command, including modifications you had to make to part 1, and how you handled the "execute phase 1 for infinitely many instances..." logic.
a client program that you can use to connect to any replica to issue an add_command() RPC or a list_commands() RPC.
a script that uses the client program to drive your implementation, and a log or printout showing you using the script.

Part 3: Questions (required)

For part 3, you should answer the following questions, essay-style. Limit your answers to just two or three paragraphs; no more is required.

A virtual machine monitor virtualizes the hardware/software interface of a computer. It is possible to provide similar benefits as a VMM, but at different layers of abstraction in a computer. For example:
- transmeta's "code-morphing software" would translate, on-the-fly, x86 instructions into a lower-level, VLIW format. In principle, code-morphing could also virtualize;
- microkernels provide a hardware abstraction layer (HAL) that, to some degree, isolates higher-level parts of an OS from specific architectural details, enhancing portability;
- system call emulation layers or compatibility libraries, such as WINE for windows provide ABI compatibility and the ability to run apps from one platform on another;
- systems like Java, or .NET, that use bytecode provide an abstract virtual machine on which bytecode runs, and use JIT to compile on-the-fly, permitting multiple JVMs to run concurrently and the promise of portability across physical machines.
Pick one of these (or a different one of your choice, but that you think belongs in the set of systems/mechanisms discussed above) and compare and contrast it with VMMs like VMware and Xen. Your comparison should identify several basis on which it is important to focus (e.g., performance), and then make a very concise comparison for each basis.
Imagine you work for a company like Google, Facebook, or Microsoft, and you are tasked with designing a storage system that spans multiple data centers, provides strong data consistency and availability even in the face of machine, rack, or data center failure, and provides a nice API to programmers. Sketch out what you would propose, and for each of the general requirements listed in the previous sentence, (a) why it satisfies it or (b) why you think attempting to satisfy it is a design mistake. Since you're limited to just a few paragraphs, your sketch will be very high-level and rough, of course -- focus only on the most crucial issues.

Deliverables for part 3:

a file called "part3.txt" that contains your name, your answers to the two questions, and a list of references of material or people you relied on while coming up with your answers. Remember, target no more than two or three paragraphs per question.

What to turn in

When you're ready to turn in your assignment, do the following:

Create a directory called "problemset3/". In it, you should have three things: a subdirectory called "part1/", an optional subdirectory called "part/2", a file called "part3.txt", and a README.TXT file. The subdirectories should, of course, contain the deliverables for part1, and if you did it the optional part2; the file (part3.txt) should contain your answers to the two questions from part 3.
The README.TXT file should contain your name, student number, and UW email address, as well as instructions on how to launch your server.
Create a submission tarball by running the following command, but replacing "UWEMAIL" with your email account name:
tar -cvzf problemset3_submission_UWEMAIL.tar.gz problemset3
For example, since my email account is "gribble", I would run the command:
tar -cvzf problemset3_submission_gribble.tar.gz problemset3
Use the course dropbox (there is a link on the course homepage) to submit that tarball.

As usual, we will be basing your grade on several elements:

Whether your code works!

How well structured your code is: you should have clean module interfaces, a nice decomposition, good comments, and so on.

The quality of your answers to part 3.