CSE550 -- Problem Set #3

Implementing parts of Paxos, and answering a couple of questions

Out: Monday November 7th, 2011
Due: Monday November 28th, 2011, before 5pm (PST)

[ overview | part 1: paxos synod (required) | part 2: paxos state machine (optional, bonus) | part 3: questions (required) | what to turn in ]

Overview

Parts 1 and 2: In this assignment, you will implement a simple version of the Paxos algorithm. Paxos allows you to decide upon the value of named variables in spite of fail-stop failures of participants.

At a high level, you should implement this assignment in two steps; only the first step is required, and the second is an optional bonus for extra credit:

  1. (required) implement the Paxos synod algorithm (i.e., consensus over a single variable).

  2. (optional, bonus) implement the Paxos state machine algorithm (i.e., consensus over an ordered series of variables), using the synod algorithm as a building block.

You should feel free to implement this assignment in any language you like, on any operating system you like. Feel free to take advantage of libraries that help you read and write to stable storage, or communicate over the network, but you cannot use an existing Paxos implementation: you must build your own.

Part 3: In addition, you will answer two short questions; no more than one or two paragraphs is required for each.

Part 1 (required).

Implement the Paxos synod algorithm (consensus over a single variable).

At its heart, the Paxos synod algorithm is extremely simple; its goal is to come to consensus on the value of a single variable. The algorithm is succinctly described at the bottom of page 5 and the top of page 6 of "Paxos Made Simple," which we read in class. For this assignment, each server replica will contain each of the three agents described in the paper (proposer, acceptor, and learner). We will not bother with distinguished learners or proposers, and we'll leave it up to clients to decide which replica to connect to.

The acceptor agent in each replica must durably store and atomically update two pieces of state:

  • the number of the highest-numbered prepare request to which it has responded, and

  • the highest-numbered proposal it has ever accepted (both its proposal number and its value), if any.
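
For concreteness, here is a minimal sketch of this state, using Python as an example language (the field names are illustrative, not required by the assignment):

    # Durable state held by the acceptor agent (illustrative field names).
    class AcceptorState:
        def __init__(self):
            self.promised_n = -1        # highest prepare number responded to
            self.accepted_n = -1        # number of the highest accepted proposal
            self.accepted_value = None  # value of that proposal, if any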

A proposer agent on a replica interacts with the set of acceptor agents on all replicas using the two-phase algorithm described in the paper.
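
A rough sketch of those two phases follows, assuming hypothetical send_prepare() and send_accept() helpers that perform the actual RPC to a single acceptor (returning None/False on failure or rejection):

    # Sketch of one proposer round; send_prepare()/send_accept() are assumed
    # helpers that RPC to a single acceptor.
    def propose(value, acceptors, n):
        # Phase 1: ask all acceptors to promise not to accept proposals < n.
        promises = [r for r in (send_prepare(a, n) for a in acceptors)
                    if r is not None]
        if len(promises) <= len(acceptors) // 2:
            return None  # no majority; retry later with a higher n
        # If any acceptor already accepted a proposal, adopt the value of the
        # highest-numbered accepted proposal instead of our own.
        prior = [p for p in promises if p.accepted_n >= 0]
        if prior:
            value = max(prior, key=lambda p: p.accepted_n).accepted_value
        # Phase 2: ask all acceptors to accept proposal (n, value).
        acks = sum(1 for a in acceptors if send_accept(a, n, value))
        return value if acks > len(acceptors) // 2 else None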

A learner agent on a replica must scavenge the state of the acceptor agents on all replicas in order to decide whether a value has been chosen. A simple strategy we suggest is to have each acceptor send a message to all learner agents whenever it accepts a proposal; this way, all learners find out about chosen values soon after they are chosen. However, since replicas might crash, you also need to implement functionality in which a learner queries the acceptors on demand to discover whether or not a value has been chosen.
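
For the on-demand path, one possible shape (again in Python; query_accepted() is an assumed RPC to a single acceptor):

    # Sketch of an on-demand learner query: a value is known to be chosen
    # once a majority of acceptors report accepting the same proposal.
    from collections import Counter

    def query_chosen(acceptors):
        counts = Counter()
        for a in acceptors:
            r = query_accepted(a)  # assumed RPC; may fail if a is down
            if r is not None and r.accepted_n >= 0:
                counts[(r.accepted_n, r.accepted_value)] += 1
        for (n, value), votes in counts.items():
            if votes > len(acceptors) // 2:
                return value       # chosen
        return None                # not (yet) known to be chosen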

Each replica must export an RPC interface to the external world containing two methods:

  • a method that lets a client propose a value for the variable, and

  • a method that lets a client ask whether a value has been chosen (and, if so, what that value is).

Any client can connect to any replica to invoke either of these methods. It's up to clients to decide which replica to connect to, but your implementation of the synod algorithm should handle these method calls correctly on any replica.
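
For example, with Python's built-in xmlrpc library, the external interface might be exposed like this (the method names here are placeholders; the choice is yours):

    # Illustrative replica skeleton; the RPC method names are placeholders.
    from xmlrpc.server import SimpleXMLRPCServer

    class Replica:
        def propose(self, value):
            ...  # run the synod protocol; return the value actually chosen

        def get_chosen(self):
            ...  # ask the learner whether a value has been chosen

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_instance(Replica())
    server.serve_forever()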

To implement these methods, you will need to build some internal RPC messages between the replicas to implement the synod algorithm itself. Feel free to use an existing RPC library for your language, rather than implementing one of your own.
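
For reference, one possible set of internal message shapes (illustrative; any encoding your RPC library supports is fine):

    # One possible set of internal messages between replicas (illustrative).
    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class Prepare:   # proposer -> acceptor, phase 1 request
        n: int

    @dataclass
    class Promise:   # acceptor -> proposer, phase 1 response
        promised_n: int
        accepted_n: int
        accepted_value: Any

    @dataclass
    class Accept:    # proposer -> acceptor, phase 2 request
        n: int
        value: Any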

In addition, you will need to implement some form of durable storage in the acceptor agent to store the two pieces of state mentioned above. (Strictly speaking, you need to update these two pieces of state atomically, which requires something like write-ahead logging. However, for the purposes of this assignment, you may instead update the state non-atomically using file system updates, ignoring the fact that a crash at the wrong time can break the system. Or, you can use your WAL implementation from assignment 2!) Finally, when restarting a replica after a crash, you'll need to read this state in order to recover the in-memory state of your acceptor.
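
If you take the simple file-system route, one way to make each whole-file update atomic anyway is the usual write-then-rename trick, sketched here in Python:

    # Sketch of durable storage for the acceptor's two pieces of state.
    # Writing a temp file and renaming it over the old one makes the
    # whole-file update atomic on POSIX file systems.
    import json, os

    def save_state(path, st):
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"promised_n": st.promised_n,
                       "accepted_n": st.accepted_n,
                       "accepted_value": st.accepted_value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic replacement

    def load_state(path, st):
        # Run at replica startup to recover the acceptor's in-memory state.
        if os.path.exists(path):
            with open(path) as f:
                d = json.load(f)
            st.promised_n = d["promised_n"]
            st.accepted_n = d["accepted_n"]
            st.accepted_value = d["accepted_value"]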

You should implement the system assuming three replicas; the set of replicas is decided upon at the beginning of time, and is known to all replicas in the system. We won't attempt to implement any form of group membership change -- i.e., the set of replicas is static. In practice, we recommend that you pass the list of replicas as a command-line argument when starting up a replica.
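
For instance, a replica might be started with the static replica list on its command line (the exact invocation format is up to you):

    # Illustrative startup, e.g.:
    #   python replica.py 0 host1:8001 host2:8002 host3:8003
    import sys

    my_index = int(sys.argv[1])   # which entry in the list is this replica
    replicas = sys.argv[2:]       # the full, static set of replicas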

Feel free to make your replica implementation effectively single threaded. In other words, even though many clients might connect to a replica to issue external RPCs concurrently, you can use a single lock to only allow a single externally exposed method to be processed at a time. This will simplify your logic, but obviously will slow down the implementation. Note, of course, that different clients can connect to different replicas.
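
A sketch of that single-lock approach (run_synod() here stands in for whatever your protocol entry point is):

    # Illustrative "one big lock" around externally exposed methods.
    import threading

    class Replica:
        def __init__(self):
            self.big_lock = threading.Lock()

        def propose(self, value):
            with self.big_lock:  # serialize external RPCs on this replica
                return self.run_synod(value)  # assumed protocol entry point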

Deliverables for part 1:

Part 2 (bonus, optional).

Implement the Paxos state machine algorithm (consensus over an ordered set of variables).

Now that you have the synod algorithm working, you can use it as a building block to implement the Paxos state machine algorithm. The state machine algorithm uses the synod algorithm to build the abstraction of an ordered sequence of commands agreed upon by all replicas; a command is just a value in the synod sense, and each command (command number 0, command number 1, command number 2, ...) is chosen by executing an instance of the synod protocol for that command number. The abstraction the state machine algorithm exposes to clients is a method that allows a client to add a command to the sequence of chosen commands and to receive back the sequence number of the command once it has been chosen.

To do this, you need to modify your replica so that it can run many concurrent instances of the synod algorithm, with each instance named by its instance number (number 0, number 1, number 2, ...). This, in turn, implies you'll need to modify your RPCs and the way you store state so that the instance number is passed along with the RPCs and stored along with the state.
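
One simple way to organize this, reusing the AcceptorState sketch from part 1:

    # Sketch: per-instance acceptor state, keyed by instance number.
    # Every internal RPC now carries the instance number as an argument.
    class MultiReplica:
        def __init__(self):
            self.instances = {}   # instance number -> AcceptorState

        def state_for(self, i):
            return self.instances.setdefault(i, AcceptorState())

        def on_prepare(self, i, n):
            # Same phase-1 logic as before, applied to instance i's state.
            return self.handle_prepare(self.state_for(i), n)  # assumed helper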

You should add two more external RPCs exposed by replicas to clients:

  • add_command(), which adds a command to the sequence of chosen commands and returns the sequence number at which it was chosen, and

  • a method that lets a client read back the command chosen at a given sequence number.

Section 3 of "Paxos Made Simple" describes how to implement these methods. In a nutshell, when a replica receives an add_command() RPC from a client, it must (a) learn what commands have already been chosen, closing any gaps in the sequence of commands chosen so far, and (b) propose a new command with a higher sequence number than all commands that have already been chosen, and get it accepted. The only tricky part of this algorithm is figuring out how to implement the "execute phase 1 for infinitely many instances of the consensus algorithm" logic efficiently. It's not hard, but it will require an acceptor to return a set of responses in some cases, instead of a single response.
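
One way to handle that (sketched under the same assumptions as the earlier Python sketches): a prepare names a starting instance, the acceptor replies for every instance at or above it in which it has already accepted something, and records a single blanket promise covering all of the (infinitely many) remaining empty instances.

    # Sketch of "phase 1 for infinitely many instances" on the acceptor side.
    def on_prepare_from(self, first, n):
        # Reply with a set of responses: everything already accepted at
        # instance numbers >= first ...
        responses = {i: (st.accepted_n, st.accepted_value)
                     for i, st in self.instances.items()
                     if i >= first and st.accepted_n >= 0}
        # ... plus one blanket promise covering all higher, empty instances
        # (self.min_promised is an assumed field, checked by later prepares).
        self.min_promised = max(self.min_promised, n)
        return responses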

Deliverables for part 2:

Part 3: Questions (required)

For part 3, you should answer the following questions, essay-style. Limit your answers to just two or three paragraphs; no more is required.

  1. A virtual machine monitor virtualizes the hardware/software interface of a computer. It is possible to provide benefits similar to those of a VMM at different layers of abstraction in a computer. For example:

    • Transmeta's "code-morphing software" would translate x86 instructions, on the fly, into a lower-level VLIW format; in principle, code-morphing software could also be used to virtualize the x86 interface;

    • microkernels provide a hardware abstraction layer (HAL) that, to some degree, isolates higher-level parts of an OS from specific architectural details, enhancing portability;

    • system call emulation layers and compatibility libraries, such as WINE for Windows applications, provide ABI compatibility and the ability to run applications from one platform on another;

    • bytecode-based systems like Java and .NET provide an abstract virtual machine on which bytecode runs, using JIT compilation to translate it on the fly; this permits multiple virtual machines (e.g., JVMs) to run concurrently and promises portability across physical machines.

    Pick one of these (or a different one of your choice that you think belongs in the set of systems/mechanisms discussed above) and compare and contrast it with VMMs like VMware and Xen. Your comparison should identify several bases on which it is important to focus (e.g., performance), and then make a very concise comparison along each basis.

  2. Imagine you work for a company like Google, Facebook, or Microsoft, and you are tasked with designing a storage system that spans multiple data centers, provides strong data consistency and availability even in the face of machine, rack, or data center failure, and provides a nice API to programmers. Sketch out what you would propose, and for each of the general requirements listed in the previous sentence, explain (a) why your design satisfies it, or (b) why you think attempting to satisfy it is a design mistake. Since you're limited to just a few paragraphs, your sketch will of course be very high-level and rough -- focus only on the most crucial issues.

Deliverables for part 3: your answers to the two questions above, in a file called "part3.txt" (see below).

What to turn in

When you're ready to turn in your assignment, do the following:

  1. Create a directory called "problemset3/". In it, you should have four things: a subdirectory called "part1/", an optional subdirectory called "part2/", a file called "part3.txt", and a README.TXT file. The subdirectories should, of course, contain the deliverables for part 1 (and, if you did it, the optional part 2); the file (part3.txt) should contain your answers to the two questions from part 3.

  2. The README.TXT file should contain your name, student number, and UW email address, as well as instructions on how to launch your server.

  3. Create a submission tarball by running the following command, but replacing "UWEMAIL" with your email account name:
    tar -cvzf problemset3_submission_UWEMAIL.tar.gz problemset3
    For example, since my email account is "gribble", I would run the command:
    tar -cvzf problemset3_submission_gribble.tar.gz problemset3

  4. Use the course dropbox (there is a link on the course homepage) to submit that tarball.

As usual, we will be basing your grade on several elements: