CSE 550 problem set #2

CSE550 Problem Set #2

out: Thursday January 23rd, 2014
due: Monday February 10th, 2014 by 5:00pm.

[ Summary | Paxos | Lock Service | How to submit | Grading ]

In order to build an highly available service such as Google's MegaStore/Spanner storage systems or Facebook's social networking site, one of the most important building blocks is to provide a way in which the underlying data is replicated and can be accessed even in the presence of failures. To this end, this assignment will require you to build a replicated state machine based on Paxos. The focus of this assignment will be on the backend Paxos algorithms and in designing and building a highly reliable distributed storage system. More importantly, we would like to provide you with substantial flexibility so that you can choose any language of your choice to demonstrate your work on the Paxos implementation.

Paxos Implementation

The core of the assignment is to develop a Paxos replicated state machine. The basic idea of the Paxos RSM is described in Section 3 of the "Paxos made simple" paper by Lamport. A quick summary of the implementation strategy is as follows:

Model the service that you are implementing as a deterministic state machine whose behavior or output is simply a function of its previous state and any inputs received from the user or environment in the current step.
Implement a multi-instance Paxos algorithm to come to a consensus on the sequence of input values that the service receives in a consistent and reliable way. That is, the i^th instance of Paxos determines the i^th input or command received by the service.
From a client perspective, it can send the service request to any of the nodes that comprise the Paxos group. The node that receives the request tries to pass a Paxos instance using the received value. If it fails to pass it with the current instance, it tries to pass as the consensus value for a subsequent instance.
Run a copy of the underlying state machine on each of the nodes that belongs to the Paxos group. Each state machine executes the i^th command once it is stable and after it has executed all previous commands. A command is stable if it is the chosen value for that instance of Paxos. Once the command is stable and the state machine has processed it, a response is sent back to the client.

You can organize your code in whatever mechanism that you think is appropriate. For example, you could have a single entity that serves all three roles of proposer, acceptor, and listener. The implementation of Paxos should be robust to message loss and node failures and should be able to make progress as long as a majority of nodes are online. Note that a Paxos node could be offline for a period of time, but when it comes back online, it can help constitute a majority for passing Paxos commands. (This is the traditional fail-stop-recover model that Paxos supports.) Include a brief discussion of design issues that came up in a README.TXT file, as well as a rough overview of what your code does.

Lock Service

Demonstrate the correctness of your Paxos implementation by developing a simple lock server as the replicated state machine. The lock server should have the following functionality:

It manages the lock status for a number of locks.
It supports a blocking lock(x) operation that returns when x is assigned to the requesting client. If x is currently assigned to a different client, the request is queued at the state machine. A reply is sent only when the lock becomes available and is assigned to the requesting client.
It supports a unlock(x) operation that frees up the lock.

We would like you to focus primarily on the Paxos implementation, so you can make any simplifying assumptions that you feel appropriate with respect to the actual lock service exported by the replicated RSM. For instance, you can assume the following:

the identity of the Paxos group members can be hardcoded at the client,
the client to server communications are reliable, so you do not have to worry about issues such as atmost-once or atleast-once semantics for the delivery of client requests or responses.

What to turn in

When you're ready to turn in your assignment, do the following:

Create a directory called "problemset2". In it, you should have the code for the assignment, a README.TXT file, and a writeup.
The README.TXT file should contain your name, student number, and UW email address, as well as instructions on how to launch your server.
A major component of your grade will be based on a project write-up, in which your group should explain the protocol and any relevant implementation details. You should include, at a minimum, the following:
- a detailed description of your group's implementation
- any assumptions you made
- how to use your implementation
- any outstanding issues
- anything else that you feel is important to discuss (e.g. quirks, interesting behavior, performance characteristics, etc)
Use the course dropbox (there is a link on the course homepage) to submit that tarball.

Grading

We will be basing your grade on several elements:

Whether your code works! It should be correct, compile without warnings, and not leak memory, processes, descriptors, etc.
How well structured your code is: you should have clean module interfaces, a nice decomposition, good comments, and so on.