CSE550 Problem Set #2
out: thursday oct 19th, 2017
due: friday nov 3rd, 2017 by 5:00pm.
[
Summary |
Paxos |
Lock Service |
Submission |
In order to build an highly available service such as Google's MegaStore/Spanner
storage systems or Facebook's social networking site, one of
the most important building blocks is to provide a way in which the
underlying data is replicated and can be accessed even in the presence
of failures.
To this end, this assignment will require you to build a
replicated state machine based on Paxos.
The focus of this assignment will be on the backend Paxos algorithms
and in designing and building a highly reliable distributed storage
system.
Beyond the base requirements outlined in this document,
this assignment is open-ended and we would like to give you
substantial flexibility in design and implementation choices,
as well as the ability to explore the features of Paxos-based
systems that interest you.
Specifically, you can choose your implementation language,
as long as your code compiles and runs on the UW CSE home VM. (In other
words, C/C++/Go/Java/Python are fine, but not something like C#.)
Because of the open-ended nature of this assignment,
deliverables include (a) a working implementation,
(b) documentation of your system and design decisions in a writeup,
and (c) a presentation of your system in a live demo to a course TA.
The core of the assignment is to develop a Paxos replicated state
machine. The basic idea of the Paxos RSM is described in Section 3 of
the "Paxos made simple" paper by Lamport. A quick summary of the
implementation strategy is as follows:
- Model the service that you are implementing as a deterministic
state machine whose behavior or output is simply a function of its
previous state and any inputs received from the user or environment
in the current step.
- Implement a multi-instance Paxos algorithm to come to a consensus on
the sequence of input values that the service receives in a
consistent and reliable way. That is, the ith instance of
Paxos determines the ith input or command received by the
service.
- From a client perspective, it can send the service request
to any of the nodes that comprise the Paxos group. The node that
receives the request tries to pass a Paxos instance using the
received value. If it fails to pass it with the current instance,
it tries to pass as the consensus value for a subsequent instance.
- Run a copy of the underlying state machine on each of the nodes
that belongs to the Paxos group. Each state machine executes the
ith command once it is stable and after it has executed all
previous commands. A command is stable if it is the chosen value
for that instance of Paxos. Once the command is stable and the
state machine has processed it, a response is sent back to the
client.
You can organize your code in whatever mechanism that you think is
appropriate. For example, you could have a single entity that serves
all three roles of proposer, acceptor, and listener. The only
requirement is that you must have the ability to terminate any of your
Paxos nodes at will (e.g., if each node is a separate process, you can
kill the process to terminate the node). Include a discussion of
design issues that came up in your writeup, as well as a rough
overview of what your code does.
The
implementation of Paxos should be robust to message loss and node
failures and should be able to make progress as long as a majority of
nodes are online.
However, given the short timeline for this assignment, we don't require
the following features in your Paxos implementation:
- Node recovery. Assume that once a node fails then it will
never recover. This means that there is no need for a
persistent log to disk.
- Resending messages. Although your implementation should
still be able to make progress given a reasonable threshold of
lost messages, your implementation does not need to
resend messages that may have been lost.
Demonstrate the correctness of your Paxos implementation by
developing a simple lock server as the replicated state machine. The
lock server should have the following functionality:
- It manages the lock status for a number of locks.
- It supports a lock(x) operation that:
- If the lock is available, it obtains the lock and returns success.
- If the lock has been assigned to some other client, it simply returns failure. It is up to the client to retry the operation using some form of exponential back-off heuristic.
- It supports a unlock(x) operation that frees up the
lock.
We would like you to focus primarily on the Paxos implementation, so
you can make any simplifying assumptions that you feel appropriate
with respect to the actual lock service exported by the replicated
RSM. For instance, you can assume the following:
- the identity of the Paxos group members can be hardcoded at the
client,
- clients will be well-behaved, so that your system does not need to
handle deadlock, clients that fail after acquiring a lock, etc.
- the client to server communications are reliable, so you do not
have to worry about issues such as atmost-once or atleast-once
semantics for the delivery of client requests or responses.
Client code that you submit with the assignment should however satisfy
the following two requirements that will aid in testing:
- clients should be able to issue requests to different replicas of
the lock service. If all of the clients issue requests to the same
replica, it’s hard to know whether the underlying Paxos
implementation works. One option is to submit the client request to
a randomly chosen replica.
- clients should be able to submit multiple client requests to the
lock service concurrently.
When you're ready to turn in your assignment, do the following:
- Create a directory called "problemset2". In it, you should
have the code for the assignment, a README.txt file, and a writeup.
- The README.txt file should contain the name,
student number, and UW email address for all team members,
as well as instructions on how to launch your server.
- A major component of your grade will
be based on a project write-up, in which your group should explain the
protocol and any relevant implementation details. You should include,
at a minimum, the following:
- a detailed description of your group's implementation
- any assumptions you made
- how to use your implementation
- any outstanding issues
- anything else that you feel is important to discuss
(e.g. quirks, interesting behavior, performance characteristics,
etc)
- Use the course dropbox (there is a link on the course
homepage) to submit that tarball.
- Schedule a live demo to present your system.
Your grade will be based on all deliverables: implementation, writeup, and live demo.