CSEP551 -- Programming Assignment #2
Out: Thursday February 21st, 2008
Due: Thursday March 6th, 2008, before class
In this assignment, you will build a simple Web service that
provides a RESTful API for storing and retrieving blocks of data.
Your goal is to learn about some modern technological building
blocks, including:
- building simple HTTP services
- XML parsers, and the use of XML over HTTP as an RPC equivalent
Overview
In the language of your choice, you should implement a Web service
that listens on a port of your choosing, and exports a REST-like API
that lets programs (or people) use HTTP GET, POST, and DELETE
operations to read, write, and delete blocks of data stored by
your service.
First, some terminology:
- we define a "blockdata" object to be 4096 bytes of binary
data. No other block size (smaller or larger) is permissible.
- we define a "blockid" object to be the SHA1 hash of
the block data. A SHA1 hash, and therefore a blockid object,
contains 20 bytes of binary data. Most languages contain
support libraries for doing SHA1 hashes. For example,
here is a page describing Python's
support for SHA1 hashes (and other hash routines).
- we define a "block" object to be the union of a
blockid and a blockdata object. The blockid object
within a block must have a value which is the SHA1 hash
of the blockdata object within the block.
- we will use "url safe base64 encoding" to represent binary
data as safe, ASCII strings. For example, here is
a url-safe, base64 encoded 20-byte block ID:
fx8mgRaLcV8U9vxUuAibm9QOxg0=
Most languages contain support libraries for doing
URL-safe base64 encoding and decoding. For example,
here is a page describing Python's
support for base64 encoding.
Be careful -- "url safe base64 encoding" is subtly
different than standard base64 encoding, so be sure you're using the
right one. In particular, I'd like everybody to use an implementation
of url safe base64 encoding that makes the following substitutions, and
no other:
- "-" instead of "+"
- "_" instead of "/"
As a cautionary example, somebody pointed me towards a Java
implementation that also substitutes "." for "=" (i.e., it
changes the padding character). Do not use this implementation
(or, if you use it, modify it to fix it), otherwise you'll fail
all of the conformance tests...
- we define a blocklist to be an XML document containing
a list of block objects. Informally, the schema of the
XML document should be:
<blocklist>
<block>
<blockid>(some url-safe base64 data)</blockid>
<blockdata>(some url-safe base64 data)</blockdata>
</block>
<block>
<blockid>(some url-safe base64 data)</blockid>
<blockdata>(some url-safe base64 data)</blockdata>
</block>
...etc.
</blocklist>
A blocklist document can contain zero or more block
objects.
As a concrete example, here is
a valid blocklist document that contains two block
objects. (Note: your web browser won't render it correctly;
download the linked document to your computer and look
at it with a text editor.)
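The terminology above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not a reference implementation: the helper names (encode, make_block_id, make_blocklist_xml) are made up for illustration, and it relies on the standard hashlib, base64, and xml.etree.ElementTree modules. Python's urlsafe_b64encode makes exactly the two substitutions listed above and keeps "=" as the padding character.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

BLOCK_SIZE = 4096  # the only permitted blockdata size


def encode(raw: bytes) -> str:
    """URL-safe base64: '-' replaces '+', '_' replaces '/', '=' padding is kept."""
    return base64.urlsafe_b64encode(raw).decode("ascii")


def make_block_id(blockdata: bytes) -> bytes:
    """Return the 20-byte SHA1 hash of a 4096-byte blockdata object."""
    if len(blockdata) != BLOCK_SIZE:
        raise ValueError("blockdata must be exactly 4096 bytes")
    return hashlib.sha1(blockdata).digest()


def make_blocklist_xml(blockdatas) -> bytes:
    """Serialize zero or more blockdata objects as a blocklist document."""
    root = ET.Element("blocklist")
    for data in blockdatas:
        block = ET.SubElement(root, "block")
        ET.SubElement(block, "blockid").text = encode(make_block_id(data))
        ET.SubElement(block, "blockdata").text = encode(data)
    return ET.tostring(root, encoding="utf-8")
```

An encoded blockid is always 28 characters long (27 alphabet characters plus one "="), which is a quick sanity check to run against your own output.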
In case you are having trouble validating that you're properly
doing the SHA1 hashing, the url-safe base64 encoding, or the XML
document generation, here is a tar.gz
archive that contains some example files that should help you
out. In particular, it contains the following files:
- blockdata.bin: a file containing 4096 bytes of binary data
- blockid.bin: a file containing the SHA1 hash of blockdata.bin; note that this SHA1 hash is a 20 byte value
- blockdata.bin.urlsafe_base64: the urlsafe base64 encoding of blockdata.bin
- blockid.bin.urlsafe_base64: the urlsafe base64 encoding of blockid.bin
- blocklist.xml: an XML doc containing the blocklist structure associated with this block
Web server interface
Following the REST philosophy, each block of data stored within your
Web server should have its own URL. This implies that your Web
server must be able to handle URLs of the following form:
/csep551/fx8mgRaLcV8U9vxUuAibm9QOxg0=
i.e., each URL starts with "/csep551/" and then is followed
by the url-safe base64 encoding of a blockid.
For each of these kinds of URL, your server must handle the following HTTP
methods:
- GET: your server should return a blocklist document containing
the block associated with the blockID requested. Your server should
return status code 200 if the requested block was found, and status
code 404 if the requested block was not found. The content-type of
the document (i.e., the Content-type HTTP header) that you return
should be text/xml.
- DELETE: your server should delete the block
specified in the URL. If the block was found and deleted,
the server should return status code 200. If the block
didn't exist, the server should return status code 404.
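Before a GET or DELETE can be served, the blockid has to be recovered from the URL. The sketch below shows one way to do that strictly. The function name parse_block_path is hypothetical, and two choices in it are assumptions rather than requirements of this assignment: it tolerates clients that percent-encode the "=" as "%3D", and it returns None for anything malformed, leaving the choice of status code to the caller.

```python
import base64
import re
from urllib.parse import unquote

PREFIX = "/csep551/"
# A 20-byte SHA1 hash always encodes to 27 alphabet characters plus one '='.
BLOCKID_RE = re.compile(r"[A-Za-z0-9_-]{27}=")


def parse_block_path(path: str):
    """Return the 20-byte blockid named by a URL path, or None if malformed."""
    if not path.startswith(PREFIX):
        return None
    # Assumption: accept "%3D" as well as a literal "=" for the padding.
    encoded = unquote(path[len(PREFIX):])
    if not BLOCKID_RE.fullmatch(encoded):
        return None
    return base64.urlsafe_b64decode(encoded)
```

The explicit pattern check is there because Python's base64 decoder, by default, silently skips characters outside the alphabet instead of rejecting them.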
As well, your server should handle uploading of blocks.
To upload a block, a client should use the HTTP POST method,
passing in a blocklist document as body, to the following URL:
/csep551/post
The blocklist document should be uploaded verbatim, and clients are
encouraged to use a Content-type HTTP header of text/xml
in their POST request.
For each block in the blocklist, the server should verify that
the blockid matches the hash of the blockdata. If all hashes
match, the server should store the blocks and return a status
code of 200. If any hash doesn't match, the server should
store no blocks and return a status code of 400.
If your server doesn't have the space to store all blocks,
no blocks should be stored, and the server should return
a status code of 503.
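The all-or-nothing POST rules above might be sketched like this. It is a sketch under assumptions, not the expected implementation: handle_post and decode are hypothetical names, the store is a plain in-memory dict standing in for real durable storage, capacity is counted in blocks, and malformed XML or malformed base64 is treated as a 400 (the assignment only specifies 400 for a hash mismatch).

```python
import base64
import hashlib
import re
import xml.etree.ElementTree as ET

BLOCK_SIZE = 4096
URLSAFE_RE = re.compile(r"[A-Za-z0-9_-]*={0,2}")


def decode(text):
    """Strictly decode URL-safe base64; raise ValueError on anything else."""
    text = (text or "").strip()
    if len(text) % 4 != 0 or not URLSAFE_RE.fullmatch(text):
        raise ValueError("not url-safe base64")
    return base64.urlsafe_b64decode(text)


def handle_post(body: bytes, store: dict, capacity_blocks: int) -> int:
    """Validate a POSTed blocklist and return the HTTP status code to send."""
    try:
        root = ET.fromstring(body)
        if root.tag != "blocklist":
            raise ValueError("root element is not <blocklist>")
        pending = {}
        for block in root.findall("block"):
            blockid = decode(block.findtext("blockid"))
            blockdata = decode(block.findtext("blockdata"))
            if len(blockdata) != BLOCK_SIZE:
                raise ValueError("blockdata is not 4096 bytes")
            if hashlib.sha1(blockdata).digest() != blockid:
                raise ValueError("blockid does not match blockdata")
            pending[blockid] = blockdata
    except (ET.ParseError, ValueError):
        return 400  # any invalid block: store nothing
    new_ids = pending.keys() - store.keys()
    if len(store) + len(new_ids) > capacity_blocks:
        return 503  # not enough room for everything: store nothing
    store.update(pending)  # only reached once every block has been checked
    return 200
```

Collecting the verified blocks in a separate dict first, and touching the store only at the very end, is what keeps a partially valid blocklist from being partially stored.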
Finally, when the URL "/" is requested, your server should return a
simple Web page that contains only the following string:
gribble(at)cs.washington.edu - Steven Gribble
i.e., your email address with the @ symbol replaced with the
string "(at)", followed by a space, followed by a dash, followed
by a space, followed by your name. This single string
is the full HTML document; you should not have any HTML
tags (i.e., no <html>, <body>, etc.). Also, this
response should come with status code 200.
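As an illustration, the "/" page could be served with Python's standard http.server module along these lines. This is a sketch only: EMAIL and NAME are placeholder values, identity_line and RootHandler are hypothetical names, and the text/html content type is an assumption, since the assignment does not name one for this page.

```python
from http.server import BaseHTTPRequestHandler

EMAIL = "student@example.edu"  # placeholder, substitute your own address
NAME = "Pat Student"           # placeholder, substitute your own name


def identity_line(email: str, name: str) -> str:
    """Build the one-line page body: email with '@' -> '(at)', ' - ', name."""
    return f"{email.replace('@', '(at)')} - {name}"


class RootHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/":
            self.send_error(404)
            return
        body = identity_line(EMAIL, NAME).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")  # assumed content type
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # no HTML tags, just the string itself
```

To try it out, something like http.server.HTTPServer(("", 8000), RootHandler).serve_forever() would serve the page on port 8000; the port here is arbitrary.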
Just to summarize:
- The unit of upload is the blocklist. Your server should always
either store all or none of the blocks within the POSTed
blocklist. If it returns a 200 status code, this implies all blocks
were stored.
- The unit of download (GET) and deletion (DELETE) is the individual
block. This means that clients should be able to download or delete
an individual block, even if that block was uploaded as part of
a set of blocks in a blocklist. Another way to think of this is
that your server stores individual blocks, but clients are able
to batch upload a set of blocks by POST'ing a blocklist containing
that set.
Additional requirements
Registering your server
While your web server is running, it should periodically (every 5
minutes) contact a server that I am running at
futureproof.cs.washington.edu on port 8080,
and "register" itself by simply fetching the following URL:
/register/hostname/portnum
where "hostname" is the hostname of your web server, and
"portnum" is the port number that it is running at. So, for
example, if you run your web server at foo.com at port
1004, your server should fetch the following URL every 5 minutes:
http://futureproof.cs.washington.edu:8080/register/foo.com/1004
When our server sees a registration attempt, it will:
- parse the URL and verify that the hostname and port number are well-formatted and meaningful
- attempt to connect to your web server to retrieve the "/" URL, just to perform an end-to-end check
- return status code 200 if everything works
- return status code 404 otherwise
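The periodic registration described above could be sketched with a background thread, roughly as follows. The function names are hypothetical, and running the loop in a daemon thread is just one possible design; a timer inside your server's event loop would work equally well.

```python
import http.client
import threading
import urllib.request

REGISTRAR = "http://futureproof.cs.washington.edu:8080"
INTERVAL_SECONDS = 5 * 60


def registration_url(hostname: str, portnum: int) -> str:
    """Build the /register/hostname/portnum URL for this server."""
    return f"{REGISTRAR}/register/{hostname}/{portnum}"


def register_once(hostname: str, portnum: int, timeout: float = 30) -> bool:
    """Fetch the registration URL; True means the registrar answered 200."""
    try:
        url = registration_url(hostname, portnum)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers a 404 reply (HTTPError), timeouts, and connection failures.
        return False


def start_registration(hostname: str, portnum: int, stop: threading.Event):
    """Register immediately, then every 5 minutes until `stop` is set."""
    def loop():
        while not stop.is_set():
            register_once(hostname, portnum)
            stop.wait(INTERVAL_SECONDS)  # returns early if stop is set

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Note that the registrar connects back to fetch "/" before it answers, so the first registration should only be attempted once your own server is already accepting connections.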
State kept durably
Your server must store data durably -- even if your server
crashes, if it previously accepted a POST of some blocks,
those blocks must be GET/DELETE-able when your server
resumes from the crash.
You can limit the amount of data you choose to store to
100MB if you like; if somebody attempts to POST blocks that
would cause your server to exceed that limit, you should
return a 503 status code.
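One common way to get crash-safe storage of a single block is to write it to a temporary file, force it to disk, and then rename it into place. The sketch below assumes one file per block in a directory of your choosing, named by the hex form of the blockid; all function names are hypothetical. It does not by itself make a multi-block POST atomic, and it omits syncing the containing directory, which some filesystems need for the rename itself to survive a crash.

```python
import os
import tempfile


def block_path(directory: str, blockid: bytes) -> str:
    # Hex names avoid trouble on case-insensitive filesystems.
    return os.path.join(directory, blockid.hex())


def store_block(directory: str, blockid: bytes, blockdata: bytes) -> None:
    """Write the block to a temp file, fsync it, then rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blockdata)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp, block_path(directory, blockid))  # atomic rename
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_block(directory: str, blockid: bytes):
    """Return the stored blockdata, or None if the block does not exist."""
    try:
        with open(block_path(directory, blockid), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


def delete_block(directory: str, blockid: bytes) -> bool:
    """Delete the block; False means it was not there (i.e., a 404)."""
    try:
        os.unlink(block_path(directory, blockid))
        return True
    except FileNotFoundError:
        return False
```

Because a reader only ever sees fully written files under their final names, a crash in the middle of store_block leaves at worst a stray .tmp file that can be cleaned up at startup.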
Dealing with errors
Your server should be robust -- you should try to handle
any error corner cases that you can think of, including
those arising from poorly behaving or buggy clients.
NATs and port forwarding
Your server must be accessible from the wide-area internet.
This means that if you choose to run it at home, and you
have a home router that does NAT, you'll need to do port
forwarding through your NAT so that my code can talk to
your server, even though my code will be running at UW.
You also have access to the attu
instructional cluster, if you'd prefer to run your code there. If you
choose to do this, please be nice, and don't whammy the machines.
Test code
You will want to write a small test suite to make sure that
your server functions correctly. Your test suite should be
designed to exercise all parts of the server behavior
specified in this assignment. (Feel free to share
test suite code with each other!)
Sometime very soon, the instructor will have a Web
server running that adheres to this specification, so that you
can test your test suite against it.
Keep your server running
You must attempt to keep your server running until the last lecture of
the quarter. Given that your server registers itself, we know where
(and when) your servers are running. Your instructor and TA will be
accessing your server periodically to try it out, test it for
conformance, and measure its availability. :)
Using my test harness
If you have your server up and running, and want to see a test report on
how conformant it is, just launch your browser and visit the following
URL:
http://futureproof.cs.washington.edu:8080/testharness/hostname/portnum
where hostname is the DNS name of your server, and
portnum is the port number it is running on. If all goes
well, your server will experience some mild load, and then you'll see
a report displayed. Search for "--" in the report; those are tests
that your server failed. Anything that has "++" is a success case.
The report is a little terse and cryptic, but hopefully it's helpful.
What to turn in
Your submission should be a single .tar.gz or .zip file, containing the
following elements:
- a writeup that includes:
- a brief description of how you implemented your server --
what operating system, language, and development tools did you take
advantage of, and why?
- a brief description of any interesting engineering problems
you faced, such as dealing with errors
- measurements showing the latencies of GET and POST
operations, as well as the throughput of a large number
of back-to-back POST and GET operations, and a short
analysis of your measurements (why do you think they
are what they are?)
- source code for your server, and a precise description of
what libraries we would need in order to run your server,
and any additional instructions needed to launch it
- source code for the test suite that you wrote or
appropriated. If you borrowed code from one of your
classmates, be sure to credit them.
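For the latency and throughput measurements requested in the writeup, a small timing helper along the following lines may be enough. It is a sketch only: the name measure and the choice of summary statistics are assumptions, and `operation` stands for any zero-argument callable that performs one GET or POST against your server.

```python
import statistics
import time


def measure(operation, count: int) -> dict:
    """Run `operation` back-to-back `count` times and summarize the timings."""
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "ops_per_s": count / elapsed if elapsed > 0 else float("inf"),
    }
```

Running it from both the same machine and a remote one can help separate the cost of your server from the cost of the network when you analyze the numbers.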