CSEP551 Winter 2008 -- Programming Assignment #2

Out: Thursday February 21st, 2008
Due: Thursday March 6th, 2008, before class

In this assignment, you will build a simple Web service that provides a RESTful API for storing and retrieving blocks of data. Your goal is to learn about some modern technological building blocks, including:

building simple HTTP services
XML parsers, and the use of XML over HTTP as an RPC equivalent

Overview

In the language of your choice, you should implement a Web service that listens on a port of your choosing, and exports a REST-like API that lets programs (or people) use HTTP GET, POST, and DELETE operations to read, write, and delete blocks of data stored by your service.

First, some terminology:

we define a "blockdata" object to be 4096 bytes of binary data. No other block size (smaller or larger) is permissible.
we define a "blockid" object to be the SHA1 hash of the block data. A SHA1 hash, and therefore a blockid object, contains 20 bytes of binary data. Most languages contain support libraries for doing SHA1 hashes. For example, here is a page describing Python's support for SHA1 hashes (and other hash routines).
we define a "block" object to be the union of a blockid and a blockdata object. The blockid object within a block must have a value which is the SHA1 hash of the blockdata object within the block.
we will use "url safe base64 encoding" to represent binary data as safe, ASCII strings. For example, here is a url-safe, base64 encoded 20-byte block ID:
```
  fx8mgRaLcV8U9vxUuAibm9QOxg0=
```
Most languages contain support libraries for doing URL-safe base64 encoding and decoding. For example, here is a page describing Python's support for base64 encoding.
Be careful -- "url safe base64 encoding" is subtly different from standard base64 encoding, so be sure you're using the right one. In particular, I'd like everybody to use an implementation of url safe base64 encoding that makes the following substitutions, and no others:
- - instead of +
- _ instead of /
As a cautionary example, somebody pointed me towards a Java implementation that also substitutes . for = (i.e., it changes the padding character). Do not use this implementation (or, if you use it, modify it so that it keeps = as the padding character); otherwise you'll fail all of the conformance tests.
we define a blocklist to be an XML document containing a list of block objects. Informally, the schema of the XML document should be:
```
<blocklist>

  <block>
     <blockid>(some url-safe base64 data)</blockid>
     <blockdata>(some url-safe base64 data)</blockdata>
  </block>

  <block>
     <blockid>(some url-safe base64 data)</blockid>
     <blockdata>(some url-safe base64 data)</blockdata>
  </block>

  ...etc.

</blocklist>
```
A blocklist document can contain zero or more block objects.
As a concrete example, here is a valid blocklist document that contains two block objects. (Note: your web browser won't render it correctly; download the linked document to your computer and look at it with a text editor.)
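
To make the terminology above concrete, here is a minimal Python sketch of the three building blocks: hashing a blockdata object into a blockid, url-safe base64 encoding, and generating a blocklist document. The function names (make_blockid, encode, build_blocklist) are illustrative choices, not part of the assignment, and the use of xml.etree.ElementTree is just one convenient option.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

BLOCK_SIZE = 4096  # the only permissible blockdata size


def make_blockid(blockdata: bytes) -> bytes:
    """Return the 20-byte SHA1 digest of a 4096-byte blockdata object."""
    if len(blockdata) != BLOCK_SIZE:
        raise ValueError("blockdata must be exactly 4096 bytes")
    return hashlib.sha1(blockdata).digest()


def encode(raw: bytes) -> str:
    """url-safe base64: '-' replaces '+', '_' replaces '/', '=' padding is kept."""
    return base64.urlsafe_b64encode(raw).decode("ascii")


def build_blocklist(blocks: list[bytes]) -> bytes:
    """Build a blocklist document holding one <block> per blockdata object."""
    root = ET.Element("blocklist")
    for data in blocks:
        block = ET.SubElement(root, "block")
        ET.SubElement(block, "blockid").text = encode(make_blockid(data))
        ET.SubElement(block, "blockdata").text = encode(data)
    return ET.tostring(root, encoding="utf-8")
```

Python's base64.urlsafe_b64encode makes exactly the two substitutions listed above and leaves the = padding alone, so a 20-byte blockid always encodes to 28 characters ending in a single =.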

In case you are having trouble validating that you're properly doing the SHA1 hashing, the url-safe base64 encoding, or the XML document generation, here is a tar.gz archive that contains some example files that should help you out. In particular, it contains the following files:

blockdata.bin: a file containing 4096 bytes of binary data
blockid.bin: a file containing the SHA1 hash of blockdata.bin; note that this SHA1 hash is a 20 byte value
blockdata.bin.urlsafe_base64: the urlsafe base64 encoding of blockdata.bin
blockid.bin.urlsafe_base64: the urlsafe base64 encoding of blockid.bin
blocklist.xml: an XML doc containing the blocklist structure associated with this block
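
One way to use these files is to check your own hashing and encoding against them. The sketch below assumes the archive has been unpacked into a directory of your choosing, and that the two .urlsafe_base64 files contain only the encoded string, possibly followed by a newline; that layout is an assumption, so adjust the comparison if the files turn out to differ. The function name check_example_files is hypothetical.

```python
import base64
import hashlib
from pathlib import Path


def check_example_files(directory: str) -> bool:
    """Return True if our hashing and encoding reproduce the example files."""
    d = Path(directory)
    data = (d / "blockdata.bin").read_bytes()
    blockid = (d / "blockid.bin").read_bytes()
    # Assumption: the encoded files may end with a newline, so strip whitespace.
    data_b64 = (d / "blockdata.bin.urlsafe_base64").read_bytes().strip()
    id_b64 = (d / "blockid.bin.urlsafe_base64").read_bytes().strip()
    return (
        len(data) == 4096
        and hashlib.sha1(data).digest() == blockid
        and base64.urlsafe_b64encode(data) == data_b64
        and base64.urlsafe_b64encode(blockid) == id_b64
    )
```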

Web server interface

Following the REST philosophy, each block of data stored within your Web server should have its own URL. This implies that your Web server must be able to handle URLs of the following form:

  /csep551/fx8mgRaLcV8U9vxUuAibm9QOxg0=

i.e., each URL starts with "/csep551/" and then is followed by the url-safe base64 encoding of a blockid.

For each of these kinds of URL, your server must handle the following HTTP methods:

GET: your server should return a blocklist document containing the block associated with the blockID requested. Your server should return status code 200 if the requested block was found, and status code 404 if the requested block was not found. The content-type of the document (i.e., the Content-type HTTP header) that you return should be text/xml.
DELETE: your server should delete the block specified in the URL. If the block was found and deleted, the server should return status code 200. If the block didn't exist, the server should return status code 404.

As well, your server should handle uploading of blocks. To upload a block, a client should use the HTTP POST method, passing in a blocklist document as body, to the following URL:

    /csep551/post

The blocklist document should be uploaded verbatim, and clients are encouraged to use a Content-type HTTP header of text/xml in their POST request.

For each block in the blocklist, the server should verify that the blockid matches the hash of the blockdata. If all hashes match, the server should store the blocks and return a status code of 200. If any hash doesn't match, the server should store no blocks and return a status code of 400. If your server doesn't have the space to store all blocks, no blocks should be stored, and the server should return a status code of 503.
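
The all-or-nothing check described above can be sketched as a function that parses and verifies the whole document before anything is stored. This is only one possible approach: the names parse_blocklist and BadBlocklist are hypothetical, and treating malformed XML or malformed base64 as a 400 is an assumption, since the assignment only specifies 400 for a hash mismatch.

```python
import base64
import binascii
import hashlib
import re
import xml.etree.ElementTree as ET

# Only the url-safe alphabet, followed by at most two '=' padding characters.
_URLSAFE = re.compile(r"[A-Za-z0-9_-]*={0,2}")


class BadBlocklist(Exception):
    """Raised when a POSTed blocklist should be rejected with status 400."""


def decode(text) -> bytes:
    text = (text or "").strip()  # assumption: surrounding whitespace is tolerated
    if not _URLSAFE.fullmatch(text):
        raise BadBlocklist("not url-safe base64")
    try:
        return base64.urlsafe_b64decode(text)
    except binascii.Error as exc:
        raise BadBlocklist("bad base64 length or padding") from exc


def parse_blocklist(body: bytes) -> dict[bytes, bytes]:
    """Return {blockid: blockdata} only if every block in the document is valid."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        raise BadBlocklist("not well-formed XML") from exc
    if root.tag != "blocklist":
        raise BadBlocklist("root element is not <blocklist>")
    blocks = {}
    for block in root:
        if block.tag != "block":
            raise BadBlocklist("unexpected element inside <blocklist>")
        blockid = decode(block.findtext("blockid"))
        data = decode(block.findtext("blockdata"))
        if len(data) != 4096:
            raise BadBlocklist("blockdata is not 4096 bytes")
        if hashlib.sha1(data).digest() != blockid:
            raise BadBlocklist("blockid is not the SHA1 hash of blockdata")
        blocks[blockid] = data
    return blocks
```

Because the function returns nothing until every block has been checked, a caller can map BadBlocklist to status 400 and be sure that no block from a bad document was stored.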

Finally, when the URL "/" is requested, your server should return a simple Web page that contains only the following string:

gribble(at)cs.washington.edu - Steven Gribble

i.e., your email address with the @ symbol replaced with the string "(at)", followed by a space, followed by a dash, followed by a space, followed by your name. This single string is the full HTML document; you should not have any HTML tags (i.e., no <html>, <body>, etc.). Also, this response should come with status code 200.

Just to summarize:

The unit of upload is the blocklist. Your server should always either store all or none of the blocks within the POSTed blocklist. If it returns a 200 status code, this implies all blocks were stored.
The unit of download (GET) and deletion (DELETE) is the individual block. This means that clients should be able to download or delete an individual block, even if that block was uploaded as part of a set of blocks in a blocklist. Another way to think of this is that your server stores individual blocks, but clients are able to batch upload a set of blocks by POST'ing a blocklist containing that set.
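
Putting the interface together, here is a rough Python sketch of the URL routing and status codes using the standard http.server module. It is deliberately incomplete: blocks live in an in-memory dictionary (so it does not meet the durability requirement), there is no space limit, and ROOT_PAGE is a placeholder you would replace with your own address and name. The names STORE, BlockHandler, and serve are illustrative, and sending the "/" page as text/html is an assumption.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import unquote

STORE = {}  # blockid (bytes) -> blockdata (bytes); in memory only, NOT durable
PREFIX = "/csep551/"
ROOT_PAGE = b"yourname(at)example.com - Your Name"  # placeholder identity


def b64(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).decode("ascii")


class BlockHandler(BaseHTTPRequestHandler):
    def reply(self, status, body=b"", ctype="text/xml"):
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def blockid_from_path(self):
        """Decode the blockid that follows /csep551/, or return None."""
        if not self.path.startswith(PREFIX):
            return None
        try:
            # unquote() tolerates clients that send '=' as %3D
            return base64.urlsafe_b64decode(unquote(self.path[len(PREFIX):]))
        except ValueError:
            return None

    def do_GET(self):
        if self.path == "/":
            return self.reply(200, ROOT_PAGE, "text/html")
        blockid = self.blockid_from_path()
        if blockid not in STORE:
            return self.reply(404)
        doc = ("<blocklist><block><blockid>%s</blockid><blockdata>%s</blockdata>"
               "</block></blocklist>" % (b64(blockid), b64(STORE[blockid])))
        self.reply(200, doc.encode("ascii"))

    def do_DELETE(self):
        if STORE.pop(self.blockid_from_path(), None) is None:
            return self.reply(404)
        self.reply(200)

    def do_POST(self):
        if self.path != PREFIX + "post":
            return self.reply(404)
        try:
            length = int(self.headers.get("Content-Length", 0))
            root = ET.fromstring(self.rfile.read(length))
            staged = {}
            for block in root.iter("block"):
                blockid = base64.urlsafe_b64decode(block.findtext("blockid", ""))
                data = base64.urlsafe_b64decode(block.findtext("blockdata", ""))
                if len(data) != 4096 or hashlib.sha1(data).digest() != blockid:
                    raise ValueError("blockid does not match blockdata")
                staged[blockid] = data
        except (ET.ParseError, ValueError):
            return self.reply(400)  # nothing from this blocklist is stored
        STORE.update(staged)  # every block verified, so store them together
        self.reply(200)


def serve(port: int) -> None:
    """Run the sketch server on the given port until interrupted."""
    ThreadingHTTPServer(("", port), BlockHandler).serve_forever()
```

Calling serve(8000) would start the sketch on port 8000. Blocks are staged in a temporary dictionary and merged into STORE only after the whole blocklist has been verified, which is what gives the all-or-none behavior for uploads.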

Additional requirements

Registering your server

While your web server is running, it should periodically (every 5 minutes) contact a server that I am running at futureproof.cs.washington.edu on port 8080, and "register" itself by simply fetching the following URL:

    /register/hostname/portnum

where "hostname" is the hostname of your web server, and "portnum" is the port number that it is running at. So, for example, if you run your web server at foo.com at port 1004, your server should fetch the following URL every 5 minutes:

    http://futureproof.cs.washington.edu:8080/register/foo.com/1004

When our server sees a registration attempt, it will:

parse the URL and verify that the hostname and port number are well-formatted and meaningful
attempt to connect to your web server to retrieve the "/" URL, just to perform an end-to-end check
return status code 200 if everything works
return status code 404 otherwise
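
The periodic registration can be sketched in Python as a background thread that fetches the registration URL every 5 minutes. The function names and the 30-second timeout are illustrative choices, and the fetch parameter exists only so the network call can be swapped out when testing.

```python
import threading
import urllib.request

REGISTRAR = "http://futureproof.cs.washington.edu:8080"
INTERVAL_SECONDS = 5 * 60  # register every 5 minutes


def registration_url(hostname: str, port: int) -> str:
    return f"{REGISTRAR}/register/{hostname}/{port}"


def register_once(hostname: str, port: int, fetch=urllib.request.urlopen) -> bool:
    """Fetch the registration URL once; True means the registrar answered 200."""
    try:
        with fetch(registration_url(hostname, port), timeout=30) as resp:
            return resp.status == 200
    except OSError:
        # urllib raises HTTPError/URLError (both OSError subclasses) for a
        # 404 reply or for a network failure.
        return False


def start_registration(hostname: str, port: int) -> threading.Event:
    """Start a daemon thread that re-registers until the returned event is set."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            ok = register_once(hostname, port)
            print("registration", "succeeded" if ok else "failed")
            stop.wait(INTERVAL_SECONDS)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Since the registrar connects back to fetch "/" as part of the check, your server needs to be accepting requests before the first registration attempt is made.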

State kept durably

Your server must store data durably -- even if your server crashes, if it previously accepted a POST of some blocks, those blocks must be GET/DELETE-able when your server resumes from the crash.

You can limit the amount of data you choose to store to 100MB if you like; if somebody attempts to POST blocks that would cause your server to exceed that limit, you should return a 503 status code.
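
One simple way to get durability is to keep each block in its own file and only acknowledge a POST after the data has been flushed to disk. The sketch below is an assumption-laden illustration rather than a required design: the DiskStore and StoreFull names are hypothetical, 100MB is interpreted as 100 * 1024 * 1024 bytes, and a crash in the middle of put_all could still leave a partially stored blocklist on disk, which a real implementation would need to handle (for example with a journal).

```python
import base64
import os
from pathlib import Path

CAPACITY_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB
BLOCK_SIZE = 4096


class StoreFull(Exception):
    """Raised when storing the blocks would exceed capacity (map to 503)."""


class DiskStore:
    def __init__(self, directory, capacity=CAPACITY_BYTES):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.capacity = capacity

    def _path(self, blockid: bytes) -> Path:
        # url-safe base64 never contains '/', so it is usable as a file name
        return self.dir / base64.urlsafe_b64encode(blockid).decode("ascii")

    def used_bytes(self) -> int:
        return sum(p.stat().st_size for p in self.dir.iterdir()
                   if not p.name.endswith(".tmp"))

    def put_all(self, blocks: dict) -> None:
        """Store every block, or raise StoreFull before writing any of them."""
        new = {k: v for k, v in blocks.items() if not self._path(k).exists()}
        if self.used_bytes() + len(new) * BLOCK_SIZE > self.capacity:
            raise StoreFull()
        for blockid, data in new.items():
            final = self._path(blockid)
            tmp = final.with_name(final.name + ".tmp")
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # force the bytes to disk before renaming
            os.replace(tmp, final)    # atomic rename: no half-written blocks

    def get(self, blockid: bytes):
        try:
            return self._path(blockid).read_bytes()
        except FileNotFoundError:
            return None

    def delete(self, blockid: bytes) -> bool:
        try:
            self._path(blockid).unlink()
            return True
        except FileNotFoundError:
            return False
```

Because the store keeps no state in memory, a fresh DiskStore pointed at the same directory after a crash sees every block that was fully written before the crash.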

Dealing with errors

Your server should be robust -- you should try to handle any corner cases of errors that you can think of, including those arising from poorly behaving or buggy clients.

NATs and port forwarding

Your server must be accessible from the wide-area internet. This means that if you choose to run it at home, and you have a home router that does NAT, you'll need to do port forwarding through your NAT so that my code can talk to your server, even though my code will be running at UW.

You also have access to the attu instructional cluster, if you'd prefer to run your code there. If you choose to do this, please be nice, and don't whammy the machines.

Test code

You will want to write a small test suite to make sure that your server functions correctly. Your test suite should be designed to exercise all parts of the server behavior specified in this assignment. (Feel free to share test suite code with each other!)

Very soon, the instructor will have a Web server running that adheres to this specification, so that you can test your test suite against it.

Keep your server running

You must attempt to keep your server running until the last lecture of the quarter. Given that your server registers itself, we know where (and when) your servers are running. Your instructor and TA will be accessing your server periodically to try it out, test it for conformance, and to measure its availability. :)

Using my test harness

If you have your server up and running, and want to see a test report on how conformant it is, just launch your browser and visit the following URL:

  http://futureproof.cs.washington.edu:8080/testharness/hostname/portnum

where hostname is the DNS name of your server, and portnum is the port number it is running on. If all goes well, your server will experience some mild load, and then you'll see a report displayed. Search for "--" in the report; those are tests that your server failed. Anything that has "++" is a success case. The report is a little terse and cryptic, but hopefully it's helpful.

What to turn in

Your submission should be a single .tar.gz or .zip file, containing the following elements:

a writeup that includes:
- a brief description of how you implemented your server -- what operating system, language, and development tools did you take advantage of, and why?
- a brief description of any interesting engineering problems you faced, such as dealing with errors
- measurements showing the latencies of GET and POST operations, as well as the throughput of a large number of back-to-back POST and GET operations, and a short analysis of your measurements (why do you think they are what they are?)
source code for your server, and a precise description of what libraries we would need in order to run your server, and any additional instructions needed to launch it
source code for the test suite that you wrote or appropriated. If you borrowed code from one of your classmates, be sure to credit them.
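
For the measurements requested in the writeup, a small timing helper along the following lines may be enough. It is only a sketch: the measure function and its result keys are hypothetical, and operation stands for any zero-argument callable that performs one GET or POST against your server.

```python
import statistics
import time


def measure(operation, repetitions=100):
    """Run operation() back to back, returning latency and throughput figures."""
    latencies = []
    start = time.perf_counter()
    for _ in range(repetitions):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "throughput_ops_per_s": repetitions / elapsed,
    }
```

Reporting the median alongside the maximum helps separate typical latency from occasional slow requests, which is useful when explaining why the numbers are what they are.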