CSE461 Project 2: TCP/HTTP Proxy

Out: Friday April 18, 2014
Due: Tuesday April 29, 2014 by 11:59pm.
Teams Allowed: Yes
Teams Encouraged: Yes
Ideal Team Size: 2

Summary

There are two parts to this project, both involving TCP. Unlike UDP, TCP is a connection-based protocol. There are explicit steps to create the connection and to tear it down. The first part of this project asks you to experimentally determine how to manage TCP connections, in particular, how to detect when something "has gone wrong" with the other end. The answers can depend on the operating system you're running on, the language you're using, and even the properties of the remote system you're talking with. Knowing that will help us write robust code.

In the second part of the project you'll write a simple proxy that will sit between a web browser and the websites it contacts. The proxy is something between a TCP proxy and an HTTP proxy -- it is more than a TCP proxy in that it understands a bit about the HTTP protocol, but is less than an HTTP proxy in the sense that it doesn't fully understand the HTTP protocol.

You'll hand in code for the HTTP proxy component and a short writeup for the TCP behavior component of this project.

Part One Background: TCP
If you do a web search for "tcp socket yourlanguage" you'll probably find dozens of pages that all contain essentially the same information. These pages are useful, and you should use them, as they give you the details for your language. In addition, Section 6.1.3 of the textbook talks about stream (TCP) sockets, and Section 6.1.4 gives example code (in C).

When you read about TCP, remember that there is a server side and a client side. The server must come up first. It creates a special server socket, binds it to an IP and port, and waits for clients to connect. Clients create sockets and connect to the server. Each time a client connects to the server's server socket, a new (regular) socket is created at the server. That socket is the server's end of the connection. This simplifies programming the server, as each client connection is represented by a socket.
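
For concreteness, here is a minimal sketch of that pattern in Python. (The project doesn't mandate a language, so treat this as illustrative; the address and port are arbitrary, and the client runs in a thread only so the example is self-contained.) It handles exactly one client:

import socket
import threading

HOST, PORT = "127.0.0.1", 46103   # arbitrary values for local testing

def client():
    # Client side: create a socket and connect to the server.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as c:
        c.connect((HOST, PORT))
        c.sendall(b"hello\n")

# Server side: the server socket only accepts connections; each accept()
# returns a new, regular socket that is the server's end of that connection.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen()
    threading.Thread(target=client, daemon=True).start()
    conn, addr = server.accept()
    with conn:
        print("client", addr, "sent", conn.recv(1024))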

Unlike the code in the text (and almost certainly the code you find by casual searches on the web), your code needs to be able to handle multiple, concurrent clients, and to deal with possible client bugs. Using threads in much the same way that you have in previous projects can address the concurrency requirement (as can non-blocking IO or the use of select). Because buggy clients may simply stop communicating, you cannot permanently block waiting for input from them. You can avoid that by using timeouts. As with UDP, setting a timeout when using TCP is an operation on the socket (although select takes a per-invocation timeout argument).
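
One hedged sketch of that structure, again in Python and assuming a bound, listening server socket like the one above: each accepted connection gets its own thread, and settimeout() bounds how long any read can block.

import socket
import threading

def handle_client(conn, addr):
    conn.settimeout(5.0)                # the timeout is an operation on the socket
    try:
        while True:
            try:
                data = conn.recv(4096)
            except socket.timeout:
                print("no data from", addr, "for 5 seconds; giving up")
                break
            if not data:                # empty read: the client closed its end
                break
            # ... handle data ...
    finally:
        conn.close()

def serve_forever(server):
    # server is a bound, listening server socket, as in the previous sketch
    while True:
        conn, addr = server.accept()
        threading.Thread(target=handle_client, args=(conn, addr),
                         daemon=True).start()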

Part One Task: Experimental Evaluation of TCP Behavior
Let's imagine you have a typical application that uses TCP. You have a thread that sits in a loop trying to read from the socket. It may time out periodically, but most of the time it's blocked on the read, waiting for input to arrive. You also have another thread that performs most of the application logic. From time to time it writes to the socket, to send data to the other end.

Your goal is to determine experimentally what your code sees when the other end manipulates the TCP connection. You're interested in this because you'd like to write code that always works, and to do that you need to know what can happen when you call, say, read(). What can happen can be quite system- and language-dependent. Further, it's usually poorly documented, and what actually happens may not be quite what the specification promises. Experimentation provides the most definitive answers we can hope to have.

Let's consider what can happen to the remote end of a TCP connection. We'll start with something going seriously wrong. One possibility is that the remote process crashes abruptly, but the remote system stays up and remains reachable over the network. On any reasonable system this should cause any TCP connections it has open to be closed, and so the behavior you see is likely to be the same as a graceful close. Alternatively, the remote system's OS or hardware may experience a sudden failure, with no possibility for the two ends of the TCP connection to communicate about it. Similarly, a network failure may disconnect the two ends, again without the chance for them to exchange any information. In these cases the remote end appears to simply stop talking, as far as TCP is concerned.

A second class of connection state changes consists of those that happen because of an explicit request by the remote end. In general, there are two operations the other end can perform on the connection: close and shutdown. Typically, a socket is closed when no further reading or writing is intended. shutdown is a bit finer grained. It takes an argument that indicates whether the connection is being shut down for reading or for writing, individually. Exactly what close and shutdown do is a mystery, and probably system/implementation dependent. See, for example, the man page for shutdown, the bulk of which is this illuminating line:

The shutdown() function shall cause all or part of a full-duplex connection on the socket associated with the file descriptor socket to be shut down.
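
To make the two operations concrete, here is what the remote end of an experiment might do, sketched in Python (which these examples assume; the address is a hypothetical test server):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 46103))     # hypothetical experiment server

# One possibility: close the connection entirely.
# sock.close()

# The other: shut down one direction (or both) while keeping the socket object.
sock.shutdown(socket.SHUT_WR)          # "I will write no more"; reads still allowed
# sock.shutdown(socket.SHUT_RD)        # "I will read no more"
# sock.shutdown(socket.SHUT_RDWR)      # both directions at once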

Perform experiments that let you determine what your program sees for each operation it can be doing and each operation the other side might do. Your program can be blocked on a read when the remote operation occurs. It can also be performing a write after the remote operation occurs. That write might be the first one after the remote operation occurs, or the second, or the third. (Behaviors of those three could differ.) You might get a return code that indicates success, or one that indicates failure. You might get an exception. You might receive a signal (http://en.wikipedia.org/wiki/Unix_signal, especially SIGPIPE), which might cause your program to simply stop execution without any apparent error having taken place.
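
Here is one possible shape for the observing side of an experiment, again sketched in Python with arbitrary port and message contents. One thread blocks in recv() while the main thread issues several writes after you trigger the remote operation, recording what each call returns or raises. (Python normally surfaces SIGPIPE as a BrokenPipeError rather than killing the process; in C you may need to handle or ignore the signal yourself.)

import socket
import threading
import time

def reader(sock):
    try:
        data = sock.recv(4096)
        print("recv returned", repr(data))    # b'' usually indicates an orderly close
    except OSError as e:
        print("recv raised", e)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", 46103))            # hypothetical remote under test
threading.Thread(target=reader, args=(sock,), daemon=True).start()

input("trigger the remote operation, then press Enter...")
for i in range(1, 4):                         # the 1st, 2nd, and 3rd writes may differ
    time.sleep(1)
    try:
        sock.send(b"probe %d\n" % i)
        print("write", i, "succeeded")
    except OSError as e:
        print("write", i, "raised", e)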

Ultimately, our goal is to be able to write robust code, which means that the code must be prepared to deal with everything that could happen to the TCP connection because of the behavior of the other end of the connection. Create a table, indexed by the actions your program may be taking and the things that could happen at the other end, that indicates how your program can detect each combination, so that it could deal with it without crashing or hanging.

For this part of the project, hand in a short report as the file tcp.pdf if you use the preferred format, PDF, or as tcp.txt if you use the alternative acceptable format, plain text.

Part Two Background: HTTP
HTTP is one of the protocols used to transfer information between browsers and web servers. HTTP is a request-response protocol: the browser makes a request to the server and the server responds. (There are no server-to-browser requests.) An HTTP message consists of an HTTP header followed, optionally, by message contents. The request messages sent by browsers often have an empty content section; the reply messages from the web servers often have content that is the web page.

Here is an example HTTP request, generated by asking Firefox to fetch http://www.my.example.page.com. Because I had set Firefox to proxy through localhost:46103 (proxying is discussed shortly), it directed its request there, and I captured what it sent using nc:

$ nc -l 46103
GET http://www.my.example.page.com/ HTTP/1.1
Host: www.my.example.page.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

HTTP is transmitted using TCP. Firefox created a TCP connection to localhost:46103 as the first step in sending this request. The header begins with the first character sent over the connection and ends with a null line. That is, there is a line after the one starting Connection: above, but it is of length 0. The header itself is a sequence of lines of text, encoded in ASCII. According to the HTTP specification, each line should be terminated by CRLF:
       CR             = <US-ASCII CR, carriage return (13)>
       LF             = <US-ASCII LF, linefeed (10)>
As with everything about HTTP, though, you should be lenient in interpreting this. For instance, you might see headers whose lines are terminated by a single LF.
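
As an illustration, here is a hedged Python sketch of reading a header leniently: accumulate raw bytes until the first empty line, accepting either CRLF or bare LF as the line terminator. (read_http_header is our own name, not a library call.)

def read_http_header(sock):
    # Returns (header lines as bytes, any bytes already read past the header).
    buf = b""
    while True:
        end = buf.find(b"\r\n\r\n")           # the header ends at the first empty line
        if end != -1:
            header, rest = buf[:end], buf[end + 4:]
            break
        end = buf.find(b"\n\n")               # be lenient: accept bare LF too
        if end != -1:
            header, rest = buf[:end], buf[end + 2:]
            break
        chunk = sock.recv(4096)
        if not chunk:                         # connection closed mid-header
            header, rest = buf, b""
            break
        buf += chunk
    lines = [line.rstrip(b"\r") for line in header.split(b"\n")]
    return lines, rest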

Except for the first line of the HTTP header, all header lines consist of a tag (e.g., Host), a colon (':'), and a string. The HTTP specification defines some tags and their meanings, but it is allowed (and normal) to send tags not mentioned by the specification. The only tags we will recognize are Host and Connection, as explained shortly.

Data follows the HTTP header. There is enormous flexibility in how data can be encoded, and because of that we won't attempt to understand anything about the encoding of the data, but will simply read it and pass it on. Unlike the header, the data may consist of binary data, so you have to be careful to avoid using language/library functions that assume you're reading text, as those may modify the data in an attempt to help you.
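
For example, in Python you might relay the contents as raw bytes and never decode them. (relay_body is our own name; the buffer size is arbitrary.)

def relay_body(src, dst, already_buffered=b""):
    if already_buffered:
        dst.sendall(already_buffered)   # bytes read past the end of the header
    while True:
        chunk = src.recv(4096)          # bytes in, bytes out; no text decoding
        if not chunk:                   # the source closed its end
            break
        dst.sendall(chunk)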

Finally, it's important to realize that browsers and web servers both want their code to work, so it can be more important to them to deal with what the other end actually does than to be faithful to the specification.

Part Two Task: HTTP Proxy

An HTTP proxy is a program that can accept and reply to requests made using HTTP. Often these are page fetch requests that would normally be directed to some web server. In that case the proxy might query the web server itself, so that the browser's request simply passes through the proxy on the way to the server, and the response on the way back to the browser.

Proxies are an example of the use of "interposition" - placing something between two things that communicate using a well-defined interface -- as shown in the figure below. Interposition is a generally useful technique. When possible, it allows new functionality to be injected into existing code with little or no modification to that code. For example, an HTTP proxy might be used for monitoring or debugging (by capturing a log of browser requests and server responses), to improve performance by maintaining a cache of web pages, or to enforce some policy about which sites can be accessed.

The figure above shows that the proxy can be used concurrently by more than one browser, and that a single browser may concurrently fetch multiple pages. While this may appear to be complicated, if you build the proxy to allow concurrent connections (which you must), you've built all the capabilities shown. The figure also shows that the browser may use protocols other than HTTP to talk with servers. Those connections do not pass through the HTTP proxy.

The requirements for our proxy are very modest: it merely prints out the first line of each HTTP request it receives from the browser, then fetches the requested page from the sourcing web server and returns it to the browser. This means that we won't even try to fully understand HTTP. It's important to remember that if the proxy simply reads what the browser sends and immediately forwards it to the web server, that must essentially work, since that's what the browser would do if it weren't using the proxy. The same is true in the opposite direction. There are some small issues to contend with -- determining the web server's address and turning off keep-alive -- but simply passing data through is the main idea of our proxy.

In the next project we build Tor61, a system that carries messages from a client to a server in a way that makes it difficult to determine which servers a client is talking with and which clients are contacting any particular server. To generate traffic for that system we'll use browsers to fetch web pages. You'll transform the proxy code you're writing now into protocol translators - from HTTP into the Tor61 protocol, and from Tor61 back into HTTP. So, as always, good design and modular implementation will eventually be rewarded.

Determining the web server's address

When the browser sends an HTTP request to your proxy, you need to forward it on to the appropriate web server. You determine which web server by recognizing the Host line in the HTTP header. In the example above, the host is www.my.example.page.com. You should be insensitive to the case of the keyword Host, and you should be tolerant of white space anywhere it might plausibly appear. In general, the host name may be given as hostname:port or ip:port. If no port is specified, you should use 80, the default for web servers/HTTP.

HTTP does not require any particular ordering for the lines of the header, except that the request line (which is always of the general form shown in the example above) must be first. Your proxy doesn't have to handle a request if there is no Host line in the header. The HTTP 1.1 specification requires that a Host line be provided, so it's unlikely you'll see requests without them. The HTTP specification may or may not prohibit putting multiple Host lines in the header. There's no reason for our proxy not to allow that, so we do, taking the last one seen as definitive.
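
A sketch of that logic in Python, operating on the list of header lines gathered by the read_http_header sketch above (find_destination is our own name):

def find_destination(header_lines):
    host, port = None, 80
    for line in header_lines[1:]:              # skip the request line
        tag, _, value = line.partition(b":")
        if tag.strip().lower() != b"host":     # match the tag case-insensitively
            continue
        value = value.strip()
        if b":" in value:                      # hostname:port or ip:port
            hostpart, _, portpart = value.rpartition(b":")
            host, port = hostpart.strip(), int(portpart)
        else:
            host, port = value, 80             # default web server port
    return host, port                          # last Host line wins; host is None if there was none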

Turning off keep-alive

The HTTP Connection: keep-alive line can be used to indicate that the browser (or server) wants to keep the TCP connection open even after the current HTTP request has been fully satisfied. This is a performance optimization: if the browser issues additional requests to the same server within a short time, the overhead of closing the current TCP connection and opening a new one is avoided.

Supporting keep-alive greatly complicates the proxy, because it needs to do enough HTTP parsing to understand where one HTTP request ends and the subsequent one begins. HTTP doesn't have a simple framing mechanism for marking these boundaries, so trying to find them greatly increases the size of the project. Because we're unwilling to do the work required to find HTTP request boundaries, we instead prohibit the use of keep-alive.

Because we're a proxy, prohibiting keep-alive is easy: we simply filter the request and response streams, turning Connection: keep-alive into Connection: close. That causes the browser and web server to close the TCP connection after each request. Each HTTP request now starts with the creation of a new TCP connection and ends with TCP close, making things simple for the proxy.
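
In Python, that filtering step might look like the following. (force_connection_close is our own name; it rewrites any Connection line, which covers keep-alive.)

def force_connection_close(header_lines):
    out = []
    for line in header_lines:
        tag, sep, _ = line.partition(b":")
        if sep and tag.strip().lower() == b"connection":
            out.append(b"Connection: close")   # transform keep-alive into close
        else:
            out.append(line)                   # everything else passes through unchanged
    return out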

Acting as a Filter

The figure below provides a rough suggestion for how the proxy might handle a single browser to web server connection. You don't have to build it this way, but the figure might be useful in understanding what to do in any case. The upper state machine handles data flowing from the browser to the web server; the lower one handles data flowing from the web server back to the browser. Both state machines operate in two phases, one in which the HTTP header is processed (state Hdr) and one in which any HTTP message contents are processed (state Ctnt).

Labels on the transitions indicate a pattern that a line read from the browser or server matches and then an action to take on that line. The transition corresponding to the most specific match is taken. For instance, the label "Connection:... / xform & buffer" means that if a line is read that starts with Connection:, you should transform that line (by setting the tag value to close) and then add the transformed line to the buffer you use to accumulate the header. The label "* / buffer" means that if a line is read and doesn't match any other, more specific rule, to add that line to the buffer.

You'll notice that, for the most part, the proxy is in a loop reading data and then writing data. The browser-to-server direction does do some limited buffering, but the overall flavor of the proxy is that it is a filter, reading a limited window of data, possibly modifying it, and then passing it on. The filter is more efficient in both time and space than the alternative of reading all of the browser's request, say, before sending any of it on. In terms of time, the filter potentially overlaps reading and writing, leading to shorter elapsed time to send the entire request. More importantly, the filter requires an amount of memory that is independent of the total amount of data to be transferred, since the proxy keeps only a window's worth of data at a time.

The browser to server state machine must buffer (at least part of) the header because it can't create the TCP connection until it finds the Host line, which could be anywhere. It explicitly allows there to be more than one Host or Connection tag line; it's not clear whether or not HTTP allows that. The server-to-browser state machine operates similarly, except that there is no need to buffer the header because the connection on which it should be sent already exists when that state machine starts operating.

One thing you'll have to decide is how to recognize that the HTTP exchange is over, in each direction. We specify the rule "?/?" in the state machines to indicate this transition. Your work on Part One of this project should help. You should eventually close all connections involved in proxying a request, but of course you need to be careful not to close them too soon.
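
To show how the pieces might fit together, here is a hedged Python sketch of handling one browser connection, built from the hypothetical helpers sketched earlier (read_http_header, find_destination, force_connection_close, relay_body). It treats a recv() that returns b'' as the end of a direction and, for brevity, does not filter the response header the way the lower state machine in the figure does:

import socket
import threading

def handle_browser(browser_sock):
    lines, leftover = read_http_header(browser_sock)    # buffer the request header
    print(lines[0].decode("ascii", "replace"))           # first line of the request
    host, port = find_destination(lines)
    if host is None:                                     # no Host line: give up
        browser_sock.close()
        return
    lines = force_connection_close(lines)

    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.connect((host.decode("ascii"), port))
    server_sock.sendall(b"\r\n".join(lines) + b"\r\n\r\n")

    # Forward any remaining request data in the background; relay the response
    # here. Because keep-alive is off, the server closing its end ends the exchange.
    threading.Thread(target=relay_body,
                     args=(browser_sock, server_sock, leftover),
                     daemon=True).start()
    relay_body(server_sock, browser_sock)
    server_sock.close()
    browser_sock.close()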

Additional Details

Configuring Your Browser

To use the proxy we must configure the browser to send all its requests to the proxy, instead of directly to the web servers. In Firefox you do this using Edit->Preferences or Options->Options, then the Advanced icon, then the Network tab, then the Settings button for "Connection." Select Manual Proxy Configuration and configure as an HTTP proxy. (That will allow the browser to make direct connections to web servers for protocols our proxy doesn't handle, like SSL and FTP.) You'll need to supply the host name or IP address where the proxy is running and the port it is listening on. Make sure you do not check the box that says "Use this proxy server for all protocols."

run Script

To help us test your code, provide a run script that will build and invoke your proxy. The script should take a single argument: the port number the proxy's server socket should bind to. The proxy's output should be the first line of each HTTP request it receives from the browser. Execution is terminated by Ctrl-C.

Sample Output

Most interesting web pages these days are complicated enough that the set of HTTP requests issued to fetch them will differ from fetch to fetch. We show here some sample output to give you a general sense of what your output should look like, though.

CSE Winter Quarter 2014 Time Schedule Sample Output
www.whitehouse.gov Sample Output
www.cnn.com Sample Output
(Truncated to the first approximately 40 seconds.)
Turn in
When you're ready to turn in your assignment, do the following:
  1. The files you submit should be placed in a directory named proj2. There should be no other files in that directory.
  2. Create a README.TXT file that contains the names, student numbers, and UW email addresses of the member(s) of your team.
  3. Put the README.TXT file, your HTTP proxy solution source code, your run script, and your tcp.pdf or tcp.txt report file in the proj2 directory.
  4. While in the directory that is the parent of proj2/, issue the command tar czf proj2.tar.gz proj2.
  5. Verify that the tar file contains the files you intend to submit: tar tf proj2.tar.gz.
  6. You may also want to check that your submission is packaged correctly with the project packaging check script (instructions here).
  7. Submit the proj2.tar.gz file to the course dropbox.