Project Overview.
|
||||||||||||||||||
In this project, you will write an HTTP proxy that passes requests and data between multiple web clients and web servers, concurrently. The HTTP proxy is capable of both relaying HTTP requests and HTTP CONNECT tunneling. You'll point a browser at your proxy, so that it sends all page requests to your code instead of directly to the page's origin server.
|
||||||||||||||||||
Background.
|
||||||||||||||||||
Hypertext Transfer Protocol (HTTP)
|
||||||||||||||||||
The Hypertext Transfer Protocol (HTTP) is the protocol used for communication on this web: it defines how your web browser requests resources from a web server and how the server responds. For simplicity, in this assignment, we will be dealing only with version 1.0 of the HTTP protocol, defined in detail in RFC 1945. You may refer to that RFC while completing this assignment, but our instructions should be self-contained. HTTP communications happen in the form of transactions; a transaction consists of a client sending a request to a server and then reading the response. Request and response messages share a common basic format:
The initial line and header lines are each followed by a "carriage-return line-feed" (\r\n) signifying the end-of-line. For most common HTTP transactions, the protocol boils down to a relatively simple series of steps (important sections of RFC 1945 are in parenthesis):
It's fairly easy to see this process in action without using a web browser. From a Linux prompt, type: telnet www.washington.edu 80This opens a TCP connection to the server at www.google.com listening on port 80 (the default HTTP port). You should see something like this: |
||||||||||||||||||
Trying 128.95.155.198... Connected to www.washington.edu (128.95.155.198). Escape character is '^]'. |
||||||||||||||||||
type the following: GET http://www.washington.edu/ HTTP/1.0and hit enter twice. You should see something like the following: |
||||||||||||||||||
HTTP/1.1 200 OK Date: Mon, 19 Oct 2015 22:30:09 GMT Server: Apache/2.2.24 (Unix) mod_ssl/2.2.24 OpenSSL/1.0.1e-fips PHP/5.6.11 mod_pubcookie/3.3.4a mod_uwa/3.2.1 Last-Modified: Mon, 19 Oct 2015 22:26:09 GMT ETag: "6e7597-a941-5227ca1c6c240" Accept-Ranges: bytes Content-Length: 43329 Connection: close Content-Type: text/html |
||||||||||||||||||
There may be some additional pieces of header information as well- setting cookies, instructions to the browser or proxy on caching behavior, etc. What you are seeing is exactly what your web browser sees when it goes to the washington home page: the HTTP status line, the header fields, and finally the HTTP message body- consisting of the HTML that your browser interprets to create a web page. You may notice here that the server responds with HTTP 1.1 even though you requested 1.0. Some web servers refuse to serve HTTP 1.0 content. |
||||||||||||||||||
HTTP Proxy
|
||||||||||||||||||
HTTP is the protocol used to transfer information between browsers and web servers. HTTP is transmitted using TCP as the transport protocol. An HTTP proxy is a program that can accept and reply to requests that would normally be directed to some web server. Proxies are an example of the use of "interposition" - placing something between two things that communicate using a well-defined interface - as shown in the figure below. Interposition is a generally useful technique. When possible, it allows new functionality to be injected into existing code with little or no modification to that code. For example, an HTTP proxy might be used for monitoring or debugging (by capturing a log of browser requests and server responses), to improve performance by maintaining a cache of web pages, or to enforce some policy about which sites can be accessed. |
||||||||||||||||||
Assignment Details.
|
||||||||||||||||||
Non-CONNECT HTTP requests.
|
||||||||||||||||||
The requirements for our proxy are very modest: it merely prints out (at least the initial portion of) the first line of each HTTP request it receives from the browser, then fetches the requested page from the origin web server and returns it to the browser. This means that, for the most part, you don't have to know anything about HTTP; you simply read what the browser sends, print out (only) the first line, and pass that and all subsequent lines on to the web server. On the other side, you read everything the web server sends and pass it back to the browser. While that's the basic operation, there are two details that require a bit of processing of the HTTP stream. To make what follows more concrete, here's an example of what Firefox sent when I requested the page www.my.example.page.com. (I obtained this by running nc -l 46103 to set it listening for TCP connections on port 46103, and then configuring Firefox to use a proxy located at localhost:46103.) |
||||||||||||||||||
$ nc -l 46103 GET http://www.my.example.page.com/ HTTP/1.1 Host: www.my.example.page.com User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate Connection: keep-alive |
||||||||||||||||||
Determining the web server's address
When the browser sends an HTTP request to your proxy, you need to forward it on to the origin web server. You determine the web server by recognizing the Host line in the HTTP header. In the example above, the host is www.my.example.page.com. You should be insensitive to the case of the keyword Host, and you should be tolerant of white space anywhere it might plausibly appear. In general, the host name may be given as hostname:port or ip:port. If no port is specified, you should look for one in the URI given on the request line (the first line of the header). If there is no port there either, you should use 80 if the transport on the request line is either missing or is (case-insensitive) 'http://' and 443 if the transport is 'https://'. The HTTP specification says that lines of the header are terminated by CRLF: |
||||||||||||||||||
CR = <US-ASCII CR, carriage return (13)> LF = <US-ASCII LF, linefeed (10)> |
||||||||||||||||||
You should be lenient in interpreting this, though. For instance, you might see headers where the lines are terminated by a single LF. HTTP does not require any particular ordering for the lines of the header, except that the request line (which is always of the general form shown in the example above) must be first. The HTTP 1.1 specification requires that a Host line be provided in an HTTP request (but not in a reply). Your code does not have to work with HTTP 1.0, which doesn't require these lines. (But, I'd guess you'd have a hard time finding a browser that wanted to speak HTTP 1.0 in any case.) Turning off keep-aliveThe HTTP Connection: keep-alive line can be used to indicate that the browser (or server) wants to keep the TCP connection open even after the current HTTP request has been fully satisfied. This is a performance optimization: if the browser issues additional requests to the same server within a short time, the overhead of closing the current TCP connection and opening a new one is avoided. Supporting keep-alive greatly complicates the proxy, because it needs to do enough HTTP parsing to understand where one HTTP request ends and the subsequent one begins (and similarly for responses coming from the server). HTTP doesn't have a simple framing mechanism for marking these boundaries. To avoid that, you should filter the request and response streams, removing any Connection: keep-alive, inserting a Connection: close, and converting any Proxy-connection: keep-alive to Proxy-connection: close. That should cause the browser and web server to close the TCP connection after each request. Each HTTP request now starts with the creation of a new TCP connection and ends with TCP close, making things simpler for the proxy. The Tranformed Request HeaderThe final change we make is to lower the HTTP version of the request to HTTP 1.0. This is probably unnecessary, but the more discouragement to using persistent connections we can provide the better. With that change, the header sent on to www.my.example.page.co is this: |
||||||||||||||||||
$ nc -l 46103 GET http://www.my.example.page.com/ HTTP/1.0 Host: www.my.example.page.com User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate Connection: close |
||||||||||||||||||
HTTP CONNECT Tunneling.
|
||||||||||||||||||
The HTTP request method CONNECT is used to establish a two-hop TCP connection between the client and some server. HTTP is used only to convey the CONNECT request between the client and the proxy, and to convey a success/failure response from the proxy back to the client. When the proxy receives the request, it determines the destination server (using the technique described above) and tries to open a TCP connection to it. If it succeeds, it returns an HTTP 200 OK response to the client. If the proxy fails to resolve the server address or connect to the server, it sends an HTTP 502 Bad Gateway response to the client and closes the connection. At this point, nothing has yet been sent to the server, all that's happened is that a TCP connection has been established with it (in the success case). None of the HTTP request headers are ever sent to the server. Instead, the proxy simply forwards to the server any bytes it receives after the request header on its connection with the client, and forwards to the client any bytes it receives on its connection with the server. The client may send anything at all it wants on that TCP connection - it could be HTTP messages, or it could be something else completely. HTTPS uses this technique to allow TLS to negotiate a session key between the client and the server. The proxy is simply a conduit for a binary data exchange, and the client and server exchange the same messages over the tunnel as they would over a direct TCP connection with each other. |
||||||||||||||||||
Solution Restrictions
|
||||||||||||||||||
As usual, our goal isn't to have dozens of HTTP proxy implementations,
but rather to provide you with a reasonably specific development
experience. For that reason, your implementation must conform
to these restrictions:
|
||||||||||||||||||
Testing.
|
||||||||||||||||||
We show here some sample output. The output is basically a trace of the HTTP request methods and URIs issued by the browser when fetching some page. It is very likely that two requests for the same page will result in different request streams. For one thing, the order of the requests is somewhat random. For another, the components of the page, and so the things fetched, can vary from one page fetch to another. On the other hand, some things must appear in each request trace, for instance, the request for the page itself. The result of this is that it's hard to say exactly what part of these traces your output must include. The output follows a format that your code also must follow: each HTTP request line output must be preceded by '>>> ' (and your code should print that only for such output, except in the odd case that you're printing some data and the data includes it). Note the trailing space after the '>>>' characters, before the HTTP request line starts. You must print at least the HTTP method and URI given on the request line, but you can also print the entire request line (which additionally includes the HTTP version) if that's easier. The sample output prints only the method and URI. Finally, you may print anything you want before the '>>> ' tag, and you may print any additional lines you want so long as they don't contain the '>>> ' tag. For example, you may want to print error messages, or even debugging information. If you get CONNECT instead of GET that's fine and expected.
|
||||||||||||||||||
Configuring Firefox
|
||||||||||||||||||
To use the proxy we must configure the browser to send all its requests to the proxy, instead of directly to the web servers. In Firefox you do this using Preferences, then the Advanced icon, then the Network tab, then the Settings button for "Connection." Configure the proxy manually, giving the host and port your proxy is running on. You should, eventually, allow all types of traffic to pass through your proxy (although we care only about HTTP and HTTPS (SSL)), so you should check the "Use this proxy server for all protocols" check box. It might be easier during initial test to leave it unchecked, though, in which case your proxy will see only http:// requests. | ||||||||||||||||||
Turnin.
|
||||||||||||||||||
run Script
|
||||||||||||||||||
To help us test your code, provide a run script that will build and invoke your proxy. The script should take a single argument: the port number the proxy's server socket should bind to, for example: $ ./run 1234 Execution is terminated by end-of-file on stdin or by ctrl-C. |
||||||||||||||||||
Files
|
||||||||||||||||||
When you're ready to turn in your assignment, do the following:
|