CSE 333 Homework 4

Out:   Friday, August 4, 2023
Due:   Wednesday, August 16, 2023 by 11:59 pm PDT

Goals

In this assignment, you will build on top of your Homework 3 implementation to complete a multithreaded web server front-end to your query processor.

  1. In Part A, you will read through some of our code to learn about the infrastructure we have built for you.
  2. In Part B, you will complete some of our classes and routines to finish the implementation of a simple web server.
  3. In Part C, you will fix some security problems in our web server.
  4. While you do these parts, you are expected to fill in a bug journal which can be found in bug_journal.md.

As before, please read through this entire document before beginning the assignment, and please start early!

Multithreaded Web Server

General Implementation Notes

  • You may not modify the Makefile distributed with the project. In particular, there are reasonable ways to do the necessary string handling without using the Boost Regex library.
  • As before, you may not modify any of the existing header files or class definitions distributed with the code. If you wish to add extra "helper" functions, you can do that by including additional static functions in the implementation (.cc) files.
  • As before, you don't need to worry about propagating errors back to callers in all situations. You will use Verify333()'s to spot errors and cause your program to crash out if they occur. However, no matter what a client does, or what input the web server reads, your web server must handle that; only internal issues (such as out of memory) should cause your web server to crash out.

Suggested Work Schedule

To help you schedule your time, here's a suggested order for the parts of this assignment. We're not going to enforce a schedule; it's up to you to manage your time.

  • Read over the project specifications and understand which code is responsible for what.

  • Finish ServerSocket.cc. Make sure to cover all functionality, not just what is in the unit tests.

  • Implement FileReader.cc, which should be very easy, and GetNextRequest() in HttpConnection.cc.

  • Complete ParseRequest() in HttpConnection.cc. This can be tricky, as it involves both Boost and string parsing.

  • Finish the code for http333d.cc. Implement HttpServer_ThrFn() in HttpServer.cc.

  • Complete ProcessFileRequest() and ProcessQueryRequest() in HttpServer.cc. At this point, you should be able to search the "333gle" site and view the webpages available under /static/, e.g. http://localhost:5555/static/bikeapalooza_2011/index.html.

  • Fix the security issues with the website, if you have any.

  • Make sure everything works as it is supposed to.

Part A: Read Through Our Code

Our web server is a fairly straightforward multithreaded application. Every time a client connects to the server, the server dispatches a thread to handle all interactions with that client. Threads do not interact with each other at all, which greatly simplifies the design of the server.

The figure to the right shows the high-level architecture of the server. There is a main class called HttpServer that uses a ServerSocket class to create a listening socket, and then sits in a loop waiting to accept new connections from clients. For each new connection that the HttpServer receives, it dispatches a thread from a ThreadPool class to handle the connection. The dispatched thread springs to life in a function called HttpServer_ThrFn() within the HttpServer.cc file.


The HttpServer_ThrFn() function handles reading requests from one client. For each request that the client sends, the HttpServer_ThrFn() invokes GetNextRequest() on the HttpConnection object to read in the next request and parse it.

To read a request, the GetNextRequest() method invokes WrappedRead() some number of times until it spots the end of the request. To parse a request, the method invokes the ParseRequest() method (also within HttpConnection). At this point, HttpServer_ThrFn() has a fully parsed HttpRequest object (defined in HttpRequest.h).
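
As a rough illustration of the read-then-parse flow described above, here is a minimal sketch that uses only the standard library. It assumes that a request header ends with a blank line ("\r\n\r\n"), as in HTTP/1.1. The names ParsedRequestSketch, RequestLengthSketch, and ParseRequestSketch are hypothetical and are not part of the provided code, and lower-casing header names is a choice made for this sketch. The real HttpRequest class and the comments in HttpConnection.cc define what your implementation must actually do.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical holder for a parsed request; the real HttpRequest class
// (HttpRequest.h) has its own interface.
struct ParsedRequestSketch {
  std::string method;
  std::string uri;
  std::map<std::string, std::string> headers;  // names are lower-cased
};

// Returns the number of bytes in the first complete request header in
// `buffer`, including the terminating "\r\n\r\n", or 0 if the terminator
// has not arrived yet and more data must be read.
std::size_t RequestLengthSketch(const std::string& buffer) {
  static const std::string kEnd = "\r\n\r\n";
  std::size_t pos = buffer.find(kEnd);
  return pos == std::string::npos ? 0 : pos + kEnd.size();
}

// Parses "METHOD URI VERSION\r\n" followed by "name: value\r\n" lines.
// Returns false on a malformed request line instead of crashing, since
// bad client input must never take the server down.
bool ParseRequestSketch(const std::string& request,
                        ParsedRequestSketch* out) {
  assert(out != nullptr);  // internal invariant, not client input
  std::istringstream lines(request);
  std::string line;
  if (!std::getline(lines, line)) return false;
  if (!line.empty() && line.back() == '\r') line.pop_back();

  std::istringstream first(line);
  std::string version;
  if (!(first >> out->method >> out->uri >> version)) return false;

  while (std::getline(lines, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line.empty()) break;  // a blank line ends the headers
    std::size_t colon = line.find(':');
    if (colon == std::string::npos) continue;  // skip a malformed header
    std::string name = line.substr(0, colon);
    for (char& c : name) {
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    std::size_t start = line.find_first_not_of(" \t", colon + 1);
    out->headers[name] =
        (start == std::string::npos) ? "" : line.substr(start);
  }
  return true;
}
```

Any bytes in the buffer past the length returned by RequestLengthSketch() belong to the next request, so a reader built this way would keep them around rather than discard them.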


The next job of HttpServer_ThrFn() is to process the request. To do this, it invokes the ProcessRequest() function, which looks at the request URI to determine if this is a request for a static file, or if it is a request associated with the search functionality. Depending on what it discovers, it either invokes ProcessFileRequest() or ProcessQueryRequest().
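
The URI-based dispatch described above can be pictured with a small sketch. The "/static/" prefix comes from the example URLs in this document; treating every other URI as a search request is a simplification made here, and RequestKindSketch, ClassifyUriSketch, and StaticPathSketch are hypothetical names rather than part of HttpServer.cc.

```cpp
#include <cassert>
#include <string>

// Prefix that marks a request for a static file (taken from the example
// URLs in this document).
const std::string kStaticPrefixSketch = "/static/";

enum class RequestKindSketch { kStaticFile, kQuery };

// Classifies a request URI by its prefix. Anything that does not start
// with "/static/" is treated as a query in this sketch.
RequestKindSketch ClassifyUriSketch(const std::string& uri) {
  if (uri.compare(0, kStaticPrefixSketch.size(), kStaticPrefixSketch) == 0) {
    return RequestKindSketch::kStaticFile;
  }
  return RequestKindSketch::kQuery;
}

// Returns the part of a static URI that follows the prefix, e.g.
// "a/b.html" for "/static/a/b.html". The caller must classify first.
std::string StaticPathSketch(const std::string& uri) {
  assert(ClassifyUriSketch(uri) == RequestKindSketch::kStaticFile);
  return uri.substr(kStaticPrefixSketch.size());
}
```

Note that the path returned by StaticPathSketch() is still untrusted client input; Part C explains why it must be validated before it is used to open a file.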

Once those functions return an HttpResponse, the HttpServer_ThrFn() invokes the WriteResponse() method on the HttpConnection object to write the response back to the client.
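
To make "the response" concrete, the sketch below shows the general shape of the bytes that end up on the wire for an HTTP/1.1 response: a status line, headers, a blank line, and then the body. BuildResponseSketch is a hypothetical function written for illustration; the provided HttpResponse class has its own interface for this, and the particular headers chosen here are an assumption.

```cpp
#include <cassert>
#include <string>

// Builds a minimal HTTP/1.1 response: status line, two headers, a blank
// line, and the body. Every line before the body ends with "\r\n".
std::string BuildResponseSketch(int code, const std::string& reason,
                                const std::string& content_type,
                                const std::string& body) {
  assert(code >= 100 && code <= 599);  // internal sanity check
  std::string out =
      "HTTP/1.1 " + std::to_string(code) + " " + reason + "\r\n";
  out += "Content-Type: " + content_type + "\r\n";
  out += "Content-Length: " + std::to_string(body.size()) + "\r\n";
  out += "\r\n";  // blank line separates headers from body
  out += body;
  return out;
}
```

The Content-Length header is what lets the client know where this response ends and the next one begins when several requests share one connection.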


Our web server isn't too complicated, but there is a fair amount of plumbing to get set up. In this part of the assignment, we want you to read through a bunch of lower-level code that we've provided for you. You need to understand how this code works to finish our web server implementation, but we won't have you modify this plumbing.

Part A Instructions

  1. Change to the directory containing your CSE333 GitLab repository. Use git pull to retrieve the new hw4/ folder for this assignment. You will need the hw1/, hw2/, hw3/, and projdocs/ directories in the same folder as your new hw4/ folder since hw4 links to files in those previous directories.
  2. Look around inside of hw4/ to familiarize yourself with the structure. Note that there are libhw1/, libhw2/, and libhw3/ directories that contain symlinks to your libhw1.a, libhw2.a, and libhw3.a, respectively. You can replace your libraries with ours (from the appropriate solution_binaries directories) if you prefer.
  3. Next, run make to compile the two HW4 binaries. One of them is the usual unit test binary. Run it, and you'll see the unit tests fail and crash out; you won't yet earn the automated grading points tallied by the test suite.
  4. The second binary is the web server itself: http333d. Its usage message will reveal its command-line arguments; an example call looks like:
    bash$ ./http333d 5555 ../projdocs unit_test_indices/*
    In the meantime, start up a working web server using the provided solution binary:
    bash$ ./solution_binaries/http333d 5555 ../projdocs unit_test_indices/*
  5. Use a web browser to explore what the server should look like when it's finished:

    Enter a few search queries in the first tab and then click around the Bikeapalooza gallery in the second tab; this is what your finished web server will be capable of!

  6. When you are done with the http333d server, the most graceful way to shut it down is to open another terminal window on the same machine, run the command:
    bash$ ps -u
    to find the process id (pid) of the web server, and then run:
    bash$ kill pid
  7. Read through ThreadPool.h and ThreadPool.cc. You don't need to implement anything in either, but several pieces of the project rely on this code. The header file is well-documented, so it ought to be clear how it's used. There's also a unit test file that you can peek at.
  8. Read through HttpUtils.h and HttpUtils.cc. These files define a number of utility functions that the rest of HW4 uses. You will have to implement some of these utilities in order to pass the tests in test_suite. Make sure that you understand what each of the utilities does, and why we may want them.
  9. Finally, read through HttpRequest.h and HttpResponse.h. These files define the HttpRequest and HttpResponse classes, which represent a parsed HTTP request and response, respectively.

Part B: Basic Web Server

You are now going to finish a basic implementation of the http333d web server. You will need to implement some of the event handling routines at different layers of abstraction in the web server, culminating with generating HTTP and HTML to send to the client.

Part B Instructions

  1. Take a look at ServerSocket.h. This file contains a helpful class for creating a server-side listening socket, and accepting a new connection from a client. We've provided you with the class declaration in ServerSocket.h but no implementation in ServerSocket.cc; your next job is to build it. You'll need to make the code handle both IPv4 and IPv6 clients. Run the test suite to see if you make it past the ServerSocket unittests.
  2. Read through FileReader.h and FileReader.cc. Note that the implementation of FileReader.cc is missing; go ahead and implement it. See if you make it past the FileReader unittests.
  3. Read through HttpConnection.h and HttpConnection.cc. The two major functions in HttpConnection.cc have their implementations missing, but have generous comments for you to follow. Implement the missing functions, and see if you make it past the HttpConnection unittests.
  4. Read through HttpUtils.h and HttpUtils.cc. There are two functions in HttpUtils.cc that have their implementations missing, but have generous comments to help you figure out their implementation. Implement the missing functions, and see if you make it past the HttpUtils unittests.
  5. Read through HttpServer.cc, HttpServer.h, and http333d.cc. Note that some parts of HttpServer.cc and http333d.cc are missing. Go ahead and implement those missing functions. The only requirement here is that your web server mimics the behavior of the solution binaries (i.e., has a search bar, processes files and queries correctly, and shows their results similarly); although entirely optional, you are free to modify the look of your 333gle site:
    • If you just want to get the same "look and feel" of our server, you can use the solution binary and then view source to see the HTML to emulate.
    • In the past, some students implemented 333gle in "dark mode", had a Shrek theme, etc.
    • If you want to add more features that are more complex than altering appearance, check out the Bonus below.

    Once you have the functions implemented, test your http333d binary to see if it works by running the web server and connecting to it from a browser (as described in Part A Step 5 above), exercising both the web search and static file serving functionalities.

  6. At this point, your web server should run correctly, and everything should compile with no warnings. Try running the test_suite under valgrind to make sure there are no memory issues. Finally, launch the web server under valgrind to make sure there are no issues or leaks: after the web server has launched, exercise it by issuing a few queries, then kill the web server.
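
Step 1 above asks ServerSocket to handle both IPv4 and IPv6 clients. One place where the two address families differ is in turning an accepted client's address into printable text. The following is a minimal sketch of that one piece, assuming a POSIX system; AddrToStringSketch is a hypothetical helper, not a function declared in ServerSocket.h.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cassert>
#include <string>

// Converts a client address of either family to printable text.
// Returns an empty string for an unsupported family or on failure.
std::string AddrToStringSketch(const struct sockaddr* addr) {
  assert(addr != nullptr);      // internal invariant
  char buf[INET6_ADDRSTRLEN];   // large enough for both families
  if (addr->sa_family == AF_INET) {
    const auto* v4 = reinterpret_cast<const struct sockaddr_in*>(addr);
    if (inet_ntop(AF_INET, &v4->sin_addr, buf, sizeof(buf)) == nullptr) {
      return "";
    }
    return buf;
  }
  if (addr->sa_family == AF_INET6) {
    const auto* v6 = reinterpret_cast<const struct sockaddr_in6*>(addr);
    if (inet_ntop(AF_INET6, &v6->sin6_addr, buf, sizeof(buf)) == nullptr) {
      return "";
    }
    return buf;
  }
  return "";
}
```

Storing the result of accept() in a struct sockaddr_storage and passing its address to a helper like this keeps the calling code identical for both families.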

Part C: Fix Security Vulnerabilities

Now that the basic web server works, you will discover that your web server (probably) has two security vulnerabilities. We are going to point these out to you, and you will repair them. Of course, it IS possible that the way you implemented things above means you have already dealt with these flaws.

Part C Instructions

Fix the following two security flaws, if currently found in your server. As a point of reference, we've provided a version of our web server that has both of these flaws in place (solution_binaries/http333d_withflaws). Feel free to try it out, but DO NOT leave this server running, as it will potentially expose all of your files to anybody that connects to it.

  1. The first is called a "cross-site scripting" flaw. Using Firefox or Safari (Chrome prevents this attack), try typing the following query into both your web server and the solution binary web server and compare the behavior of the two:
    hello <script>alert("Boo!");</script>
    To fix this flaw, you need to "escape" untrusted input from the client before you relay it to output.
  2. Use nc to connect to your web server and manually send a request for the following URI. Make sure to provide either the -C or -c argument to support the usage of carriage returns. Check man nc to see which argument your system needs. If you are working on attu, nc -C should work. (Note: browsers are smart enough to help defend against this attack, so you can't just type it into a browser URL bar, but nothing prevents attackers from directly connecting to your server with a program of their own!)
    /static/../hw4/http333d.cc
    This is called a directory traversal attack. Instead of trusting the file pathname provided by a client, you need to normalize the path and verify that it names a file within your document subdirectory tree (which would be ../projdocs/ if the example command shown in Part A was used to start the server). If the provided path names something outside of that subdirectory, you should return an error message instead of the file contents.
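
The two fixes above can be sketched as follows. This is an illustration under stated assumptions, not the required implementation: EscapeHtmlSketch and IsInsideRootSketch are hypothetical names, the code assumes a C++17 compiler for std::filesystem, and the path check is purely lexical, so it does not resolve symbolic links the way a check against the real file system would. Follow the comments in HttpUtils.cc for what your own functions must do.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Replaces the characters that HTML treats specially with entities, so
// that untrusted text is displayed rather than interpreted as markup.
std::string EscapeHtmlSketch(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    switch (c) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += c;
    }
  }
  return out;
}

// Returns true if `requested`, interpreted relative to `root`, still
// names something inside `root` once "." and ".." are resolved
// lexically. For the URI /static/../hw4/http333d.cc, `requested` would
// be "../hw4/http333d.cc" after the "/static/" prefix is removed.
bool IsInsideRootSketch(const std::string& root,
                        const std::string& requested) {
  namespace fs = std::filesystem;
  assert(!root.empty());  // internal invariant
  fs::path base = fs::path(root).lexically_normal();
  // relative_path() drops any leading "/" so that the request cannot
  // replace the root when the two paths are joined.
  fs::path full =
      (base / fs::path(requested).relative_path()).lexically_normal();
  fs::path rel = full.lexically_relative(base);
  if (rel.empty()) return false;
  // A result that starts with ".." climbs out of the root.
  return rel.begin()->string() != "..";
}
```

With the root ../projdocs, a request for bikeapalooza_2011/index.html passes this check, while ../hw4/http333d.cc is rejected and should produce an error response instead of file contents.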

Bonus

For HW4 bonus grading, create a file readme_bonus.txt in your top-level hw4 directory for summarizing the additions. When you are done adding additional bonus parts and have committed and pushed them to your GitLab repository, tag that commit hw4-bonus. If we find a hw4-bonus tag in your repository, we'll grade the bonus parts; otherwise we'll assume that you just did the required parts.

There are two bonus tasks for this assignment.

  1. Perform a performance analysis of your web server implementation, determining:
    • What throughput your server can handle (measured both in requests per second and bytes per second),
    • What latency clients experience (measured in seconds per request), and
    • What the performance bottleneck is.

    The httperf tool for Linux can generate synthetic load. You should conduct this performance analysis for a few different usage scenarios; e.g., you could vary the size of the web page you request, and see its impact on the number of pages per second your server can deliver. If you choose to do this bonus task, please include a PDF file in your submission containing relevant performance graphs and analysis.

  2. Figure out some interesting feature to add to your web server, and implement it! Here are some example ideas:
    • Find the implementation of a chatbot, such as ELIZA, and add it to your web server.
    • Implement logging functionality; every time your server serves content, write out some record with a timestamp to a log file; make the log file available through the web server itself.
    • Change the results page to show context from matching documents, similar to how Google shows excerpts from matching pages; specifically, make it so that each result in the result list shows:
      x words + <bold>hit word</bold> + y words
      for one or more of the query words that hit.

    If you choose to do this bonus task, describe your added feature(s) and how to use them in readme_bonus.txt. This part of the assignment is deliberately open-ended, with much less structure than earlier parts. The (small) amount of extra credit granted will depend on how interesting your extension is and how well it is implemented.

Testing

As with the previous HWs, you can compile your implementation by using the make command. This will result in several output files, including an executable called test_suite. After compiling your solution with make, you can run all of the tests for the homework by running:

bash$ ./test_suite

You can also run only specific tests by passing command-line arguments into test_suite. This is extremely helpful for debugging specific parts of the assignment, especially since test_suite can be run with these settings through valgrind and gdb! Some examples:

  • To only run the HttpConnection tests, enter:
    bash$ ./test_suite --gtest_filter=Test_HttpConnection.*
  • To run all tests except the ServerSocket tests, enter:
    bash$ ./test_suite --gtest_filter=-Test_ServerSocket.*

You can specify which tests are run for any of the tests in the assignment — you just need to know the names of the tests! You can list them all out by running:

bash$ ./test_suite --gtest_list_tests