CSE 326 Data Structures Project 3

CSE 326 Data Structures
Project 3
High Performance Web Server

Due: Monday, 14 May 2001, 10pm

Objectives

Understanding hash tables and graph algorithms
Thinking about web server design

Overview

In the days of our great-grandparents, cars were bigger, towns were smaller, and the World-Wide-Web was much simpler. It used to be that all the data displayed on a particular web page would be stored on a single server, and hence fetching a page would be a relatively simple operation. Nowadays things are much more complicated, because people have invented sophisticated techniques for optimizing the way web content is delivered over the internet. Now, when you fetch a web page, you may be directed to several servers, depending on your geographic location. For example, the text for the page may come from one server, while the images may come from another, and the advertisements may come from yet a third. Furthermore, each of these pieces of content may be replicated on several different servers. Given a request for a web page, you need to consider two problems: (1) Where should the different data items be stored so that they may be retrieved quickly and efficiently? and (2) How can we find the optimal route across the network for each data item? In this assignment we will be addressing both of these issues.

Representing the WWW and Web requests

For the purposes of this project, we will simulate the entire World-Wide-Web as a graph of nodes. Each node corresponds to a machine on the network, which can either supply data, request data or both. For simplicity, we will assume that the graph is undirected. The edges of the graph are annotated by two numbers: t, the time (in milliseconds) it takes to transmit a piece of data between the two nodes, and c, the cost (in micro-dollars) of transmitting a piece of data between the two nodes. Each node will keep a Dictionary of the data items it stores. We will represent this graph using an adjacency list representation (to be explained in class, or see Pg. 329 in the book). Nodes and data items (which are assumed to all be of the same size) are represented by integers.

A request includes the number from which the request originates, a list of requested data items, and a performance criterion. For example, the request "4 37 898 2 103" indicates that a user at Node 4 requests data items 37, 898, 2 and 103. The performance criterion determines whether the user is concerned about the total time it will take to service the request (but not the cost), or the total cost of servicing the request (but not the time), or both. The program will then output a sequence of numbers for each data item describing the path across the network that the data item could take to reach the requesting node. For example, the output resulting from the above request might look like:

37: 6 0 4 Time = 0.003 s, Cost = $0.000031 898: 4 Time = 0 s, Cost = $0 2: 3 1 7 6 0 4 Time = 0.012 s, Cost = $0.000056 103: 6 0 4 Time = 0.003 s, Cost = $0.000031 Total time = 0.012, Total cost = $0.000118

This indicates that data item 37 was found at Node 6, and that the optimal route to Node 4 was determined to be through Node 0. The time and cost for that route are the sums of the time and cost along the two edges connecting Node 6 to Node 0 and Node 0 to Node 4. The total time is the maximum of all the transit times, and the total cost is the sum of all the costs. Note that data item 898 was already at Node 4, so both the time and the cost to get that item are zero. This output is likely to be different depending on which of the three options (time, cost, or both) the user selects. If the user specifies to minimize cost, then the program should output the lowest cost paths for finding the requested data items. Similarly, if the user specifies to minimize time, then the program should output the lowest time paths for each data item. Later on in the assignment, we will specify what it means to minimize both cost and time.

What you must do

The assignment has several parts. In the first part, you'll create several data structures that will speed up the processing of requests. In the second part you will actually process requests. In the third part, you will process requests that minimize both time and cost.

Part 1

In this part of the assignment, you must process the input data and build data structures to facilitate the handling of requests. First, the program reads the network configuration from an input file. That is, it reads the nodes, the edges connecting them, and the costs on the edges. Then, from a second file, it reads the data contents for the nodes in the network. That is, for every node, it reads the list of data items it contains.

Once the program has read the network configuration and initialized the nodes with data, we need to calculate the optimal paths for requested data items to take, either to minimize time or to minimize cost. We would like to perform this calculation in linear time in the length of the path. In order to do this, we need to preprocess the data and create an index from which we may reconstruct the optimal paths.

We will construct two structures. The first is a pair of hash tables that specifies where to get data from upon request. The second will enable us to efficiently generate the paths along which data will be sent to the requesting nodes.

The two hash tables will store triplets of the form (n, d, m), that specifies that when node n requests item d, then it should be obtained from node m. For the first hash table m will be the node that can supply d for the lowest cost, and for the second hash table m will be the node that can supply d most quickly. The key of both hash tables will be pairs of the form (n,d). Hence, when we receive a request, we will be able to determine very efficiently where to get the data items from. More explanation on how to implement this is given in the More specifics section.

But the hash tables alone are not enough to reconstruct the entire path. Following the example given above section, we may look up (4, 37) in our index and find that we need to get item 37 from Node 6, but we still need to know that the path goes through Node 0. To help us reconstruct the intermediate Nodes, we will keep two NxN matrices, where N is the number of nodes (again, one matrix for time, the other for cost). To reconstruct the optimal path from node m to node n, we look at position [m][n] in the matrix. At this position there will be some value k, which will either be -1 or the number of some intermediate node along the path from m to n. If k = -1, then the optimal path is to go straight from m to n. Otherwise we must recursively reconstruct the paths from m to k and from k to n. More explanation on how to create these matrices is given in the More specifics section.

Part 2

In this part of the assignment, you must modify the main function to properly handle requests where the user specifies either time or cost criteria. When the user makes such a request, you must use the data structures you created in Part 1 to output the appropriate paths in linear time. Note that you must not only reconstruct the paths, but you must also output the total time and total cost of each path.

Part 3

In this part of the assignment, you read in requests that require both time and cost criteria. Specifically, you do the following. We want to service the request as quickly as possible, but also as cheaply as we can without sacrificing speed. Let's consider the example given above, where node 4 requests data items 37, 898, 2 and 103. Suppose that the example output given above was optimized for time, that is, item 37 takes 0.003s, item 898 takes 0s, item 2 takes 0.012s and item 103 takes 0.003s. We cannot complete the request until we have all items, so the total time it will take is 0.012s. We can see that item 2 is the limiting factor; all the other items have much lower retrieval times. So what if it was possible to retrieve the other items a bit more slowly but for a lesser cost? As long as we get them all in no more than 0.012s we're happy. So the problem may be stated as follows: "For any given request, calculate the minimum time, t, that it will take to service the request. Now output the paths for each item in the request such that the total cost to service the request is minimal within the constraint that no item may take longer than t to retrieve." Continuing with the example above, the output using these criteria might be:

37: 6 2 8 4 Time = 0.011 s, Cost = $0.000019 898: 4 Time = 0 s, Cost = $0 2: 3 1 7 6 0 4 Time = 0.012 s, Cost = $0.000056 103: 9 5 8 4 Time = 0.008 s, Cost = $0.000005 Total time = 0.012, Total cost = $0.000080

Item 37 is still retrieved from Node 6, but along a cheaper (and significantly longer) path. Item 103 is now retrieved from Node 9, where it was also stored. It could come from Node 6 along the same path 37 takes, but that path is much more costly.

Your job for this part is to develop an algorithm to solve this problem. You may not do any preprocessing of the network beyond what is required for Part 2 (if you are tempted to do this, see the README section). Your algorithm must find an optimal (i.e. lowest cost) solution in order to get full credit. The running time of your algorithm should be no worse than O( N ² ), where N is the number of nodes in the network.

More Specifics

The Hash Tables

It is your job to write the implementation for the HashTable class, as specified in the HashTable.h file. You may use separate chaining or any type of open addressing. As stated above, the key of the hash is a pair of integers. The first integer is a node number and will be in the range 0 to 99 (I won't test with more than 100 Nodes). The second integer represents a data item and will be in the range 0 to 999. Thus the hash tables will need to store no more than 100,000 items. It is up to you to chose an appropriate table size and hash function.

Here are a few of the methods you must implement:

void insert(int a, int b, int c): Inserts the triple (a, b, c) at position hash(a, b) in the table. Following the previous examples, we might insert(4, 37, 6).
bool remove(int a, int b, int c): Removes the given triple from the hash table. Returns false if the triple wasn't found.
bool lookup(int a, int b, int& c): Returns true if the key (a, b) is already present in the hash table. If so, sets c to the value corresponding to the pair (a, b). For example, lookup(4, 37, x) should set x = 6 and return true.
void print(): This method should print the entire contents of the table to cout. The output must be in the format: "(a,b,c) (d,e,f) …". Whitespace is unimportant, but the '(', ')', and ',' characters are important.

The Matrices

To build the NxN matrices, we will use the dynamic programming approach to finding the all-pairs shortest paths in a graph (this will be covered in class, but also see section 10.3.4 All-Pairs Shortest Path in the Weiss book). You will have to employ this algorithm twice in succession: once to calculate minimal time paths, and once to calculate minimal cost paths. You will need to make some adaptations to the algorithm as described in the book for our application. They are as follows:

The algorithm as specified in the book assumes that the graph is stored using an adjacency matrix representation. In order to use this algorithm you will have to initialize the NxN matrices from the adjacency list representation we use in this project.
The algorithm in the book uses creates two NxN matrices, one to store the distance between nodes, and one to keep track of intermediate nodes along the path. You should implement the algorithm using only one matrix. The key to this is that our graph is undirected; that is, the distance from node i to node j is the same as the distance from node j to node i.

After the matrices are built you must build the hash tables. You can use whatever algorithm you'd like, provided that the final index is correct. That is, for any triple (n, d, m) in the index, m must be the closest node to n that has item d in its database.

Teams

You may work in teams of two on this assignment. You may not work with the same partner you worked with for Project2! We require this because we believe it is important to have experience working with a variety of different people. Please refer to the Teamwork section of the Programming Assignment Guidelines. We strongly encourage you to work in teams! Individuals will be required to complete the same amount of work.

The project skeleton

As a starting point we've provided a few files:

Node.h, Node.cpp (do not modify)

The Node class. Nodes are used to represent the vertices of the graph (the graph in this case is the World-Wide-Web).

Edge.h (do not modify)

The edges of the graph.

webserver.cpp

This is the main driver code for the program. It reads in the network configuration and data files, and repeatedly prompts the user for their choice of what to do next. The choices are:

Quit this program
Issue a request
Print the index

Only the first of these options is implemented. You will need to implement the others. Please put any new functions you create to accomplish this in the utility.cpp file.

utility.h, utility.cpp

These are (mostly) empty files in which you should place new functions you write that aren't methods of a class. For instance, you might have a function printPath(HashTable&, int[][], int, int) that calculates and outputs the path for a given requested item. It's up to you how you want to structure your functions, but if you have them all in the same place it will be easier for me to grade your work.

HashTable.h

This is a partial specification for the HashTable class. You will have to modify this file in order to create a working HashTable class. The function prototypes in this file are methods that you must implement according to the specifications given. You may add whatever additional methods and data you desire (within reason, of course).

The following files you have seen before:

Vector.h, Vector.cpp (do not modify): Used to keep track of the Nodes in the network. One line in Vector.cpp has been changed since the last project to fix a minor bug. Also, a makeEmpty method has been added.
SortedVector.h, SortedVector.cpp (do not modify): Necessary for SortedVectorDictionary.
Dictionary.h (do not modify): Necessary for SortedVectorDictionary.
SortedVectorDictionary.h (do not modify): Used within the Node class to keep track of items in the database. I chose this over BSTree and AVLTree because lookups are faster. The linear-time cost of inserting is relatively insignificant compared to the disk I/O necessary to read the data file, and once the database is initialized it is never changed.
Makefile ???: You get to create your own Makefile this time!

All provided files can be found in the project directory:

/cse/courses/cse326/01sp/project3

Input files

There are two types of input files for this project. The first type is the network configuration file, and the second type is the data initialization file.

The network configuration file must begin with a positive integer representing the number of nodes in the network. The remaining lines of the file describe the edges in the graph. Edges are represented in the file as quadruples of non-negative integers. The first two integers are the numbers of the two nodes connected by the edge. The third integer is the length of time it takes to traverse the edge, and the fourth integer is the monetary cost of sending data over the edge.

The data initialization file is a list of data items present in the network. After each data item is a list of the nodes at which that item is stored. Colons (':') are used to indicate which numbers represent data items and which represent node numbers. So the format of a single line in the file is something like: ":int: int int ... int" (but without the quotation marks).

Some sample input files may be found in the inputs subdirectory of the main project directory.

Turnin

You are required to turn in your project3 directory, which must include at least the following files:

HashTable.h, HashTable.cpp
utility.h, utility.cpp
Your modified webserver.cpp
Your Makefile
At least two sets of original input files
Any other files you created for this project

Turn in your files electronically using turnin:

% cd ~/your326Directory/yourProject3Directory % make clean % cd ../ % turnin -c cse326 -p project3 yourProject3Directory

Note that you do not need to print out anything or hand in any paper copies! Only electronic submissions will be accepted.

The README

Your README file should document your submission (though it does not by any means replace comments in the code!). It should contain the following:

Your name(s) and the name of your team.
Approximately how long the project took you to complete.
A description of how you divided the work between you (omit this if you worked alone).
A brief statement of completeness: does your project meet the requirements? if not, in what ways does it fall short?
Acknowledgement of any assistance you received from anyone but the 326 staff or the Weiss book.
A list of the files used in your project with a brief (anywhere from one sentence to one paragraph) description of each file. You should list all the files, but you only need to describe the ones you created or changed.
A detailed description of your algorithm for Part 3. This should include a complete time complexity analysis (both best case and worst case).
A description of any extra credit you attempted.

Also answer the following questions (justify your answers):

Could you increase the efficiency (that is, decrease the time complexity) of your algorithm from Part 3 if you were allowed to do further preprocessing of the data and perhaps create additional data structures? Implement some further preprocessing, and create a modified version of your algorithm from Part 3 to take advantage of this. What are the time and space requirements of your additional preprocessing? What is the time complexity of your new algorithm?
In this assignment we primarily use an adjacency list representation of our graph. Would it be more efficient to use an adjacency matrix representation? Why or why not? Can you think of any other way to represent graphs?
What did you enjoy about this assignment? What did you hate?

If you find this too easy...

For a small amount of extra credit, you may attempt one or more of the following. If you choose to do one of these, it is important that you first have your program working exactly according to the specifications, so that it will work with my test scripts. If you would like to try something not on this list, please email me about it first (bone@cs).

In the Overview I promised that we'd be addressing the problem of deciding where to store data items in a network so that they may be retrieved quickly and efficiently. So far in the assignment we haven't addressed this. Suppose we know in advance what requests will be issued from which nodes in the network. Can we replicate some of the data items to increase the efficiency of handling these requests? The answer of course is yes, but if there is a limit to the number of items we may replicate, the problem becomes very difficult. For this extra credit option, you will implement an algorithm to decide how to distribute the data. The parameters to the algorithm are the network (with the nodes already populated with data), a list of requests, and an integer X representing the maximum number of additions we may make to the node databases. The output of the algorithm is a set of X pairs, (n, d), where n is a node number and d is an item we would like to put in n's database. Typically, X will be much smaller than the total number of requests made. Once you have implemented this algorithm, incorporate it into the existing code and run some tests (assume the both option for request handling). How much does your algorithm improve the efficiency of request handling? Describe the circumstances in which your algorithm works well, and also in which it works poorly. Note: nobody knows an efficient algorithm for this problem!
Could you increase the efficiency (that is, decrease the time complexity) of your algorithm from Part 3 if you were allowed to do further preprocessing of the data and perhaps create additional data structures? Implement some further preprocessing, and create a modified version of your algorithm from Part 3 to take advantage of this. What are the time and space requirements of your additional preprocessing? What is the time complexity of your new algorithm?
Extend the program to allow the user more control over how their requests are processed. Allow the user to specify either a maximum total cost, a maximum total time, or both for any given request. If the user specifies a cost (say, 20), then the program will try to minimize the time to handle the request without exceeding the specified cost. Similar if the user specifies a time. If the user specifies both a time and a cost, the program may output any set of paths that service the request within the specified bounds. If there is no such path, the program should print a message saying so.
Other ideas...? (email Nic)

Approximate grading breakdown

15% - correct implementation of HashTable class
20% - correct implementation of All-Pairs-Shortest-Path algorithm
10% - correct path reconstruction for Part 2
25% - correctness of algorithm for Part 3
25% - your README file
5% - submission of original input files

Acknowledgements

The idea behind this assignment was developed by Alon Halevy, Maya Rodrig, and Nic Bone. This document was written by Alon Halevy and Nic Bone. All supplied code was written by Nic Bone.