For homework #3, you will build on your homework #2 implementation. In Part A, you will write code that takes an in-memory inverted index produced by hw2 and writes it out to disk in an architecture-neutral format. In Part B, you will write C++ code that walks through an on-disk index to service a lookup. Finally, in Part C, you will write a query processor that serves queries from multiple on-disk indices.

As before, please read through this entire document before beginning the assignment, and please start early! There is even more code involved with HW3 than in earlier homeworks, and since this is your first serious attempt to use C++, you should expect to encounter a lot of problems along the way. Also, manipulating on-disk data is trickier than in-memory data structures, so plan for some time to get this part right.

In HW3, as with HW2, you don’t need to worry about propagating errors back to callers in all situations. You will use Verify333() to spot errors and cause your program to crash out if they occur. We will not be using C++ exceptions in HW3.

Part A: finish our memory-to-file index marshaller

Keeping a search engine index in memory is problematic, since memory is expensive and also volatile. So, you’re going to write some C++ code that takes advantage of your HW2 implementation to first build an in-memory index of a file subtree, and then write that index into an index file in an architecture-neutral format.

What do we mean by architecture-neutral? Every time we need to store an integer in the file’s data structure, we will store it in big endian representation. This is the representation that is conventionally used for portability, but the bad news is that it is the opposite of the representation used by most computers you use: x86 computers are little endian. So, you will need to convert integers (whether 16-bit, 32-bit, or 64-bit) into big endian before writing them into the file. We provide you with helper functions to do this.
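To make the byte ordering concrete, here is a minimal sketch of a big endian conversion. The names EncodeBigEndian and DecodeBigEndian are hypothetical and are not the helper functions provided with the starter code; they only illustrate what "most significant byte first" means for 16-, 32-, and 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the course-provided one): returns `width` bytes
// holding `value`, most significant byte first (big endian).
std::vector<uint8_t> EncodeBigEndian(uint64_t value, std::size_t width) {
  assert(width == 2 || width == 4 || width == 8);
  std::vector<uint8_t> bytes(width);
  for (std::size_t i = 0; i < width; ++i) {
    // Byte 0 receives the most significant 8 bits; the last byte the least.
    bytes[i] = static_cast<uint8_t>(value >> (8 * (width - 1 - i)));
  }
  return bytes;
}

// Hypothetical inverse: rebuilds the integer from big endian bytes.
uint64_t DecodeBigEndian(const std::vector<uint8_t>& bytes) {
  uint64_t value = 0;
  for (uint8_t b : bytes) {
    value = (value << 8) | b;
  }
  return value;
}
```

Because the sketch uses shifts rather than reinterpreting memory, it produces the same bytes on little endian and big endian hosts.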

The good news is that we’re going to keep roughly the same data structure inside the file as you built up in memory: we’ll have chained hash tables that are arrays of buckets containing linked lists. And, our inverted index will be a hash table containing a bunch of embedded hash tables. But, we need to be very precise about the specific layout of these data structures within the file. So, let’s first walk through our specification of an index file’s format. We’ll do this first at a high level of abstraction, showing the major components within the index file. Then, we’ll zoom into these components, showing additional details about each.

At a high level, the index file looks like the figure on the right. The index file is split into three major pieces: a header, the doctable, and the index. We’ll talk about each in turn.

An index file’s header contains metadata about the rest of the index file.

The first four bytes of the header are a magic number, or format indicator. Specifically, we use the 32-bit number 0xCAFEF00D. We will always write the magic number out as the last step in preparing an index file. This way, if your program crashes partway through writing one, the magic number will be missing, and it will be easy to tell that the index file is corrupt.

The next four bytes are a checksum of the doctable and index regions of the file. A checksum is a mathematical signature of a bunch of data, kind of like a hash value. By including a checksum of most of the index file within the header, we can tell if the index file has been corrupted, such as by a disk error. If the checksum stored in the header doesn’t match what we recalculate when opening an index file, we know the file is corrupt and we can discard it.

The next four bytes store the size of the doctable region of the file. The size is stored as a 32-bit, unsigned, big endian integer.

The final four bytes of the header store the size of the index region of the file, in exactly the same way.
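The header layout and the "magic number last" rule can be sketched as follows. This is an illustration under stated assumptions, not the structure of fileindexwriter.cc: the function and constant names are hypothetical, the doctable and index regions are passed in as already-serialized bytes, and the checksum is passed in as a parameter because the text above does not say how it is computed.

```cpp
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

constexpr uint32_t kMagicNumber = 0xCAFEF00D;  // value given in the text above
constexpr std::size_t kHeaderSize = 16;        // four 32-bit fields

// Hypothetical helper: writes `value` to `out` as four big endian bytes.
void WriteBigEndian32(std::ostream& out, uint32_t value) {
  const char bytes[4] = {
      static_cast<char>(static_cast<uint8_t>(value >> 24)),
      static_cast<char>(static_cast<uint8_t>(value >> 16)),
      static_cast<char>(static_cast<uint8_t>(value >> 8)),
      static_cast<char>(static_cast<uint8_t>(value))};
  out.write(bytes, 4);
}

// Sketch of the write order: regions first, then checksum and sizes,
// and the magic number as the very last step.
void WriteIndexFileSketch(std::ostream& out, const std::string& doctable,
                          const std::string& index, uint32_t checksum) {
  // 1. Reserve the header with zero bytes, then write both regions.
  const std::string placeholder(kHeaderSize, '\0');
  out.write(placeholder.data(), static_cast<std::streamsize>(kHeaderSize));
  out.write(doctable.data(), static_cast<std::streamsize>(doctable.size()));
  out.write(index.data(), static_cast<std::streamsize>(index.size()));

  // 2. Fill in the checksum and the two region sizes (header bytes 4..15).
  out.seekp(4);
  WriteBigEndian32(out, checksum);
  WriteBigEndian32(out, static_cast<uint32_t>(doctable.size()));
  WriteBigEndian32(out, static_cast<uint32_t>(index.size()));

  // 3. Only now write the magic number (header bytes 0..3).
  out.seekp(0);
  WriteBigEndian32(out, kMagicNumber);
}

// Demonstration wrapper: runs the sketch against an in-memory stream.
std::string BuildImageForDemo(const std::string& doctable,
                              const std::string& index, uint32_t checksum) {
  std::stringstream stream;
  WriteIndexFileSketch(stream, doctable, index, checksum);
  return stream.str();
}
```

If the program stops before step 3, the first four bytes are still zero, so a reader that checks the magic number rejects the partially written file.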

Doctable

Let’s drill down into the next level of detail by examining the content of the doctable region of the file. The doctable is a hash table that stores a mapping from 64-bit document ID to an ASCII string representing the document’s filename. This is the docid_to_docname HashTable that you built up in HW2, but stored in the file rather than in memory.

The doctable consists of three regions; let’s walk through them, and then drill down into some details.

Phew! That wasn’t so bad.

Index

The index is the most complicated of the three regions within the index file. The great news is that it has pretty much the same structure as the doctable: it is just a hash table, laid out exactly the same way. The only part of the index that differs from the doctable is the structure of each element. Let’s focus on that.

An index maps from a word to an embedded docID hash table, or docID table. So, each element of the index contains enough information to store all of that. Specifically, an index table element contains:

docIDtable

Like the doctable, each embedded docIDtable within the index is just a hash table! A docIDtable maps from a 64-bit docID to a list of positions within that document at which the word can be found. So, each element of the docIDtable stores exactly that:

So, putting it all together, the entire index file contains a header, a doctable (a hash table that maps from docID to filename), and an index. The index is a hash table that maps from a word to an embedded docIDtable. The docIDtable is a hash table that maps from a document ID to a list of word positions within that document.

Easy!
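As a way to keep the nesting straight, the three mappings summarized above can be modeled with standard containers. This is only a mental model, not the on-disk layout and not the HW2 data structures: the type names are hypothetical, and the 32-bit width chosen for a word position is an assumption (the text above only fixes the docID at 64 bits).

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using DocID = uint64_t;    // 64-bit document ID, per the text
using Position = uint32_t; // assumption: width of a word position

// doctable: docID -> filename
using DocTable = std::unordered_map<DocID, std::string>;
// docIDtable: docID -> positions of one word within that document
using DocIDTable = std::unordered_map<DocID, std::vector<Position>>;
// index: word -> embedded docIDtable
using Index = std::unordered_map<std::string, DocIDTable>;

// Hypothetical in-memory model of the file's two data regions.
struct IndexFileModel {
  DocTable doctable;
  Index index;
};

// Returns the filenames of all documents that contain `word`.
std::vector<std::string> FilesContaining(const IndexFileModel& model,
                                         const std::string& word) {
  std::vector<std::string> names;
  auto word_it = model.index.find(word);
  if (word_it == model.index.end()) {
    return names;  // word not indexed
  }
  for (const auto& entry : word_it->second) {
    // entry.first is a docID; translate it to a filename via the doctable.
    auto doc_it = model.doctable.find(entry.first);
    if (doc_it != model.doctable.end()) {
      names.push_back(doc_it->second);
    }
  }
  return names;
}
```

A lookup therefore touches the index first, then the embedded docIDtable, and finally the doctable to turn docIDs into filenames.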

What to do

The bulk of the work in this homework is in this step. We’ll tackle it in parts.

Change to the directory that has your hw1 and hw2 directories in it. Click (or right-click if needed) on this hw3.tar.gz link to download the archive containing the starter code for hw3. Extract its contents (tar xzf hw3.tar.gz). You will need the hw1 and hw2 directories in the same folder as your new hw3 folder since hw3 links to files in those previous directories.

Look around inside of hw3/ to familiarize yourself with the structure. Note that there is a libhw1/ directory that contains a symlink to your libhw1.a and a libhw2/ directory that contains a symlink to your libhw2.a. You can replace your libraries with ours (from the appropriate solution_binaries directories) if you prefer.

Next, run make to compile the three HW3 binaries. One of them is the usual unit test binary. Run it, and you’ll see the unit tests fail and crash out; you won’t yet earn the automated grading points tallied by the test suite.

Now, take a look inside fileindexutil.h and filelayout.h. These header files contain some useful utility routines and classes you’ll take advantage of in the rest of the assignment. We’ve provided the full implementation of fileindexutil.cc. Next, look inside fileindexwriter.h; this header file declares the WriteIndex() function, which you will be implementing in this part of the assignment. Also, look inside buildfileindex.cc; this file makes use of fileindexwriter.h’s WriteIndex(), and your HW2 CrawlFileTree(), to crawl a file subtree and write the resulting index out into an index file. Try running the solution_binaries/buildfileindex program to build one or two index files for a directory subtree, and then run the solution_binaries/filesearchshell program to issue queries against them.

Finally, it’s time to get to work! Open up fileindexwriter.cc and take a look around inside. It looks complex, but all of the helper routines and major functions correspond pretty directly to our walkthrough of the data structures above. Start by reading through WriteIndex(); we’ve given you part of its implementation. Then, start recursively descending through all the functions it calls, and implement the missing pieces. (Look for MISSING: in the text to find what you need to implement.)

Once you think you have the writer working, compile and run the test_suite as a first step. Next, use your buildfileindex binary to produce an index file (we suggest indexing ./test_tree/enron_email as a good test case). After that, use the solution_binaries/filesearchshell program that we provide, passing it the name of the index file your buildfileindex produces, to see if it’s able to successfully parse the file and issue queries against it. If not, you need to fix some bugs before you move on!

Performance on attu

If you write the index files to your personal directories on a CSE lab machine or on attu, you may find that the program runs very slowly. That’s because home directories on those machines are actually on a file server, and buildfileindex does a huge number of small write operations, which can be quite slow over the network. To speed things up dramatically we suggest you write the files in /tmp, which is a directory on a local disk attached to each machine. Be sure to remove the files when you’re done so the disk doesn’t fill up.

As an even more rigorous test, try running the hw3fsck program we’ve provided in solution_binaries against the index that you’ve produced. hw3fsck scans through the entire index, checking every field inside the file for reasonableness. It tries to print out a helpful message if it spots some kind of problem.

Once you pass hw3fsck and once you’re able to issue queries against your file indexes, then rerun your buildfileindex program under valgrind and make sure that you don’t have any memory leaks or memory errors.

Congrats, you’ve passed part A of the assignment!

Part B: finish our index lookup code

Now that you have a working memory-to-file index writer, the next step is to implement code that knows how to read an index file and look up query words and docids against it. We’ve given you the scaffolding of the implementation that does this, and you’ll be finishing our implementation.
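Reading starts with the header described in Part A. The following is a minimal sketch of that first step, assuming the whole file has already been loaded into a std::string; the struct and function names are hypothetical and do not come from the provided scaffolding. It checks the magic number and, as an additional assumption, that the header plus the two regions account for the whole file. It does not verify the checksum, since the algorithm is not specified above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical in-memory copy of the 16-byte header.
struct IndexHeaderSketch {
  uint32_t magic;
  uint32_t checksum;
  uint32_t doctable_size;
  uint32_t index_size;
};

// Hypothetical helper: decodes four big endian bytes starting at `offset`.
uint32_t ReadBigEndian32(const std::string& data, std::size_t offset) {
  uint32_t value = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    value = (value << 8) | static_cast<uint8_t>(data[offset + i]);
  }
  return value;
}

// Parses the header into *out. Returns false if the file looks corrupt.
bool ParseHeaderSketch(const std::string& file_bytes, IndexHeaderSketch* out) {
  const std::size_t kHeaderSize = 16;
  if (file_bytes.size() < kHeaderSize) {
    return false;  // too short to hold a header
  }
  out->magic = ReadBigEndian32(file_bytes, 0);
  out->checksum = ReadBigEndian32(file_bytes, 4);
  out->doctable_size = ReadBigEndian32(file_bytes, 8);
  out->index_size = ReadBigEndian32(file_bytes, 12);

  if (out->magic != 0xCAFEF00D) {
    return false;  // missing magic number: writer did not finish
  }
  // Assumption: header + doctable + index fill the file exactly.
  const uint64_t expected = static_cast<uint64_t>(kHeaderSize) +
                            out->doctable_size + out->index_size;
  return expected == file_bytes.size();
}
```

Returning a bool keeps the sketch free of exceptions, in line with the rest of HW3; real code in this assignment would more likely crash out through Verify333() when a check fails.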

What to do

As one more hint, once you think you have this working, move on to finish our filesearchshell implementation. You’ll be able to test the output of your filesearchshell against ours (in solution_binaries/) as a final sanity check.

Also, now would be a great time to run valgrind over the unit tests to verify you have no memory leaks or memory errors.

You’re done with part B!

Part C: implement the search shell

For Part C, your job is to implement a search shell, just like in HW2, but this time using the HW3 infrastructure you completed in Parts A and B.

Congrats, you’re done with the mandatory parts of HW3!!

Bonus

There are three bonus tasks for this assignment. As before, you can do none of them with no impact on your grade. Or, you can do one, two, or all three of them if you’re feeling inspired!

If you do any of the bonus parts, first create a tar file with the required parts of your project to turn in separately. That will allow us to more easily evaluate how well you did on the basic requirements of the assignment. See the “What to turn in” section below for more details.

What to turn in

When you’re ready to turn in your assignment, do the following:

$ make clean
$ cd ..
$ tar czf hw3_<username>.tar.gz hw3
$ # make sure the tar file has no compiler output files in it, but
$ # does have all your source and other files you intend to submit
$ tar tzf hw3_<username>.tar.gz

Grading

We will be basing your grade on several elements: