Out: Tue, Oct 29
Due: Thu, Nov 14 by 8:59 pm
You will build on top of your HW2 implementation: In Part A, you will write code that takes an in-memory inverted index produced by HW2 and writes it out to disk in an architecture-neutral format. In Part B, you will write C++ code that walks through an on-disk index to service a lookup. Finally, in Part C, you will write a query processor that serves queries from multiple on-disk indices.
As before, please read through this entire document before beginning the assignment, and please start early! As usual, there is a lot of code involved with HW3, and since this is your first serious attempt to use C++ you should expect to encounter a lot of problems along the way. Also, manipulating on-disk data is trickier than in-memory data structures, so plan for some time to get this part right.
In HW3, as with HW2, you don't need to worry about propagating errors back to callers in all situations. You will use Verify333()'s to spot errors and cause your program to crash out if they occur. We will not be using C++ exceptions in HW3.
Also, as before, you may not modify any of the existing header files or class definitions distributed with the code. If you wish to add extra "helper" functions you can to do that by including additional static functions in the implementation (.cc) files.
The following videos are intended to help you double-check your understanding of the index file format used in HW3 as well as demonstrate the debugging process. It is recommended that you watch these AFTER reading through the specs.
Note: You can click on the labeled timestamps in the "Contents" menu to jump between sections of the video.
Keeping a search engine index in memory is problematic, since memory is expensive and also volatile. So, in Part B, you're going to write some C++ code that takes advantage of your HW2 implementation to first build an in-memory index of a file subtree, and then it will write that index into an index file in an architecture-neutral format.
What do we mean by architecture-neutral? Every time we need to store an integer in the file's data structure, we will store it in big endian representation. This is the representation that is conventionally used for portability, but the bad news is that this is the opposite representation than most computers you use: x86 computers are little endian. So, you will need to convert integers (whether 16-bit, 32-bit, or 64-bit) into big endian before writing them into the file. We provide you with helper functions to do this.
The good news is that we're going to keep roughly the same data structure inside the file as you built up in memory: we'll have chained hash tables that are arrays of buckets containing linked lists. And, our inverted index will be a hash table containing a bunch of embedded hash tables. But, we need to be very precise about the specific layout of these data structures within the file. So, let's first walk through our specification of an index file's format. We'll do this first at a high level of abstraction, showing the major components within the index file. Then, we'll zoom into these components, showing additional details about each.
At a high-level, the index file looks like the figure on the right. The index file is split into three major pieces: a header, the doctable, and the index. We'll talk about each in turn.
Header: an index file's header contains metadata about the rest of the index file.
The first four bytes of the header are a
magic number,
or format indicator. Specifically, we use the 32-bit
number 0xCAFEF00D
. We will always write
the magic number out as the last step in preparing an index
file. This way, if your program crashes partway through
writing one, the magic number will be missing, and it will
be easy to tell that the index file is corrupt.
The next four bytes are a checksum of the doctable and index regions of the file. A checksum is a mathematical signature of a bunch of data, kind of like a hash value. By including a checksum of most of the index file within the header, we can tell if the index file has been corrupted, such as by a disk error. If the checksum stored in the header doesn't match what we recalculate when opening an index file, we know the file is corrupt and we can discard it.
The next four bytes store the size of the doctable region of the file. The size is stored as a 32-bit, signed, big endian integer.
The final four bytes of the header store the size of the index region of the file, in exactly the same way.
Doctable: Let's drill down into the next level of
detail by examining the content of the doctable region
of the file. The doctable is a hash table that stores a
mapping from 64-bit document ID to an ASCII string
representing the document's filename. This is the
docid_to_docname
HashTable that you built up in
HW2, but but stored in the file rather than in memory.
The doctable consists of three regions; let's walk through them, and then drill down into some details.
num_buckets
: this region is the
simplest; it is just a 32-bit big endian integer that
represents the number of buckets inside the hash table,
exactly like you stored in your HashTable.
bucket_rec
records:
this region contains one record for each bucket in the hash
table. A bucket_rec
record is 8 bytes long,
and it consists of two four-byte fields. The
chain len field is a four byte integer that tells
you the number of elements in the bucket's chain. (This
number might be zero if there are no elements in that
chain!) The bucket position
field is
a four byte integer tells you the offset of the bucket
data (i.e., the chain of bucket elements) within the
index file. The offset is just like a pointer in
memory, or an index of an array, except it points
within the index file. So, for example, an offset of 0 would
indicate the first byte of the file, an offset of 10 would
indicate the 11th byte of the file, and so on.
Index: The index is the most complicated of the three regions within the index file. The great news is that it has pretty much the same structure as the doctable: it is just a hash table, laid out exactly the same way. The only part of the index that differs from the doctable is the structure of each element. Let's focus on that.
An index maps from a word to an embedded docID hash table, or docID table. So, each element of the index contains enough information to store all of that. Specifically, an index table element contains:
docIDtable:
like the "doctable" table,
each embedded "docIDtable" table within the index is just
a hash table! A docIDtable
maps from a 64-bit
docID to a list of positions with that document that the
word can be found in. So, each element of the docID table
stores exactly that:
So, putting it all together, the entire index file contains a header, a doctable (a hash table that maps from docID to filename), and an index. The index is a hash table that maps from a word to an embedded docIDtable. The docIDtable is a hash table that maps from a document ID to a list of word positions within that document.
The bulk of the work in this homework is in this step. We'll tackle it in parts.
git pull
to retrieve the
new hw3/
folder for this assignment. You still
will need the hw1/
, hw2/
, and
projdocs
directories in the same folder as
your new hw3
folder since hw3/
links to files in those
previous directories, and the test_tree
folder
also still needs to be present.hw3/
to familiarize
yourself with the structure. Note that there is a
libhw1/
directory that
contains a symlink to your libhw1.a
, and a
libhw2/
directory that contains a symlink
to your libhw2.a
. You can replace your
libraries with ours (from the appropriate
solution_binaries
directories) if you prefer.
make
to compile the three HW3
binaries. One of them is the usual unit test binary. Run
it, and you'll see the unit tests fail, crash out, and
you won't yet earn the automated grading points tallied
by the test suite.Utils.h
and
LayoutStructs.h
. These header files
contains some useful utility routines and classes you'll take
advantage of in the rest of the assignment. We've provided
the full implementation of Utils.cc
. Next,
look inside WriteIndex.h
; this header file
declares the WriteIndex()
function, which
you will be implementing in this part of the
assignment. Also, look inside buildfileindex.cc; this file
makes use of WriteIndex()
and your HW2
CrawlFileTree()
, to crawl a file subtree
and write the resulting index out into an index file.
Try running the solution_binaries/buildfileindex
program to build one or two index files for a directory
subtree, and then run the
solution_binaries/filesearchshell
program to
try out the generated index file.WriteIndex.cc
and take a look around inside.
It looks complex, but all of the helper routines and
major functions correspond pretty directly to our
walkthrough of the data structures above. Start by
reading through WriteIndex()
; we've
given you part of its implementation. Then, start
recursively descending through all the functions it
calls, and implement the missing pieces. (Look for
STEP:
in the text to find what you
need to implement.)test_suite
as a first step. Next, use
your buildfileindex
binary to produce an index
file (we suggest indexing something small, like
./test_tree/enron_email
as a good test case).
After that, use the
solution_binaries/filesearchshell
program that
we provide, passing it the name of the index file that your
buildfileindex
produces, to see if it's able
to successfully parse the file and issue queries against
it. If not, you need to fix some bugs before you move
on!
[Aside: If you write the index files to your personal
directories on a CSE lab machine or on attu, you may
find that the program runs very slowly. That's
because home directories on those machines
on a network file server, and buildfileindex
does a huge number of small write operations, which can
be quite slow over the network. To speed things up
dramatically we suggest you write the index files
into /tmp
, which is a directory on
a local disk attached to each machine. Be sure to
remove the files when you're done so the disk
doesn't fill up.]
As an even more rigorous test, try running the
hw3fsck
program we've provided in
solution_binaries
against the index that
you've produced. hw3fsck
scans through
the entire index, checking every field inside the file
for reasonableness. It tries to print out
a helpful message if it spots some kind of problem.
Once you pass hw3fsck
and once you're
able to issue queries against your file indices, then
rerun your buildfileindex
program
under valgrind and make sure that you don't have any
memory leaks or memory errors.
Congrats, you've passed part A of the assignment!
Now that you have a working memory-to-file index writer, the next step is to implement code that knows how to read an index file and lookup query words and docids against it. We've given you the scaffolding of the implementation that does this, and you'll be finishing our implementation.
FileIndexReader.h
.
Notice that we're now in full-blown C++ land; you'll be
implementing constructors, manipulating member variables
and functions, and so on. Next, open up
FileIndexReader.cc
. Your job is to finish the
implementation of its constructor, which reads the header
of the index file and stores various fields as private
member variables. As above, look for "STEP:"
to figure out what you need to implement. When you're done,
recompile and re-run the test suite. You won't yet pass
all the tests for test_fileindexwriter.cc
, but
you'll get some points for it.HashTableReader.h
.
Read through it to see what the class does. Don't worry
about the copy constructor and assignment operator details
(though if you're curious, read through them to see what
they're doing and why). This class serves as a base class
for other subclasses. The job of a
HashTableReader
is to provide most of the
generic hash-table lookup functionality; it knows how to
look through buckets and chains, returning offsets to
elements associated with a hash value. Open up
HashTableReader.cc
. Implement the
"STEP:"
components in the constructor and
the LookupElementPositions
function. When
you're done, recompile and run the unit tests to see if you
pass test_hashtablereader.cc
unit test.
DocTableReader.h
.
Read through it, and note that it is a subclass of
HashTableReader
. It inherits
LookupElementPositions()
and other aspects,
but provides some new functionality. Next, open up
DocTableReader.cc,
and implement the
"STEP:"
functionality. See how well you do on
its unit test when you're done. (By the way, you have been
using valgrind to test for problems and memory leaks
along the way, right? We definitely will when grading
your assignment! ;)IndexTableReader.h
.
Read through it and understand its role. Next, open up
IndexTableReader.cc
and implement the
"STEP:"
functionality. Test against the unit
tests.DocIDTableReader
.h and
DocIDTableReader.cc
. Note that the
Test_QueryProcessor
unit test will hang in
an infinite loop once you've finished this step; once
you finish the next step that will get fixed.
QueryProcessor.h
and understand how it is supposed to work. Check out
test_queryprocessor.cc
for more information.
Now open up QueryProcessor.cc
and read through
our implementation of the constructor and destructor.
This part of the assignment is the most open-ended. We've
given you the function definition for
ProcessQuery()
, and also a clue aboud
what you should be building up and returning. But, we've
given you nothing about its implementation. You get to
implement it entirely on your own; you might want to
define helper private member functions, you might want
to define other structures to help along the way, etc.;
it's entirely up to you. But, once you're finished,
you'll need to pass our unit test to know you've
done it correctly.
As a hint, you should be able to take inspiration from
what you did to implement the query processor in HW2.
Here, it's only a little bit more complicated. You
want to process the query against each index, and then
intersect each index's results together and do a
final sort. (Use the STL's sort
). Remember
that processing a query against an index means ensuring all
query words are present in each matching document, and
remember how ranking works. Then, once you have query
results from each index, you'll append them all together
to form your final query results.
As one more hint, once you think you have this working,
move on to Part C and finish our
filesearchshell
implementation. You'll be
able to test the output of your filesearchshell
against ours (in solution_binaries/
) as a
final sanity check.
Also, now would be a great time to run valgrind over the unit tests to verify you have no memory leaks or memory errors.
You're done with part B!
For Part C, your job is to implement a search shell, just like in HW2, but this time using your HW3 infrastructure you completed parts A and B.
filesearchshell.cc
and read through
it. Note that unlike parts A and B, we have given you
almost nothing about the implementation of the
filesearchshell
besides a really long
(and hopefully helpful) comment. Implement
filesearchshell.cc
.filesearchshell
binary. You
can compare the output of your binary against a
transcript of our solution.
The transcripts should match precisely, except perhaps for
the order of equally ranked matches.filesearchshell
. filesearchshell
ought to clean up all allocated memory before exiting. So,
run your filesearchshell
under valgrind
to ensure there are no leaks or errors.
Congrats, you're done with (the mandatory parts) of HW3!!
There are three bonus tasks for this assignment. As before you can do none of them with no impact on your grade. Or, you can do one, two, or all three of them if you're feeling inspired!
If you want to do any of the bonus parts, first
create a hw3-final
tag in your repository to
mark the version of the assignment with the required parts
of the project. That will allow us to more easily evaluate
how well you did on the basic requirements of the assignment.
Then, when you are done adding additional bonus parts, create
a new tag hw3-bonus
after committing your
additions, and push the additions and the new tag to your
GitLab repository. If we find a hw3-bonus
tag
in your repository we'll grade the extra credit parts;
otherwise we'll assume that you just did the required
parts.
(word frequency) /
(number of words in the document)
.
Do an informal study that evaluates whether this ranking is better or worse, and present your evidence.
Implement tf-idf ranking. You'll have to populate a new hashtable for corpus word frequency and incoporate it into the index file format. Do an informal study that evaluates whether this ranking is better or worse than term frequency, and present your evidence.
When you are ready to turn in your assignment, you should follow
the same procedures you used for previous assignments, except
this time tag the repository with hw3-final
for
the required parts of the project. Remember to clean up,
commit, and push all necessary files to your repository before
you add the tag. After you have created and pushed the
tag, remember to test everything
by creating a new clone of the repository in a separate,
empty directory, checkout the hw3-final
tag,
and verify that everything works as expected. Refer to the
hw0 turnin instructions for details, and follow those
steps carefully.
If you do any of the bonus parts, create an additional
hw3-bonus
tag and push that after adding and pushing
the bonus code to the repository. Be sure to clone the repo,
checkout that tag, and verify that everything works as
expected. Also verify that the hw3-final
tag
is still present and that it includes (only) the required
parts of the project.
As with hw2, when you clone
your repository to check it, it will normally not include
hw1/libhw1.a
and hw2/libhw2.a
. You
should either run make
in those directories to
recreate those archives, or else copy the versions from the
solution_binaries/
folders into the right places.
These are needed in order to build hw3 and test it.
We will be basing your grade on several elements: