Out: Tuesday, May 2, 2023
Due: Thursday, May 18, 2023 by 11:59 pm
In this assignment, you will build on top of your Homework 2 implementation and move the inverted index to an on-disk index:
bug_journal.md
.As before, please read through this entire document before beginning the assignment, and please start early! As usual, there is a lot of code involved with HW3, and since this is your first serious attempt to use C++ you should expect to encounter a lot of problems along the way. Also, manipulating on-disk data is trickier than in-memory data structures, so plan for some time to get this part right.
.cc
) files.Verify333()
's to spot errors and cause
your program to crash out if they occur.
Be sure to only use Verify333()
if the function
doesn't specify what it will do on error.The following videos are intended to help you double-check your understanding of the index file format used in HW3 as well as demonstrate the debugging process. It is recommended that you watch these AFTER reading through the specs.
Note: You can click on the labeled timestamps in the "Contents" menu to jump between sections of the video.
Keeping a search engine index in memory is problematic, since memory is expensive and also volatile. So, in Part A, you're going to write some C++ code that takes advantage of your HW2 implementation to first build an in-memory index of a file subtree, and then it will write that index into an index file in an architecture-neutral format.
What do we mean by architecture-neutral? Every time we need to store an integer in the file's data structure, we will store it in big endian representation. This is the representation that is conventionally used for portability, but the bad news is that this is the opposite representation than most computers you use: x86 computers are little endian. So, you will need to convert integers (whether 16-bit, 32-bit, or 64-bit) into big endian before writing them into the file. We provide you with helper functions to do this.
The good news is that we're going to keep roughly the same data structure inside the file as you built up in memory: we'll have chained hash tables that are arrays of buckets containing linked lists. And, our inverted index will be a hash table containing a bunch of embedded hash tables. But, we need to be very precise about the specific layout of these data structures within the file. So, let's first walk through our specification of an index file's format. We'll do this first at a high level of abstraction, showing the major components within the index file. Then, we'll zoom into these components, showing additional details about each.
At a high-level, the index file looks like the figure on the right. The index file is split into three major pieces: a header, the doctable, and the index. We'll talk about each in turn.
An index file's header contains metadata about the rest of the index file:
0xCAFEF00D
.
We will always write the magic number out as the last step in
preparing an index file.
This way, if the program crashes partway through writing one, the
magic number will be missing, and it will be easy to tell that
the index file is corrupt.
Let's drill down into the next level of detail by examining the
content of the doctable region of the file.
The doctable is a hash table that stores a mapping from 64-bit
document ID to an ASCII string representing the document's
filename.
This is the docid_to_docname
HashTable that you built
up in HW2, but stored in the file rather than in memory.
The doctable consists of three regions; let's walk through them, and then drill down into some details:
num_buckets
: This region is the simplest;
it is just a 32-bit big endian integer that represents the number
of buckets inside the hash table, exactly like you stored in your
HashTable.bucket_rec
records: This region contains one record for each bucket in
the hash table.
A bucket_rec
record is 8 bytes long, and it consists
of two four-byte fields.
The chain len field is a four byte integer that tells
you the number of elements in the bucket's chain.
(This number might be zero if there are no elements in that
chain!)
The bucket position
field is a four-byte
integer that tells you the offset of the bucket data
(i.e., the chain of bucket elements) within the index
file.
The offset is just like a pointer in memory or an index of an
array, except that it points within the index file.
For example, an offset of 0 would indicate the first byte of the
file, an offset of 10 would indicate the 11th byte of the file,
and so on.The index is the most complicated of the three regions within the index file. The great news is that it has pretty much the same structure as the doctable: it is just a hash table, laid out exactly the same way. The only part of the index that differs from the doctable is the structure of each element, so let's focus on that.
An index maps from a word to an embedded docID hash table, or docID table. So, each element of the index contains enough information to store all of that. Specifically, an index table element contains:
So, all we need to understand now is the format of the docID table. We're sure its format will come as no surprise at this point...
docIDtable
Like the "doctable" table, each embedded "docIDtable" table within
the index is just a hash table!
A docIDtable
maps from a 64-bit docID to a list of
positions with that document that the word can be found in.
So, each element of the docID table stores exactly that:
So, putting it all together, the entire index file contains a header, a doctable (a hash table that maps from docID to filename), and an index. The index is a hash table that maps from a word to an embedded docIDtable. The docIDtable is a hash table that maps from a document ID to a list of word positions within that document.
The bulk of the work in this homework is in this step. We'll tackle it in parts.
git pull
to retrieve the new hw3/
folder for this assignment.
You still will need the hw1/
, hw2/
, and
projdocs/
directories in the same folder as your new
hw3
folder since hw3/
links to files in
those previous directories, and the test_tree
folder
also still needs to be present.hw3/
to
familiarize yourself with the structure.
Note that there is a libhw1/
directory that contains
a symlink to your libhw1.a
, and a
libhw2/
directory that contains a symlink to your
libhw2.a
.
You can replace your libraries with ours (from the appropriate
solution_binaries/
directories) if you prefer.make
to compile the three
HW3 binaries.
One of them is the usual unit test binary.
Run it, and you'll see the unit tests fail, crash out, and you
won't yet earn the automated grading points tallied by the test
suite.Utils.h
and
LayoutStructs.h
.
These header files contains some useful utility routines and
classes you'll take advantage of in the rest of the assignment.
We've provided the full implementation of Utils.cc
.
Next, look inside WriteIndex.h
; this header file
declares the WriteIndex()
function, which you will
be implementing in this part of the assignment.
Also, look inside buildfileindex.cc
; this file makes
use of WriteIndex()
and your HW2
CrawlFileTree()
, to crawl a file subtree and write
the resulting index out into an index file.
Try running the solution_binaries/buildfileindex
program to build one or two index files for a directory subtree,
and then run the solution_binaries/filesearchshell
program to try out the generated index file.WriteIndex.cc
and take a look around inside.
It looks complex, but all of the helper routines and major
functions correspond pretty directly to our walkthrough of the
data structures above.
Start by reading through WriteIndex()
; we've given
you part of its implementation.
Then, start recursively descending through all the functions it
calls, and implement the missing pieces.
(Look for STEP:
in the text to find what you need
to implement.)test_suite
as a first step.
Next, use your buildfileindex
binary to produce an
index file (we suggest indexing something small, like
./test_tree/tiny
as a good test case).
After that, use the solution_binaries/filesearchshell
program that we provide, passing it the name of the index file
that your buildfileindex
produces, to see if it's
able to successfully parse the file and issue queries against it.
If not, you need to fix some bugs before you move on!
buildfileindex
does a huge number
of small write operations, which can be quite slow over the
network.
To speed things up dramatically we suggest you write the index
files into /tmp
, which is a directory on a local
disk attached to each machine.
Be sure to remove the files when you're done so the disk
doesn't fill up.
hw3fsck
program we've provided in
solution_binaries
against the index that you've
produced.
hw3fsck
scans through the entire index, checking
every field inside the file for reasonableness.
It tries to print out a helpful message if it spots some kind of
problem.
Once you pass hw3fsck
and once you're able to issue
queries against your file indices, then rerun your
buildfileindex
program under valgrind and make sure
that you don't have any memory leaks or other memory errors.Now that you have a working memory-to-file index writer, the next step is to implement code that knows how to read an index file and lookup query words and docids against it. We've given you the scaffolding of the implementation that does this, and you'll be finishing our implementation.
FileIndexReader.h
.
Notice that we're now in full-blown C++ land; you'll be
implementing constructors, manipulating member variables and
functions, and so on.
Next, open up FileIndexReader.cc
.
Your job is to finish the implementation of its constructor,
which reads the header of the index file and stores various
fields as private member variables.
As above, look for "STEP:"
to figure out what you
need to implement.
When you're done, recompile and re-run the test suite.
You should pass all the tests for
test_fileindex_reader.cc
once you have implemented
FileIndexReader
successfully.HashTableReader.h
.
Read through it to see what the class does.
Don't worry about the copy constructor and assignment operator
details (though if you're curious, read through them to see what
they're doing and why).
This class serves as a base class for other subclasses.
The job of a HashTableReader
is to provide most of
the generic hash-table lookup functionality; it knows how to look
through buckets and chains, returning offsets to elements
associated with a hash value.
Open up HashTableReader.cc
.
Implement the "STEP:"
components in the constructor
and the LookupElementPositions
function.
When you're done, recompile and run the unit tests to see if you
pass test_hashtablereader.cc
unit test.DocTableReader.h
.
Read through it, and note that it is a subclass of
HashTableReader
.
It inherits LookupElementPositions()
and other
aspects, but provides some new functionality.
Next, open up DocTableReader.cc,
and implement the
"STEP:"
functionality.
See how well you do on its unit test (and valgrind) when you're
done.IndexTableReader.h
.
Read through it and understand its role.
Next, open up IndexTableReader.cc
and implement the
"STEP:"
functionality.
Test against the unit tests (and valgrind).DocIDTableReader.h
and
DocIDTableReader.cc
.QueryProcessor.h
and read through how it is
supposed to work.
Check out test_queryprocessor.cc
for more
information.
Now open up QueryProcessor.cc
and read through our
implementation of the constructor and destructor.
ProcessQuery()
, and also a clue about what you
should be building up and returning.
But, we've given you nothing about its implementation.
You get to implement it entirely on your own; you might want to
define helper private member functions, you might want to
define other structures to help along the way, etc.; it's
entirely up to you.
But, once you're finished, you'll need to pass our unit test to
know you've done it correctly.
sort
).
Remember that processing a query against an index means
ensuring all query words are present in each matching
document, and remember how ranking works.
Then, once you have query results from each index, you'll
append them all together to form your final query results.filesearchshell
implementation.
You'll be able to test the output of your
filesearchshell
against ours (in
solution_binaries/
) as a final sanity check.For Part C, your job is to implement a search shell, just like in HW2, but this time using your HW3 infrastructure you completed Parts A and B.
filesearchshell.cc
and
read through it.
Note that unlike Parts A and B, we have given you almost nothing
about the implementation of the filesearchshell
besides a really long (and hopefully helpful) comment.
Implement filesearchshell.cc
.filesearchshell
binary.
You can compare the output of your binary against a
.
The transcripts should match precisely, except perhaps for the
order of equally ranked matches.
You can also walk your filesearchshell
against a
very tiny index (tiny.idx
) in the debugger to see
if it's reading the correct fields and jumping to the correct
offsets.filesearchshell
.
filesearchshell
ought to clean up all allocated
memory before exiting.
So, run your filesearchshell
under valgrind to
ensure there are no leaks or errors.
If you want to do any of the bonus parts, first create a
hw3-final
tag in your repository to mark the version
of the assignment with the required parts of the project.
That will allow us to more easily evaluate how well you did on the
basic requirements of the assignment.
REMINDER: you may not modify the header files for the normal
submission.
Any bonus you do most not be present in the hw3-final tag,
otherwise it could mess up the results of your submisison when we
go to grade it.
Then, when you are done adding additional bonus parts, create a
new tag hw3-bonus
after committing your additions,
and push the additions and the new tag to your GitLab repository.
If we find a hw3-bonus
tag in your repository we'll
grade the extra credit parts; otherwise we'll assume that you just
did the required parts.
For these tasks you may modify the header files to add additional
members, but do not modify the existing ones and do not have these
changes show up under your hw3-final tag, these changes should only
be present under hw3-bonus
.
(word frequency) / (number of words in the document)
.
Do an informal study that evaluates whether this ranking is
better or worse, and present your evidence.
As with the previous homework, you can compile the your implementation
by using the make
command.
This will result in several output files, including an executable
called test_suite
.
After compiling your solution with make
, you can run
all of the tests for the homwork by running:
$ ./test_suite
You can also run only specific tests by passing command-line
arguments into test_suite
.
This is extremely helpful for debugging specific parts of the
assignment, especially since test_suite
can be run
with these settings through valgrind
and
gdb
!
Some examples:
QueryProcessor
tests, enter:
$ ./test_suite --gtest_filter=Test_QueryProcessor.*
QueryProcessor
, enter:
$ ./test_suite --gtest_filter=Test_QueryProcessor.TestQueryProcessorSingleIndex
You can specify which tests are run for any of the tests in the assignment — you just need to know the names of the tests! You can list them all out by running:
$ ./test_suite --gtest_list_tests
WriteIndex
tests can take a while to
run, so you can run all tests except those, using:
$ ./test_suite --gtest_filter=-Test_WriteIndex.*