Out: Monday, April 10, 2023
Due: Thursday, April 27, 2023 by 11:59 pm
In this assignment, you will use the LinkedList and HashTable modules that you built in Homework 1 in order to finish our implementation of a file system crawler, indexer, and search engine. Be sure to follow good C practices and maintain good code quality as you complete the assignment.
bug_journal.md
.As before, please read through this entire document before beginning the assignment, and please start early! There is a fair amount of coding you need to do to complete this assignment, and it will definitely expose any conceptual weaknesses you have with the prior material on C, pointers, malloc/free, and the semantics of the LinkedList and HashTable implementations.
You're going to write a module that reads the contents of a text file into memory and then parses the text file to look for words within it. As it finds words, it will build up a HashTable that contains one record for each word. Each record will contain a lowercase copy of the word, and also a sorted linked list. Each record of the linked list contains an offset within the file that the word appeared in (the first character in the file has offset zero). However, our word parser won't be very smart. It will treat as a word any non-zero sequence of alphabetic characters separated by non-alphabetic characters.
So, graphically, what your module will do is take a text file that contains something like this, where '\n' represents the newline control character that appears at the end of each line in the input file:
My goodness! I love the course CSE333.\n I'll recommend this course to my friends.\n
and produces a data structure that looks like this:
Specifically, note a few things:
FNVHash64()
function, passing the
string as the first argument, and the strlen(string)
as the second argument.You should follow these steps:
git pull
to retrieve new folders containing
the starter code for hw2 and a directory containing test data
for the remaining parts of the project.
Check that you have everything.
$ git pull ...git output... $ ls cpplint.py exercises gtest hw0 hw1 hw2 projdocs
(Note that you still need the hw1 directory; hw2 won't build properly without it. It's ok if you've deleted your hw0 directory.)
hw2
, and note a few things:
libhw1/
.
It has symbolic links to header files and your
libhw1.a
from ../hw1
.
Therefore you should make sure you have compiled everything
in ../hw1
while working on hw2.libhw1.a
is in
../hw1/solution_binaries
; just copy it over your
../hw1/libhw1.a
.Makefile
that compiles the project, a series of
files (i.e., test_*
) that contain our
unit tests, and some files (DocTable.c
,
DocTable.h
, CrawlFileTree.c
,
CrawlFileTree.h
, FileParser.c
,
FileParser.h
, MemIndex.c
,
MemIndex.h
, searchshell.c
) that
contain our partial implementation of the project../test_suite
.
It should fail (and might not even terminate), since most of
the implementation is missing!test_tree/
that contains a bunch of text files and
subdirectories containing more text files.
Explore this subdirectory and its contents; start with the
README.TXT
file.FileParser.c
.
Start by reading through FileParser.h
and make sure
you understand the semantics of the functions.
Also, look at the WordPositions
structure defined in
fileparser.h
and compare it to the figure above.
The function ParseIntoWordPositionsTable()
builds
a HashTable that looks like the figure, and each value in the
HashTable contains a heap-allocated WordPositions
structure.FileParser.c
to get a sense of its layout, and look
for all occurrences of STEP X (e.g., STEP 1, STEP 2,
...) for where you need to add code.
Be sure to read the full file before adding any code, so you can
see the full structure of what we want you to do.
Once you're finished adding code, run the test suite and you
should see some tests start to succeed!solution_binaries/
we've provided you with Linux
executables (i.e., test_suite
and
searchshell
) that were compiled with our complete,
working version of HW2.
You can run them to see what should happen when your HW2 is
working.static
) functions when that makes sense.At a high-level, a search engine has three major components: a crawler, an indexer, and a query processor. A crawler explores the world, looking for documents to index. The indexer takes a set of documents found by the crawler, and produces something called an inverted index out of them. A query processor asks a user for a query, and processes it using the inverted index to figure out a list of documents that match the query.
Your file system crawler will be provided with the name of a directory in which it should start crawling. Its job is to look through the directory for documents (text files) and to hand them to the indexer, and to look for subdirectories; it recursively descends into each subdirectory to look for more documents and sub-sub-directories. For each document it encounters, it will assign the document a unique "document ID", or "docID". A docID is just a 64-bit unsigned integer.
Your crawler itself will build two hash tables in memory, adding to them each time it discovers a new text file. The two hash tables map from docID to document filename, and from document filename to docID:
For each document the crawler finds, it will make use of your Part
A code to produce a word hashtable using
ParseIntoWordPositionsTable()
.
This is the heart of the search engine.
The job of the indexer is to take each word hashtable produced by
ParseIntoWordPositionsTable()
, and fold its contents
in to an inverted index.
An inverted index is just a hash table that maps from a word to a
"posting list," and a posting list is just a list of places that
word has been found.
Specifically, the indexer should produce an in-memory hash table that looks roughly like this:
Walking through it, the inverted index is a hash table that maps from a (hash of a) word to a structure. The structure (shown in green) contains the word as a string, and also a HashTable. The HashTable maps from the docID (not the hash of docID) to a LinkedList. The LinkedList contains each position that word appeared in that docID.
So, based on the figure, you can see that the word "course" appeared in a single document with docID 3, at byte offsets 25 and 62 from the start of file. Similarly, the word "love" appears in three documents: docID 1 at positions 7 and 92, docID 3 at position 16, and docID 4 at positions 18, 21, and 55.
The bulk of the work in this homework is in this step. We'll tackle it in parts.
DocTable.h
; this is the public
interface to the module that builds up the docID-to-docname
HashTable and the docname-to-docID HashTable.
Make sure you understand the semantics of everything in that
header file; note how we can now implement procedural-style class
composition!
We create a single DocTable structure contains both of these
tables, so when you implement DocTable_Allocate()
,
you'll end up malloc'ing a structure that contains two
HashTables, and you'll allocate each of those HashTables.DocTable.c
;
this is our partially completed implementation.
Be sure to read the full file.
Your job, as always, is to look for the "STEP X" comments and
finish our implementation.
Once you've finished the implementation, re-compile and re-run
the test_suite
executable to see if you pass our
tests.
If not, go back and fix some bugs!CrawlFileTree.h
; this is the public interface to our
file crawler module.
Make sure you understand the semantics of everything in that
header file.
Next, read through the full CrawlFileTree.c
and then
complete our implementation.
Once you're ready, re-compile and re-run the
test_suite
executable to see if you pass more tests.
If not, go back and fix some bugs!MemIndex.h
.
This is the public interface to the module that builds up the
in-memory inverted index.
Make sure you understand the semantics of everything in that
header file and note how we implement procedural-style
inheritance!
Next, read the full MemIndex.c
, and then complete
our implementation.
(This is the most involved part of the assignment.)
Once you're ready, re-compile and re-run the
test_suite
executable to see if you pass our tests.
If not, go back and fix some bugs!Now that you have a working inverted index, you're ready to build your search engine. The job of a search engine is to receive queries from a user, and return a list of matching documents in some rank order.
For us, a query is just a string of words, such as:
course friends my
The goal of our search engine is to find all of the documents that contain all of the words. So, if a document has the words "course" and "my" but not the word "friends," it shouldn't match.
To execute a query, first the query processor must split the query
up into a list of words (the strtok_r()
function is
useful for this).
Next, it looks up in the inverted index the list of documents that
match the first word.
This is our initial matching list.
Next, for each additional word in the query, the query processor uses the inverted index to get access to the HashTable that contains the set of matching documents. For each document in the matching list, the query processor tests to see if that document is also in the HashTable. If so, it keeps the document in the matching list, and if not, it removes the document from the matching list.
Once the processor has processed all of the words, it's done. Now, it just has to rank the results, sort the results by rank, and return the sorted result list to the user.
For us, our ranking function is very simple: given a document that matches against a query, we sum up the number of times each query word appears in the document, and that's our rank. So, if the user provides the query "foo" and that words appears on a document 5 times, the rank for that document given that query is 5. If the user provides a multi-word query, our sum includes the number of times each of the words in the query appears. So, if the user provides the query "foo bar", the word "foo" appears in the document 5 times, and the word "bar" appears 10 times, the rank of the document for that query is 15. The bigger the sum, the higher the rank, and therefore the "better" the search result.
We have provided a mostly empty skeleton file
searchshell.c
.
It is up to you to complete it by implementing a program that uses
the Linux console to interact with the user.
When you run the query processor (called searchshell
— you can try a working searchshell
in the
solution_binaries/
directory), it takes from a
command-line argument the name of a directory to crawl.
After crawling the directory and building the inverted index, it
enters a query processing loop, asking the user to type in a search
string and printing the results.
If the user signals end-of-file when the program asks for a search
string (i.e., control-D from the Linux terminal
keyboard), the program should clean up any allocated resources
– particularly memory – and shut down.
When you are ready, try running ./searchshell
to
interact with your program and see if your output matches the
.
Alternatively, compare your searchshell to the searchshell we
provided in the solution_binaries/
directory.
Note that our ranking function does not specify an ordering for two
documents with the same rank.
Documents that have the same rank may be listed in any order, and
that might be different from the ordering in the sample transcript
or produced by the solution_binaries
version of
searchshell
.
If you do work on either of the bonus tasks, you must also
include a hw2-bonus
tag in your repository.
While grading, we will use whichever commit has that tag to examine
the bonus, so it may be the same or a different commit from the one
that has hw2-final
.
If you do not have a hw2-bonus
tag in your repository,
we will assume you did not choose to submit anything for the bonus.
When searching text documents for a given query, some words in the query do not add any value to the quality of the search results. These words are known as stop words and include common words such as the, is, and, at, which, there, on, and so on. Some Web search engines filter these stop words, excluding them from both their indexes and ignoring them in queries; this results in both better search quality and significantly more efficient indexes and query execution.
Implement a stop word filter. The search shell
should accept a second, optional argument "-s" that
will act as a flag for turning the filter on.
When the flag is not specified, your program should not filter
any stop words.
It is up to you to decide how you will implement the stop word
filter (and where you'll get your list of stop words), but be
sure to create a README.md
file and explain in
that file what changes you had to make and how your filter
works.
Stop words that have apostrophes in them should be handled the
same way that you've handled the words in the documents.
You probably noticed that we went to a lot of trouble to have you include word position information in our inverted index posting lists, but we didn't make use of it. In this bonus task, you will! In addition to letting you search for words, modern search engines also let you search for phrases. For example, I could specify the following query:
alice "cool fountains" flowers
This query would match all documents that contain all of the following:
Using the positions information in the inverted index postings
list, implement support for phrase search.
You'll have to also modify the query processor to support
phrase syntax; phrases are specified by using quotation marks.
Be sure to create a README.md
file that describes
what changes you had to make to get phrasing to work.
As with hw1, you can compile your implementation by using the
make
command.
This will result in several output files, including an executable
called test_suite
.
After compiling your solution with make
, you can run
all of the tests for the homework by running:
$ ./test_suite
You can also run only specific tests by passing command-line
arguments into test_suite
.
This is extremely helpful for debugging specific parts of the
assignment, especially since test_suite
can be run
with these settings through valgrind
and
gdb
!
Some examples:
FileParser
tests, enter:
$ ./test_suite --gtest_filter=Test_FileParser.*
ReadFileToString
from
FileParser
, enter:
$ ./test_suite --gtest_filter=Test_FileParser.ReadFileToString
You can specify which tests are run for any of the tests in the assignment — you just need to know the names of the tests! You can list them all out by running:
$ ./test_suite --gtest_list_tests
test_suite
code.
Be sure to also run your code on small sample files and directories
where you can predict in advance exactly what data structures
should be created and what their contents should be, and then use
gdb or other tools to verify that things are working
exactly as expected.