Out:   Friday, October 10, 2025
    Due:   Thursday, October 23, 2025 by 11:59 pm
    
In this assignment you will use the LinkedList and HashTable modules that you built in Homework 1 in order to finish our implementation of a file system crawler, indexer, and search engine:
As before, please read through this entire document before beginning the assignment, and please start early! There is a fair amount of coding you need to do to complete this assignment, and it will definitely expose any conceptual weaknesses you have with the prior material on C, pointers, malloc/free, and the semantics of the LinkedList and HashTable implementations.
You're going to write a module that reads the contents of a text file into memory and then parses the text file to look for words within it. As it finds words, it will build up a HashTable that contains one record for each word. Each record will contain a lowercase copy of the word, and also a sorted linked list. Each record of the linked list contains an offset within the file where the word appeared (the first character in the file has offset zero).
Our word parser won't be very smart. It will treat as a word any non-zero sequence of alphabetic characters separated by non-alphabetic characters.
So, graphically, what your module will do is take a text file that contains something like this:
My goodness! I love the course CSE333.\n I'll recommend this course to my friends.\n
(where '\n' represents the newline control character that appears at the end of each line in the input file) and produces a data structure that looks like this:
 
            Specifically, note a few things:
FNVHash64() function, passing
                the string as the first argument, and the
                strlen(string) as the second argument.
                
              You should follow these steps:
git pull to retrieve new folders containing
              the starter code for hw2 and a directory containing test data
              for the remaining parts of the project.  Check that you have
              everything.bash$ git pull ...git output... bash$ ls cpplint.py exercises gtest hw0 hw1 hw2 projdocs
             (Note that you still need the hw1 directory; hw2
             won't build properly without it.  It's ok if you've deleted your
             hw0 directory.)
hw2, and note a few things:
              
              libhw1/.  It
                  has symbolic links to header files and your
                  libhw1.a from ../hw1.  Therefore
                  you should make sure you have compiled everything in
                  ../hw1 while working on hw2.)
                  
                libhw1.a is in
                  ../hw1/solution_binaries; just copy it over your
                  ../hw1/libhw1.a.
                  
                Makefile that
                  compiles  the project, a series of files (i.e.,
                  test_*) that contain our unit tests, and some
                  files (DocTable.c, DocTable.h,
                  CrawlFileTree.c, CrawlFileTree.h,
                  FileParser.c, FileParser.h,
                  MemIndex.c, MemIndex.h,
                  searchshell.c) that contain our
                  partial implementation of the project.
                  
                make to compile the project, and try running the
                  test suite by running ./test_suite.  It should
                  fail (and might not even terminate), since
                  most of the implementation is missing!
              test_tree/ that
              contains a bunch of text files and subdirectories containing more
              text files.  Explore this subdirectory and its contents; start
              with the README.TXT file.
              
            FileParser.c.  Start by reading through
              FileParser.h and
              make sure you understand the semantics of the functions.  Also,
              look at the WordPositions structure defined in
              fileparser.h and compare it to the figure above.
              The function ParseIntoWordPositionsTable() builds
              a HashTable that looks like the figure, and each value in the
              HashTable contains a heap-allocated WordPositions
              structure.
              
            FileParser.c to get
              a sense of its layout, and look for all occurrences of STEP X
              (e.g., STEP 1, STEP 2, ...) for where you need to add code.  Be
              sure to read the full file before adding any code, so you can see
              the full structure of what we want you to do.  Once you're
              finished adding code, run the test suite and you should see some
              tests start to succeed!
              
            solution_binaries/
              we've provided you with linux executables (i.e.,
              test_suite and searchshell)
              that were compiled with our complete, working version
              of HW2.  You can run them to see what should happen
              when your HW2 is working.
              
            static) functions
              when that makes sense.
              
          At a high-level, a search engine has three major components: a crawler, an indexer, and a query processor. A crawler explores the world, looking for documents to index. The indexer takes a set of documents found by the crawler, and produces something called an inverted index out of them. A query processor asks a user for a query, and processes it using the inverted index to figure out a list of documents that match the query.
File system crawler: Your file system crawler will be provided with the name of a directory in which it should start crawling. Its job is to look through the directory for documents (text files) and to hand them to the indexer, and to look for subdirectories; it recursively descends into each subdirectory to look for more documents and sub-sub-directories. For each document it encounters, it will assign the document a unique "document ID", or "docID". A docID is just a 64-bit unsigned integer.
Your crawler itself will build two hash tables in memory, adding to them each time it discovers a new text file. The two hash tables map from docID to document filename, and from document filename to docID:

            For each document the crawler finds, it will make use of
            your part A code to produce a word hashtable using
            ParseIntoWordPositionsTable().
          
Indexer: This is the heart of the search engine.  The job
            of the indexer is to take each word hashtable produced by
            ParseIntoWordPositionsTable(), and fold its contents
            in to an inverted index.  An
            inverted index is easy to understand; it's just a hash table that
            maps from a word to a "posting list," and a posting list is just
            a list of places that word has been found.
          
Specifically, the indexer should produce an in-memory hash table that looks roughly like this:

Walking through it, the inverted index is a hash table that maps from a (hash of a) word to a structure. The structure (shown in green) contains the word as a string, and also a HashTable. The HashTable maps from the docID (not the hash of docID) to a LinkedList. The LinkedList contains each position that word appeared in that docID.
So, based on the figure, you can see that the word "course" appeared in a single document with docID 3, at byte offsets 25 and 62 from the start of file. Similarly, the word "love" appears in three documents: docID 1 at positions 7 and 92, docID 3 at position 16, and docID 4 at positions 18, 21, and 55.
The bulk of the work in this homework is in this step. We'll tackle it in parts.
DocTable.h; this is the public
              interface to the module that builds up the docID-to-docname
              HashTable and the
              docname-to-docID HashTable.  Make sure you understand the
              semantics of everything in that header file; note how we can now
              implement procedural-style class composition!  We create a single
              DocTable structure contains both of these tables, so when you
              implement DocTable_Allocate(), you'll end up
              malloc'ing a structure that contains two HashTables, and you'll
              allocate each of those HashTables.
              
            DocTable.c; this is our
              partially completed implementation.  Be sure to read the full
              file.  Your job, as always, is to look for the "STEP X" comments
              and finish our implementation.  Once you've finished the
              implementation, re-compile and re-run the test_suite
              executable to see if you pass our tests.  If not, go back and
              fix some bugs!
              
            CrawlFileTree.h; this is
              the public interface to our file crawler module.  Make sure you
              understand the semantics of everything in that header file.
              Next, read through the full CrawlFileTree.c and
              then complete our implementation.  Once you're ready, re-compile
              and re-run the test_suite executable to see if you
              pass more tests.  If not, go back and fix some bugs!
              
            MemIndex.h.  This is
              the public interface to the module that builds up the in-memory
              inverted index.  Make sure you understand the semantics of
              everything in that header file and note how we implement
              procedural-style inheritance!  Next, read the full
              MemIndex.c, and then complete
              our implementation.  (This is the most involved part of the
              assignment.)  Once you're ready, re-compile and re-run the
              test_suite executable to see if you our tests.  If not, go
              back and fix some bugs!
              
              Once you've passed all of the tests, re-rerun the test
                suites under valgrind and make sure you don't have any memory
                leaks.
              
Congrats, you've passed part B of the assignment!
Now that you have a working inverted index, you're ready to build your search engine. The job of a search engine is to receive queries from a user, and return a list of matching documents in some rank order.
For us, a query is just a string of words, such as:
course friends myThe goal of our search engine is to find all of the documents that contain all of the words. So, if a document has the words "course" and "my" but not the word "friends," it shouldn't match.
            To execute a query, first the query
            processor must split the query up into a list of words (the
            strtok_r() function is useful for this).  Next, it
            looks up in the inverted index the
            list of documents that match the first word.  This is our
            initial matching list.
          
Next, for each additional word in the query, the query processor uses the inverted index to get access to the HashTable that contains the set of matching documents. For each document in the matching list, the query processor tests to see if that document is also in the HashTable. If so, it keeps the document in the matching list, and if not, it removes the document from the matching list.
Once the processor has processed all of the words, it's done. Now, it just has to rank the results, sort the results by rank, and return the sorted result list to the user.
For us, our ranking function is very simple: given a document that matches against a query, we sum up the number of times each query word appears in the document, and that's our rank. So, if the user provides the query "foo" and that words appears on a document 5 times, the rank for that document given that query is 5. If the user provides a multi-word query, our sum includes the number of times each of the words in the query appears. So, if the user provides the query "foo bar", the word "foo" appears in the document 5 times, and the word "bar" appears 10 times, the rank of the document for that query is 15. The bigger the sum, the higher the rank, and therefore the "better" the search result.
We have provided a mostly empty skeleton file
            searchshell.c.
            It is up to you to complete it by implementing a program that uses
            the Linux console to interact with the user.  When you run the
            query processor (called searchshell -- you can try a
            working searchshell in the
            solution_binaries/ directory), it takes from a
            command line argument the name of a directory to crawl.  After
            crawling the directory and building the inverted index, it enters
            a query processing loop, asking the user to type in a search
            string and printing the results.  If the user signals end-of-file
            when the program asks for a search string (i.e., control-D from the
            linux terminal keyboard), the program should clean up any
            allocated resources - particularly memory - and shut down.
          
When you are ready, try running ./searchshell to
            interact with your program and see if your output matches the
            transcript from a search session with
            our solution.  Alternatively, compare your searchshell to
            the searchshell we provided in the solution_binaries/
            directory.  Note that our ranking function does not specify an
            ordering for two documents with the same rank.  Documents that
            have the same rank may be listed in any order, and that might
            be different from the ordering in the sample transcript or
            produced by the solution_binaries version of
            searchshell.
        
We're offering two bonus tasks for a tiny amount of extra credit. These are purely optional; if you choose not to do either, your grade won't be negatively affected at all. These are just if you happen to have the time and interest! You can do either or both of the bonus tasks.
            If you do work on either of the bonus tasks, you must also
            include a hw2-bonus tag in your repository. While
            grading, we will use whichever commit has that tag to examine the
            bonus, so it may be the same or a different commit from the one
            that has hw2-final.  If you do not have a
            hw2-bonus tag in your repository, we will assume you
            did not choose to submit anything for the bonus (again, which will
            not negatively affect your grade in any way!)
          
            Also, if you do either or both bonus parts of the project, you must add a
            readme-hw2-bonus.md text file to your hw2/ directory
            and include this in the files you push to your gitlab repo.  This file
            should contain a brief description of the bonus work you have done
            as requested below.
            This should be brief - a few sentences or a couple of paragraphs at the most
            for each extra part should probably be sufficient.
Implement a stop word filter. The search
                shell should accept a second, optional argument "-s"
                that will act as a flag for turning the filter on. When the
                flag is not specified, your program should not filter any
                stop words. It is up to you to decide how you will implement
                the stop word filter (and where you'll get your list of stop
                words), but be sure to explain in your readme-hw2-bonus.md file what
                changes you had to make and how your filter works.  Stop words
                that have apostrophes in them should be handled the same way
                that you've handled the words in the documents.
              
alice "cool fountains" flowersThis query would match all documents that contain all of the following:
                Using the positions information in the inverted index postings
                list, implement support for phrase search.  You'll have to
                also modify the query processor to support phrase syntax;
                phrases are specified by using quotation marks. Be sure to
                include in your readme-hw2-bonus.md file
                a description of the changes you had
                to make to get phrasing to work.
              
As with hw1, you can compile your implementation using the
make command. This will result in several output files, including
an executable called test_suite.
After compiling your solution with make, you can run all of the
tests for the homework by running:
bash$ ./test_suite
You can also run only specific tests by passing command line arguments
into test_suite.
For example, to only run the FileParser tests, you can type:
bash$ ./test_suite --gtest_filter=Test_FileParser.*
If you only want to test ReadFileToString from FileParser, you can type:
bash$ ./test_suite --gtest_filter=Test_FileParser.ReadFileToString
You can specify which tests are run for any of the tests in the assingment. You just need to know the names of the tests, and you can do this by running:
bash$ ./test_suite --gtest_list_tests
These settings can be helpful for debugging specific parts of the assignment,
especially since test_suite can be run with these settings through
valgrind and gdb!
However, you should not debug your code using only the supplied tests.
The test setup and code are complex enough that it can be hard to isolate problems
effectively without spending excessive amounts of time trying to reverse-engineer
the details of the test_suite code.
Be sure to also run your code on small sample files and directories where you can predict
in advance exactly what data structures should be created and what their contents should be,
and then use gdb
or other tools to verify that things are working exactly as expected.
In addition to passing tests, your code should be high quality and readable. This includes several aspects:
static)
		helper functions, be sure to provide good comments that explain the function
		inputs, outputs, and behavior.  These comments can often be relatively brief
		as long as they
		convey to the reader the information needed to understand how to use
		the function and what it does when executed.cpplint.py --clint tool
		to check for style issues, and use valgrind to check for memory
		problems.  Be sure to fix issues reported before submitting
		your code.  Exception: if cpplint reports style problems in the
		supplied starter code, you should leave that code as-is.When you are ready to turn in your assignment, you should follow
            the same procedures you used for hw0 and hw1, except this time tag
            the repository with hw2-final.  Remember to clean up,
            commit, and push all necessary files to your repository before you
            add the tag.  After you have created and pushed the
            tag, remember to test everything in the CSE Linux environment by
            creating a new clone of the repository in a separate, empty
            directory, checkout the hw2-final tag, and verify that
            everything works as expected.  Refer to the hw0 turnin instructions
            for details, and follow those steps carefully.
When you clone your repository to check it, it normally will not
            include hw1/libhw1.a, which is needed to build hw2.
            Either run "make" in hw1 to recreate it using your solution to hw1,
            or copy the
            libhw1.a file in hw1/solution_binaries
            and place it in the main hw1 folder to use the sample solution version
            of that library.
It is YOUR responsibility to check your work. If your project doesn't build properly when the same steps are done by the course staff to grade it, you may lose a huge amount of the possible credit for the assignment even if almost absolutely everything is actually correct.
If you have done any of the bonus parts, be sure to include
            a hw2-bonus tag in your repository to indicate the commit
            with the bonus version.
            Be sure to test this code by cloning the repo, checkout that tag,
            and verify that everything works as expected.
We will be basing your grade on several elements:
test_suite.cc.  If your code fails a
                  test, we won't attempt to understand why: we're planning on
                  just including the number of points that the test drivers
                  print out.
                  
                Remember: Both code correctness and code quality matter. Both are weighed significantly in the evaluation of your project.