The goals of this project are to:
For this first project, you will be using the Instructional Linux Servers (ceylon.cs, fiji.cs, sumatra.cs, or tahiti.cs) to do your code development and debugging using the gcc C compiler. Each of you should already have an account on these machines. If you are not a CS/CSE major and need an account, go here for information on how to obtain an account.
Search engines are, at their heart, remarkably simple programs. In this project, you'll be building a search engine for files in a UNIX file system. There are two essential components to every search engine, including yours:
Let's dive into each of these components a little deeper.
- The crawler: this is the part of the program that "crawls" through the set of documents, building up an "index" of the content in those documents. The crawler component of your program will sift through the files in a subtree of a Unix file system, looking for text files: files that end in .txt, which we assume contain a set of whitespace- and punctuation-separated ASCII words. Your crawler will read in all of these files, parse them to extract all of the words in each document, and then add the words to an index.
- The index: this is the part of the program that keeps track of which words were found in which documents in the file system. You will be building a hashtable-based "inverted index". An inverted index is simply a table that maps a word to the set of documents that contain that word.
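As a sketch of the word-extraction step, the following function reads one lowercase, punctuation-free word at a time with fgetc. The name next_word, the buffer-size handling, and the choice to treat every non-letter as a separator are our own assumptions, not requirements of the assignment:

```c
#include <ctype.h>
#include <stdio.h>

/* Read the next word from `fp` into `buf` (capacity `cap`), lowercasing
 * letters and skipping everything else between words.  Returns 1 if a
 * word was read, 0 at end of file.  A sketch only -- your crawler may
 * define "word" slightly differently. */
int next_word(FILE *fp, char *buf, size_t cap) {
    int c;
    size_t n = 0;
    while ((c = fgetc(fp)) != EOF) {
        if (isalpha(c)) {
            if (n + 1 < cap)
                buf[n++] = (char)tolower(c);
        } else if (n > 0) {
            break;              /* a non-letter ends the current word */
        }
    }
    buf[n] = '\0';
    return n > 0;
}
```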
The Crawler
When launching your program, you will give it a command-line argument: the "root" of the directory subtree to crawl. For example, you may invoke your program like:
[fiji.cs]$ ./textcrawler foo/bar/baz
if you want it to start crawling at the foo/bar/baz directory. The crawler needs to crawl through the directory tree rooted at the specified directory, by recursively looking at all files and subdirectories, starting at the specified root. You will be relying on various system calls to do your crawling, including:
- man 3 opendir
- man 3 readdir
- man 2 stat
- man 3 ctime
- man 3 fopen
- man 3 fgetc
You should watch out for the directories "." and "..", and not follow them: they point to the "current" directory and the "parent" directory, respectively, and following them would break the recursion. Symbolic links, whether to directories or to files, should be ignored as well; not doing so would result in files being read multiple times, or in infinite recursion if a link points to an ancestor directory. For each file that ends in ".txt", the crawler should read through the file and read in all of its words, converting all characters to lower case and making sure that the "words" do not contain any punctuation.
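A minimal sketch of the recursive walk follows. The function name, the fixed-size path buffer, and the decision to merely count .txt files (rather than index them) are our own simplifications; note the use of lstat rather than stat, so that symbolic links can be detected before they are followed:

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Does `name` end in ".txt"? */
static int is_txt(const char *name) {
    size_t n = strlen(name);
    return n > 4 && strcmp(name + n - 4, ".txt") == 0;
}

/* Walk the tree rooted at `path`, returning the number of regular
 * ".txt" files found.  A real crawler would index each file instead
 * of just counting it. */
int crawl(const char *path) {
    DIR *dir = opendir(path);
    if (!dir)
        return 0;                           /* unreadable: skip quietly */
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;                       /* these would break recursion */
        char full[4096];
        snprintf(full, sizeof full, "%s/%s", path, ent->d_name);
        struct stat st;
        if (lstat(full, &st) != 0 || S_ISLNK(st.st_mode))
            continue;                       /* ignore symlinks entirely */
        if (S_ISDIR(st.st_mode))
            count += crawl(full);           /* recurse into subdirectory */
        else if (S_ISREG(st.st_mode) && is_txt(full))
            count++;                        /* a real crawler indexes here */
    }
    closedir(dir);
    return count;
}
```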
The Inverted Index
An inverted index is a hash table with the words pulled from the .txt files used as the keys and information about the files that contained them as the stored value. The hash table need not be fancy. If you use a chained hash table, you can handle any number of words without resizing (though not efficiently). A possible way to hold the file information is in a vector, where the hash table simply holds the index of the appropriate file in the "File List" vector.
Putting it all together
Once the crawler has built the inverted index, you need to be able to search for any words that were found and indexed. The file information that should be returned by the search needs to include the document name, its size, the number of times the word occurred in the file, when it was last modified, and the first few words from the file. The following is an example session for the textcrawler:
fiji:~$ ./textcrawler ./txtfiles
... crawling ...
... done ...
Search term: fox
(1) ./txtfiles/singleton.txt
Size: 246 bytes
Occurrences: 1
Last modified: Fri Mar 28 10:56:32 2003
"such the fox jumped memphis..."
(3) ./txtfiles/gutenburg/hamlet.txt
Size: 193043 bytes
Occurrences: 1
Last modified: Sat Mar 29 07:08:53 2003
"Project Gutenberg Etext of Hamlet..."
Search term: foo
"foo" not found.
We've put together a "tar" archive containing a sample collection of text files in a directory tree, and also an example C program and "Makefile". You can grab this tar archive by logging into one of the instructional Unix machines, and running the command:
cp /projects/cse/courses/cse451/03sp/projects/intro-project.tar.gz .
Type "tar -xvzf intro-project.tar.gz" to extract the archive. Move into the extracted directory with "cd intro-project". The project can be compiled with "make" or "make textcrawler".
You must provide us with the following:
1. The C source code for the crawler, and a Makefile that builds it. We will be compiling your crawler and running it against our own private test cases.
2. A text file that is a histogram which shows the 20 most popular words found by the crawler, and the total number of occurrences of these words across all text files. You will need to "special case" your crawler to emit this histogram.
3. A writeup (strictly less than 1 page long) describing what you did and how you did it (e.g., what data structures you used and why), extra features you implemented if any, and what your self-assessment of your implementation is: are there any bugs you're aware of, limitations, improvements you wish you had time for but didn't do, etc.
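For the histogram in item 2, one approach is to collect (word, total-count) pairs from the hash table into an array and sort it with qsort. A sketch with invented names; in the real crawler the array would be filled by walking every bucket chain:

```c
#include <stdio.h>
#include <stdlib.h>

/* One row of the histogram: a word and its total count over all files. */
struct tally {
    const char *word;
    int count;
};

/* qsort comparator: larger counts first. */
static int by_count_desc(const void *a, const void *b) {
    const struct tally *x = a, *y = b;
    return y->count - x->count;
}

/* Sort the `n` tallies by descending count and print at most 20. */
void print_histogram(struct tally *t, size_t n) {
    qsort(t, n, sizeof *t, by_count_desc);
    size_t top = n < 20 ? n : 20;
    for (size_t i = 0; i < top; i++)
        printf("%s %d\n", t[i].word, t[i].count);
}
```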
We will be using the turnin(1L) program to turn in the assignment.
Run "turnin -c cse451 -p introproj [FILES]". I suggest that
you put all of the files you want to turn in into a directory, and then specify
that directory to the "turnin" program. If all goes well, the following
message should be displayed:
"Compressing submitted files... please wait
Your files have been submitted to cse451, introproj for grading.".