A Search Engine for Unix files

Overview

In this introductory project, you'll build a search engine for files in a UNIX file system.  In a nutshell, you will write a program that scans through a portion of a file system, looks for text files (files whose names end in .txt), reads in those files, and builds an inverted index of the words in those files. Using this inverted index, your program will prompt users for a word to search for, and then return a list of the scanned files that contain that word. Think of this as a simple Google for text files in a file system, as opposed to HTML documents on the web.

The goals of this project are to get you writing and debugging C code on the instructional Unix machines, to give you practice with the system calls used to navigate a Unix file system, and to have you build and use a simple data structure (the inverted index) on real file data.


The Project Labs

For this first project, you will be using the Instructional Linux Servers (ceylon.cs, fiji.cs, sumatra.cs, or tahiti.cs) to do your code development and debugging using the gcc C compiler. Each of you should already have an account on these machines. If you are not a CS/CSE major and need an account, see the course web site for information on how to obtain one.

What you will build

Search engines are, at their heart, remarkably simple programs. In this project, you'll be building a search engine for files in a UNIX file system. There are two essential components to every search engine, including yours: a crawler, which walks through the available documents (here, the .txt files under a directory tree) and extracts the words they contain, and an inverted index, which maps each word to the documents in which it appears.

Let's dive into each of these components a little deeper.

The Crawler

When launching your program, you will give it a command-line argument: the "root" of the directory subtree to crawl.  For example, you may invoke your program like:

[fiji.cs]$   ./textcrawler  foo/bar/baz

if you want it to start crawling at the foo/bar/baz directory.  The crawler must walk the directory tree rooted at the specified directory, recursively visiting every file and subdirectory.  You will be relying on various system calls and library routines to do your crawling, including opendir(3), readdir(3), and closedir(3) to enumerate directory entries, lstat(2) to distinguish regular files, directories, and symbolic links, and open(2)/read(2)/close(2) (or their stdio equivalents) to read the text files themselves.

You should watch out for the directory entries "." and "..", and not follow them: they refer to the current directory and the parent directory, respectively, so following them would keep the recursion from ever terminating. Symbolic links, whether to directories or to files, should be ignored as well; following them can cause files to be read more than once, or can lead to infinite recursion if a link points to an ancestor directory. For every file whose name ends in ".txt", the crawler should read through the file and collect all of its words, converting all characters to lower case and making sure that the "words" contain no punctuation.
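As a rough illustration only (not a required structure), the directory walk might look like the sketch below. The names crawl() and handle_txt_file() are made up for this example, and error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* handle_txt_file() is a hypothetical hook where a .txt file would be
   read and its words added to the inverted index. */
void handle_txt_file(const char *path);

void crawl(const char *path) {
    DIR *dir;
    struct dirent *entry;

    dir = opendir(path);
    if (dir == NULL)
        return;                              /* unreadable directory: skip it */

    while ((entry = readdir(dir)) != NULL) {
        char child[4096];
        struct stat sb;
        size_t len;

        /* Skip "." and ".." so the recursion terminates. */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        snprintf(child, sizeof(child), "%s/%s", path, entry->d_name);

        if (lstat(child, &sb) != 0)          /* lstat does not follow links */
            continue;
        if (S_ISLNK(sb.st_mode))
            continue;                        /* ignore symbolic links entirely */

        if (S_ISDIR(sb.st_mode)) {
            crawl(child);                    /* recurse into the subdirectory */
        } else if (S_ISREG(sb.st_mode)) {
            len = strlen(child);
            if (len > 4 && strcmp(child + len - 4, ".txt") == 0)
                handle_txt_file(child);      /* read words, update the index */
        }
    }
    closedir(dir);
}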

The Inverted Index

An inverted index is a hash table whose keys are the words pulled from the .txt files and whose values record information about the files that contained them. The hash table need not be fancy; if you use a chained hash table, you can handle any number of words without resizing (though not efficiently). One convenient way to hold the per-file information is a "file list" vector, with the hash table simply storing, for each word, the indices of the matching files in that vector.
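One way (of many) to structure this in C is sketched below. The struct and function names are illustrative only, and the hash function is just a simple placeholder.

#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 4096

/* One posting: the word occurred 'count' times in the file stored at
   position 'file_index' in the file-list vector. */
struct posting {
    int file_index;
    int count;
    struct posting *next;
};

/* One chained hash-table entry, holding a single word and its postings. */
struct word_entry {
    char *word;
    struct posting *postings;
    struct word_entry *next;          /* collision chain */
};

static struct word_entry *buckets[NUM_BUCKETS];

/* A simple multiplicative string hash; anything reasonable will do. */
static unsigned int hash_word(const char *word) {
    unsigned int h = 0;
    for (; *word != '\0'; word++)
        h = h * 31 + (unsigned char)*word;
    return h % NUM_BUCKETS;
}

/* Return the entry for a word, creating an empty one if it is new. */
struct word_entry *lookup_or_insert(const char *word) {
    unsigned int b = hash_word(word);
    struct word_entry *e;

    for (e = buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;

    e = malloc(sizeof(*e));            /* for a real program, check for NULL */
    e->word = strdup(word);
    e->postings = NULL;
    e->next = buckets[b];
    buckets[b] = e;
    return e;
}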

Putting it all together

Once the crawler has built the inverted index, you need to be able to search for any word that was found and indexed.  For each matching file, the search should report the document name, its size, the number of times the word occurred in the file, when it was last modified, and the first few words from the file.  The following is an example session for the textcrawler:

fiji:~$ ./textcrawler ./txtfiles
... crawling ...
... done ...

Search term: fox
   (1) ./txtfiles/singleton.txt
        Size: 246 bytes
        Occurrences: 1
        Last modified: Fri Mar 28 10:56:32 2003
        "such the fox jumped memphis..."

   (3) ./txtfiles/gutenburg/hamlet.txt
        Size: 193043 bytes
        Occurrences: 1
        Last modified: Sat Mar 29 07:08:53 2003
        "Project Gutenberg Etext of Hamlet..."

Search term: foo
"foo" not found.

Compiling your project

We've put together a "tar" archive containing a sample collection of text files in a directory tree, and also an example C program and "Makefile".  You can grab this tar archive by logging into one of the instructional Unix machines, and running the command:

   cp /projects/cse/courses/cse451/03sp/projects/intro-project.tar.gz .

Type "tar -xvzf intro-project.tar.gz" to extract the archive. Move into the extracted directory with "cd intro-project". The project can be complied with "make" or "make textcrawler".

Deliverables

You must provide us with the following:

1. The C source code for the crawler, and a Makefile that builds it.  We will be compiling your crawler and running it against our own private test cases.

2. A text file containing a histogram that shows the 20 most popular words found by the crawler, along with the total number of occurrences of each of these words across all text files.  You will need to "special case" your crawler to emit this histogram (a rough sketch of one way to do this appears after this list).

3. A writeup (strictly less than 1 page long) describing what you did and how you did it (e.g., what data structures you used and why), any extra features you implemented, and your own assessment of your implementation: any bugs you're aware of, limitations, improvements you wish you had time for but didn't get to, etc.
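For item 2, one way to pull out the 20 most popular words is to walk the hash table into a flat array and sort it by count. The rough sketch below reuses the illustrative word_entry, posting, buckets, and NUM_BUCKETS names from the earlier inverted-index example; print_histogram() is likewise a made-up name.

#include <stdio.h>
#include <stdlib.h>

/* Total occurrences of a word across all files: sum over its postings. */
static int total_count(const struct word_entry *e) {
    const struct posting *p;
    int total = 0;
    for (p = e->postings; p != NULL; p = p->next)
        total += p->count;
    return total;
}

/* qsort comparator: larger totals first. */
static int by_count_desc(const void *a, const void *b) {
    const struct word_entry *wa = *(const struct word_entry * const *)a;
    const struct word_entry *wb = *(const struct word_entry * const *)b;
    return total_count(wb) - total_count(wa);
}

/* Gather every entry from the chained hash table, sort by total count,
   and print the 20 most popular words with their counts. */
void print_histogram(void) {
    struct word_entry **all = NULL;
    struct word_entry *e;
    int n = 0, cap = 0, b, i;

    for (b = 0; b < NUM_BUCKETS; b++) {
        for (e = buckets[b]; e != NULL; e = e->next) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                all = realloc(all, cap * sizeof(*all));
            }
            all[n++] = e;
        }
    }
    qsort(all, n, sizeof(*all), by_count_desc);
    for (i = 0; i < n && i < 20; i++)
        printf("%s %d\n", all[i]->word, total_count(all[i]));
    free(all);
}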

You will use the turnin(1L) program to turn in the assignment. Run "turnin -c cse451 -p introproj [FILES]".  I suggest that you put all of the files you want to turn in into a directory, and then specify that directory to the "turnin" program. If all goes well, the following message should be displayed:

   Compressing submitted files... please wait
   Your files have been submitted to cse451, introproj for grading.