A Search Engine for Unix files

Overview

In this introductory project, you'll build a search engine for files in a UNIX file system.  In a nutshell, you will write a program that scans through a portion of a file system, looks for text files (files whose names end in .txt), reads in those files, and builds an inverted index of the words in those files. Using this inverted index, your program will prompt users for a word to search for, and then return a list of the scanned files that contain that word. Think of this as a simple Google for text files in a file system, as opposed to HTML documents on the web.

The goals of this project are to get you writing and debugging C code on the instructional Unix machines, to give you practice with the system calls used to navigate a Unix file system, and to have you build and use a simple data structure (the inverted index) on real file data.


The Project Labs

For this first project, you will be using the Instructional Linux Servers (ceylon.cs, fiji.cs, sumatra.cs, or tahiti.cs) to do your code development and debugging using the gcc C compiler. Each of you should already have an account on these machines. If you are not a CS/CSE major and need an account, see the course web site for information on how to obtain one.

What you will build

Search engines are, at their heart, remarkably simple programs. In this project, you'll be building a search engine for files in a UNIX file system. There are two essential components to every search engine, including yours: a crawler, which walks through the available documents (here, the .txt files under a directory tree) and extracts the words they contain, and an inverted index, which maps each word to the documents in which it appears.

Let's dive into each of these components a little deeper.

The Crawler

When launching your program, you will give it a command-line argument: the "root" of the directory subtree to crawl.  For example, you may invoke your program like:

[fiji.cs]$   ./textcrawler  foo/bar/baz

if you want it to start crawling at the foo/bar/baz directory.  The crawler must walk the directory tree rooted at the specified directory, recursively visiting every file and subdirectory.  You will be relying on various system calls and library routines to do your crawling, including opendir(3), readdir(3), and closedir(3) to enumerate directory entries, lstat(2) to distinguish regular files, directories, and symbolic links, and open(2)/read(2)/close(2) (or their stdio equivalents) to read the text files themselves.

You should watch out for the directory entries "." and "..", and not follow them: they refer to the current directory and the parent directory, respectively, so following them would keep the recursion from ever terminating. Symbolic links, whether to directories or to files, should be ignored as well; following them can cause files to be read more than once, or can lead to infinite recursion if a link points to an ancestor directory. For every file whose name ends in ".txt", the crawler should read through the file and collect all of its words, converting all characters to lower case and making sure that the "words" contain no punctuation.
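As a rough illustration only (not a required structure), the directory walk might look like the sketch below. The names crawl() and handle_txt_file() are made up for this example, and error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* handle_txt_file() is a hypothetical hook where a .txt file would be
   read and its words added to the inverted index. */
void handle_txt_file(const char *path);

void crawl(const char *path) {
    DIR *dir;
    struct dirent *entry;

    dir = opendir(path);
    if (dir == NULL)
        return;                              /* unreadable directory: skip it */

    while ((entry = readdir(dir)) != NULL) {
        char child[4096];
        struct stat sb;
        size_t len;

        /* Skip "." and ".." so the recursion terminates. */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        snprintf(child, sizeof(child), "%s/%s", path, entry->d_name);

        if (lstat(child, &sb) != 0)          /* lstat does not follow links */
            continue;
        if (S_ISLNK(sb.st_mode))
            continue;                        /* ignore symbolic links entirely */

        if (S_ISDIR(sb.st_mode)) {
            crawl(child);                    /* recurse into the subdirectory */
        } else if (S_ISREG(sb.st_mode)) {
            len = strlen(child);
            if (len > 4 && strcmp(child + len - 4, ".txt") == 0)
                handle_txt_file(child);      /* read words, update the index */
        }
    }
    closedir(dir);
}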

The Inverted Index

An inverted index is a hash table whose keys are the words pulled from the .txt files and whose values record information about the files that contained them. The hash table need not be fancy; if you use a chained hash table, you can handle any number of words without resizing (though not efficiently). One convenient way to hold the per-file information is a "file list" vector, with the hash table simply storing, for each word, the indices of the matching files in that vector.
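One way (of many) to structure this in C is sketched below. The struct and function names are illustrative only, and the hash function is just a simple placeholder.

#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 4096

/* One posting: the word occurred 'count' times in the file stored at
   position 'file_index' in the file-list vector. */
struct posting {
    int file_index;
    int count;
    struct posting *next;
};

/* One chained hash-table entry, holding a single word and its postings. */
struct word_entry {
    char *word;
    struct posting *postings;
    struct word_entry *next;          /* collision chain */
};

static struct word_entry *buckets[NUM_BUCKETS];

/* A simple multiplicative string hash; anything reasonable will do. */
static unsigned int hash_word(const char *word) {
    unsigned int h = 0;
    for (; *word != '\0'; word++)
        h = h * 31 + (unsigned char)*word;
    return h % NUM_BUCKETS;
}

/* Return the entry for a word, creating an empty one if it is new. */
struct word_entry *lookup_or_insert(const char *word) {
    unsigned int b = hash_word(word);
    struct word_entry *e;

    for (e = buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;

    e = malloc(sizeof(*e));            /* for a real program, check for NULL */
    e->word = strdup(word);
    e->postings = NULL;
    e->next = buckets[b];
    buckets[b] = e;
    return e;
}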

Putting it all together

Once the crawler has built the inverted index, you need to be able to search for any word that was found and indexed.  For each matching file, the search should report the document name, its size, the number of times the word occurred in the file, when it was last modified, and the first few words from the file.  The following is an example session for the textcrawler:

fiji:~$ ./textcrawler ./txtfiles
... crawling ...
... done ...

Search term: fox
   (1) ./txtfiles/singleton.txt
        Size: 246 bytes
        Occurrences: 1
        Last modified: Fri Mar 28 10:56:32 2003
        "such the fox jumped memphis..."

   (3) ./txtfiles/gutenburg/hamlet.txt
        Size: 193043 bytes
        Occurrences: 1
        Last modified: Sat Mar 29 07:08:53 2003
        "Project Gutenberg Etext of Hamlet..."

Search term: foo
"foo" not found.

Compiling your project

We've put together a "tar" archive containing a sample collection of text files in a directory tree, and also an example C program and "Makefile".  You can grab this tar archive by logging into one of the instructional Unix machines, and running the command:

   cp /projects/cse/courses/cse451/03sp/projects/intro-project.tar.gz .

Type "tar -xvzf intro-project.tar.gz" to extract the archive. Move into the extracted directory with "cd intro-project". The project can be complied with "make" or "make textcrawler".

Deliverables

You must provide us with the following:

1. The C source code for the crawler, and a Makefile that builds it.  We will be compiling your crawler and running it against our own private test cases.

2. A text file containing a histogram that shows the 20 most popular words found by the crawler, along with the total number of occurrences of each of these words across all text files.  You will need to "special case" your crawler to emit this histogram (a rough sketch of one way to do this appears after this list).

3. A writeup (strictly less than 1 page long) describing what you did and how you did it (e.g., what data structures you used and why), any extra features you implemented, and your own assessment of your implementation: any bugs you're aware of, limitations, improvements you wish you had time for but didn't get to, etc.
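For item 2, one way to pull out the 20 most popular words is to walk the hash table into a flat array and sort it by count. The rough sketch below reuses the illustrative word_entry, posting, buckets, and NUM_BUCKETS names from the earlier inverted-index example; print_histogram() is likewise a made-up name.

#include <stdio.h>
#include <stdlib.h>

/* Total occurrences of a word across all files: sum over its postings. */
static int total_count(const struct word_entry *e) {
    const struct posting *p;
    int total = 0;
    for (p = e->postings; p != NULL; p = p->next)
        total += p->count;
    return total;
}

/* qsort comparator: larger totals first. */
static int by_count_desc(const void *a, const void *b) {
    const struct word_entry *wa = *(const struct word_entry * const *)a;
    const struct word_entry *wb = *(const struct word_entry * const *)b;
    return total_count(wb) - total_count(wa);
}

/* Gather every entry from the chained hash table, sort by total count,
   and print the 20 most popular words with their counts. */
void print_histogram(void) {
    struct word_entry **all = NULL;
    struct word_entry *e;
    int n = 0, cap = 0, b, i;

    for (b = 0; b < NUM_BUCKETS; b++) {
        for (e = buckets[b]; e != NULL; e = e->next) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                all = realloc(all, cap * sizeof(*all));
            }
            all[n++] = e;
        }
    }
    qsort(all, n, sizeof(*all), by_count_desc);
    for (i = 0; i < n && i < 20; i++)
        printf("%s %d\n", all[i]->word, total_count(all[i]));
    free(all);
}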

You will use the turnin(1L) program to turn in the assignment. Run "turnin -c cse451 -p introproj [FILES]".  I suggest that you put all of the files you want to turn in into a directory, and then specify that directory to the "turnin" program. If all goes well, the following message should be displayed:

   Compressing submitted files... please wait
   Your files have been submitted to cse451, introproj for grading.