CSE 326
Project 2

Code due: February 19, 11:00 PM
Report due: February 20, beginning of class

Overview

In this project you will be implementing a very basic search engine. The program begins by parsing a collection of text files. Then the user can give single-word queries, and the program returns the names of the files that contain the query word. The “front-end”, which handles all the file I/O and parsing, is written for you.  Your job is to implement the “back-end” data structures. The back-end is a Dictionary ADT that maps each word to a list of files containing that word. A quick and dirty implementation based on an unsorted array is included. You will add the following implementations of the Dictionary ADT:

 

  1. an unbalanced binary search tree

 

  1. an AVL tree

 

  1. a splay tree

 

After you implement them, you will collect data on the performance of these three kinds of trees.  As in Project 1, you will work in pairs.

Provided Code

C Code

Download the skeleton code here. 

To compile the provided code, run:

      gcc –o search *.c

When executing search, the first argument to the program is the data structure to use.  This can be -array, -unbalTree, -avlTree, or -splayTree.  All of the remaining arguments are text files to include in the dictionary.  For example, without modifying your code at all you should be able to compile then run:

      search –array *.c

to build a dictionary from the code itself.

After parsing the text files to be included in the dictionary, the program waits for user input.  To perform a search, enter a single word then hit return.  To exit, enter the EOF character (Ctrl-D).  You can perform many queries at once by piping a text file to STDIN:

      search –array dictionaryfiles < queryfile

The front-end code is in search.c. The important methods are main(), which parses the command arguments and handles user input, and parseFile() which reads a file into the dictionary. For each word in the file, parseFile() inserts the word as a key, with the file name as a value. To find all files that contained a given word, you just do a find on that word. To remain implementation independent, the functions newDictionary(), insert(), and find() serve as wrappers for the particular implementation. The unsorted array implementation is in array.c. Stubs for the three tree types are in the other three files. The dictionary is slightly complicated in that the value stored is a list of strings (e.g. the names of files that contained the key word.) When doing an insert you don’t want to overwrite the value, you want to add a name to the list, taking care not to allow duplicates. The list is implemented as a basic linked list using the StringList structure. See array insert() for an example.

Java Code

Download the skeleton code here.

To compile the provided code, run:

javac Search.java -Xlint

When executing search, the first argument to the program is the data structure to use.  This can be -array, -unbalTree, -avlTree, or –splayTree. You can also use –testArray, -testSplayTree, -testAvlTree, -testUnbalTree to run the test functions in the various classes. Note that these command options should be the first in the list. Second command option can be –f <query_file>. If –f option is given then the code will perform queries on each of the words in query file. All of the remaining arguments are text files to include in the dictionary.  For example, without modifying your code at all you should be able to compile then run:

      java Search –array *.java

      java Search –array –f myqueryfile inp1.txt inp2.txt

to build a dictionary from the code itself. Note that if the 1st option is –test* then all the other command options will be ignored.

If –f option is not given then after parsing the text files to be included in the dictionary, the program waits for user input.  To perform a search, enter a single word then hit return.  To exit, enter the EOF character (Ctrl-D). 

The main classes in the code are Search and Engine. The important methods are main(), which parses the command options/arguements and creates an engine which is used to parse files using Engine.parseFile() which reads a file into the dictionary. For each word in the file, parseFile() inserts the word as a key, with the file name as a value. To find all files that contained a given word, you just do a find on that word. The unsorted array implementation is in ArrayDictionary.java. Stubs for the three tree types are in the other three files. The dictionary is slightly complicated in that the value stored is a list of strings (e.g. the names of files that contained the key word.) When doing an insert you don’t want to overwrite the value, you want to add a name to the list, taking care not to allow duplicates. The list is implemented in the class StringList class. See ArrayDictionary.insert() for an example. We also provide the function for you to get the current time in class Search. The difference between C and Java CurrentTime() is that C returns int and Java returns long.

Requirements

1.      Fill in the constructor/new() (Java/C), insert(), and find() functions for each of the three tree classes/types. You will need to define a node classes/structure for each of the tree types, for example UnbalTreeNode class in java. See array.c for an example of defining a structure in C. See ArrayDictionary.java for the example in java case. The code for unbalanced binary search trees and AVL trees can be found in Weiss. You are free to use this code if you like.

You may want to fill in the methods to test each of your data structures, to aid in debugging. Writing a print routine and printing the data structure after a few insertions is a good way to do this.

2.      Evaluate the performance of each of your data structures on the provided text files.  We’ve given you 5 text files.  Create a dictionary using all of the files, then use each one in turn as the query file.  Record the following for AVL trees and splay trees (unbalanced trees may take a prohibitive amount of time):

·        total insertion time

·        total query time (one number for each of the 5 files)

·        depth of queried node (record the max and mean and standard deviation over all queries for each file)

·        number of rotations performed per query (again, record the max and mean and standard deviation)

You'll need to modify your code to include counters for search depth and number of rotations. A double rotation counts as two rotations.

Download the collection of text files here.

 

Hints

·        Memory management in C is tricky, especially if you’re used to languages with automatic memory management like Java. Pay very close attention to how you use pointers. One example: when a key is passed to the insert method, the string needs to be copied, not just the pointer. This is because the string is a local variable in the calling method, and won’t exist once the method returns. This is also true for the file names, but we can be sloppy here and just copy the pointer, since we know the file names came from the command line arguments and aren’t going to go away. You should understand the details of the array insert method before writing your own code.

·        The unbalanced binary tree is the easiest tree. Get it working completely before you start on the others. Your code will be somewhat different, but if it’s looking significantly more complicated than the code in Weiss, you’re probably doing something wrong.

·        The three tree types share a fair amount of code.  Using pair programming may prevent you and your partner from writing the same code separately.

·        You can write your splay tree with or without parent pointers.  It’s up to you.  Parent pointers will probably make coding a bit easier, though. You can also implement top-down splay trees as described in Chapter 12 of Weiss.