CSE 373 Programming Project 1

The Problem

Information retrieval systems allow users to enter keywords and retrieve articles that have those keywords associated with them. For example, my student Yi Li has just written a paper called, "Object Class Recognition using Images of Abstract Regions," and he included the following keywords: 'object recognition', 'abstract regions', 'mixture models', and 'EM algorithm'. If someone does a search for all articles about the EM algorithm, then his paper (and many others) will be retrieved.

Of course, the retrieval system should not have to search through millions of papers every time someone enters a key word. So they usually use data structures to make the retrievals efficient. The most common structure is the inverted index. The key words are all kept in this index, and each one points to a list of all the articles that have that key word. The index itself can have various structures, but we will just use an array in which the key words appear in alphabetical order (and we'll use key words that have no spaces in them).

Our structure will be an array of keyword objects. Each element will have a keyword and a list pointer. They array is kept in ascending order according to the keywords. e.g.

The list of all papers about 'database' is pointed to by the list pointer of the element holding 'database'. (See Details below.)

Your Job

You will write a program with two parts:
  1. The first part reads in the IDs, titles, authors, and keywords for a set of articles and creates the inverted index. You should use the binary search algorithm given in Chapter 2 of your text to find the right position in the array for a given keyword. Then if the keyword is already in the array, you merely add a new element for the new article to its list (at the beginning, since it's O(1) time). If, however, the keyword is not already in the index, you have to insert it. Since the list is in an array, you will have to move down all the elements below it by one position. (When we get to more advanced structures, insertion will be more efficient.)

  2. The second part reads a given list of keywords and for each one, finds all the articles with that key word and prints the information about each one in the format we give you (so we can grade them easily). Since you will again use binary search to find the keywords, the search part will be O(log N) if there are N keywords in the index. The printing part for one such keyword will be O(K) if there are K articles with that keyword, because you must go through the entire list to print the elements.

Details

The following picture shows the structures with which you will work.

The index is an array Keywords, each element of which is an object of type KeyWord. The array has some number MAX_KEYWORD_COUNT of elements (use 500). The variable currentCount gives the current number of elements, which is 3 in the picture ('data', 'image', and 'search'). Each KeyWord object has two fields: keyword and records. The keyword field contains the keyword and the records field contains a pointer to (eventually) all articles having that keyword. In the picture, the keyword 'data' has an associated pointer to a list of 2 articles. The information about an article is kept in a Record object, which has four parts: id, title, author, and next.

We will give you a class called FileData that allows you to open a file, read the information for each article (its id, title, author and keywords), and close the file. You will develop the class Index with the methods to create and populate the index and the class AnswerQueries with the methods to query the index with a given keyword and print the keyword plus the information for each associated article.

Index

This class creates the index structure. It must include the array Keywords, the constant MAX, and the variable currentCount. In a loop, it should

  1. Read the information for the next article from the file. (If the end of file is reached, this part is done.)

  2. Create a Record for the article.

  3. For each keyword associated with that article:

Here are two examples of articles:

46359
A Content-Based Retrieval System for Medical Images
John Anderson
4
database
image-retrieval
content-based
medical

37503
The Virage Search Engine
James Bach
3
image-retrieval
search
image-management

AnswerQueries

This class reads keywords from the query file and queries the index for each one. It should use the same binary search that Index uses to locate the keyword in the index. If the keyword is not in the index, it will print an error message. If it finds the keyword it will print the keyword followed by the id, title, and author of each article in the list for that keyword.

Code Download

The code base can be downloaded for C++ or Java depending upon the language you prefer. Along with the code base you will be provided a sample data file and a sample query file. The programs already contain routines to read these files in their respective formats.
Please note that the output generated by your program must adhere to the format specified in the sample output file. (Make sure you turn off all your debugging statements at the time of final turnin.)
If your browser just dumps ascii characters on the screen when you try to download these files, try using the right click, followed by the Save As option.
Code Base (Linux Users):
JAVA (contains 1 .java file)
C++ (contains 2 .h files, 1 .cpp file, 1 text file)
Code Base (Windows Users):
JAVA (contains 1 .java file)
C++ (contains 2 .h files, 1 .cpp file, 1 text file)
Additional Files:
Data File (Download)
Query File (Download)
Sample Output (Download)

Turnin

Please refer to turnin instructions here.