CSE 373 Programming Project 1
The Problem
Information retrieval systems allow users to enter keywords
and retrieve articles that have those keywords associated
with them. For example, my student Yi Li has just written
a paper called, "Object Class Recognition using Images of Abstract
Regions," and he included the following keywords: 'object recognition',
'abstract regions', 'mixture models', and 'EM algorithm'. If someone does
a search for all articles about the EM algorithm, then his paper
(and many others) will be retrieved.
Of course, the retrieval system should not have to search through
millions of papers every time someone enters a key word. So they
usually use data structures to make the retrievals efficient.
The most common structure is the inverted index. The
key words are all kept in this index, and each one points to
a list of all the articles that have that key word. The index
itself can have various structures, but we will just use an
array in which the key words appear in alphabetical order
(and we'll use key words that have no spaces in them).
Our structure will be an array of keyword objects. Each element
will have a keyword and a list pointer. They array is kept
in ascending order according to the keywords. e.g.
- database
- EM-algorithm
- image-retrieval
- matching
- recognition
The list of all papers about 'database' is pointed to by
the list pointer of the element holding 'database'.
(See Details below.)
Your Job
You will write a program with two parts:
- The first part reads in the IDs, titles, authors, and keywords
for a set of articles
and creates the inverted index. You should use the
binary search algorithm given in Chapter 2 of your text to find the
right position in the array for a given keyword. Then if the keyword
is already in the array, you merely add a new element for the new
article to its list (at the beginning, since it's O(1) time).
If, however, the keyword is not already in the index, you have
to insert it. Since the list is in an array, you will have to
move down all the elements below it by one position. (When we get to
more advanced structures, insertion will be more efficient.)
- The second part reads a given list of keywords and for each one,
finds all the articles with that key word and prints
the information about each one in the format we give you
(so we can grade them easily).
Since you will again use binary search to find the keywords, the
search part will be O(log N) if there are N keywords in the index.
The printing part for one such keyword will be O(K) if there are
K articles with that keyword, because you must go through the
entire list to print the elements.
Details
The following picture shows the structures with which you will work.
The index is an array Keywords, each element of which is
an object of type KeyWord. The array has some number MAX_KEYWORD_COUNT
of elements (use 500). The variable currentCount gives the current
number of elements, which is 3 in the picture ('data', 'image',
and 'search'). Each KeyWord object has two
fields: keyword and records. The keyword field contains the keyword
and the records field contains a pointer to (eventually) all articles
having that keyword. In the picture, the keyword 'data' has an associated
pointer to a list of 2 articles.
The information about an article is kept in a Record object,
which has four parts: id, title, author, and next.
We will give you a class called FileData that allows you to
open a file, read the information for each article (its id, title, author and
keywords), and close the file. You will develop the class Index
with the methods to create and populate the index and the class AnswerQueries
with the methods to query the index with a given keyword and print the
keyword plus the information for each associated article.
Index
This class creates the index structure. It must include the array Keywords,
the constant MAX, and the variable currentCount. In a loop, it should
- Read the information for the next article from the file. (If the
end of file is reached, this part is done.)
- Create a Record for the article.
- For each keyword associated with that article:
- Search for the keyword (by binary search) in Keywords.
- If it is there, add this Record to the associated list of articles.
- If it is not there, add the keyword to the array Keywords and
make its records pointer point to the new Record.
Here are two examples of articles:
46359
A Content-Based Retrieval System for Medical Images
John Anderson
4
database
image-retrieval
content-based
medical
37503
The Virage Search Engine
James Bach
3
image-retrieval
search
image-management
AnswerQueries
This class reads keywords from the query file and queries the index
for each one. It should use the same binary search that Index
uses to locate the keyword in the index. If the keyword is not in
the index, it will print an error message. If it finds the keyword
it will print the keyword followed by the id, title, and author of
each article in the list for that keyword.
Code Download
The code base can be downloaded for C++ or Java depending upon
the language you prefer. Along with the code base you will be provided
a sample data file and a sample query file. The programs already
contain routines to read these files in their respective formats.
Please note that the output generated by your program must adhere
to the format specified in the sample output file. (Make sure
you turn off all your debugging statements at the time of final
turnin.)
If your browser just dumps ascii characters on the screen when
you try to download these files, try using the right click, followed
by the Save As option.
Code Base (Linux Users):
JAVA (contains 1 .java file)
C++ (contains 2 .h files, 1 .cpp file, 1 text file)
Code Base (Windows Users):
JAVA (contains 1 .java file)
C++ (contains 2 .h files, 1 .cpp file, 1 text file)
Additional Files:
Data File (Download)
Query File (Download)
Sample Output (Download)
Turnin
Please refer to turnin instructions here.