University of Washington Department of Computer Science & Engineering
CSE490i Project Part 2: Inverted Indices and Simple Search

Administrivia

Due Date: Tuesday, Feb 12, 12:00 noon.

Caution: Be sure to think your design through to project parts 3 and 4 before you commit to code in this part!

Objective: Add an indexing component to your crawler so that it can handle simple queries. Advanced search features like phrase search and Boolean processing will be next steps, so think them through before starting. Also keep in mind that you will want to incorporate PageRank or hypertext analysis in addition to traditional information retrieval ranking methods such as TF/IDF analysis. This may mean that you need to store additional web statistics or data about the pages (e.g. font information, anchor text, etc.) you collect. Think this through ahead of time.

Reading: Be sure to (re)read the Mercator and Google papers before you start.

Groups & Collaboration: Same as part 1.

Inverted Files

The inverted index consists of two parts: a lexicon file, which contains the list of words, and a (set of) occurrence file(s), which contain the actual occurrence information for each word.

The lexicon file includes, for each word, the number of documents the word occurs in and a pointer to the word's occurrence list.

The occurrence file tells where each occurrence of a word is. The data structure might include information such as how many times the word occurs in the document, where those occurrences are, and ranking information such as whether an occurrence is part of a title or anchor text (so-called "fancy" text attributes). The title/anchor information might also be stored in separate files.
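
For concreteness, here is one minimal sketch (in Python, with illustrative field names; this is not a required format) of how a lexicon and an occurrence file might fit together:

    # Sketch: the lexicon maps each word to its document count and the byte
    # offset of its occurrence list; the occurrence file holds the per-document
    # occurrence data itself.
    from collections import namedtuple

    LexiconEntry = namedtuple("LexiconEntry", ["doc_count", "occurrence_offset"])

    lexicon = {}                                   # word -> LexiconEntry
    occurrence_file = open("occurrences.txt", "wb")

    def write_occurrences(word, postings):
        """postings: list of (doc_id, positions) pairs, positions sorted."""
        offset = occurrence_file.tell()
        for doc_id, positions in postings:
            line = "%d %d %s\n" % (doc_id, len(positions),
                                   " ".join(str(p) for p in positions))
            occurrence_file.write(line.encode("ascii"))
        lexicon[word] = LexiconEntry(len(postings), offset)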

There are at least two models for the occurrence files. The "AltaVista" model treats the entire web as a single address space, with each word occupying a separate location (plus a few special "words" that mark important locations such as the beginnings and ends of documents, titles, and links). The occurrence file for a given word is simply a list of locations. For efficiency's sake, these can be stored as location deltas and compressed.
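
A sketch of the delta-plus-compression idea, using variable-byte coding (one common choice; the assignment leaves the actual encoding up to you):

    def encode_locations(locations):
        """Delta-encode an ascending list of word locations and pack each
        delta into a variable number of bytes: 7 data bits per byte, with
        the high bit set on the last byte of each delta. A sketch only."""
        out = bytearray()
        previous = 0
        for loc in locations:
            delta = loc - previous
            previous = loc
            groups = []
            while True:                       # split into 7-bit groups, low group first
                groups.append(delta & 0x7F)
                delta >>= 7
                if delta == 0:
                    break
            for group in reversed(groups[1:]):
                out.append(group)             # leading bytes: high bit clear
            out.append(groups[0] | 0x80)      # last byte of the delta: high bit set
        return bytes(out)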

The "Google" model stores occurrence information differently. It records information about the "fanciness", size, capitalization and other features of the word along with the position. This makes for a more complex indexing process, but faster searching with quicker retrieval of high-ranking hits.

In both cases, the crawler and the indexer are somewhat disconnected. This allows for either component to continue when the other is stopped. The construction of the index is also split up into smaller parts which may then be merged. Each "bucket" (or barrel) contains the index for a subset of the lexicon. This bucket method allows for parallelization of the indexing and search processes. In AltaVista, the indexing process is also split up along the time dimension into separate "tiers" which represent subsequent crawls. This allows for ongoing indexing and easy merging of new results into old.
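
If each bucket is written out already sorted by word ID, the merge step is a standard k-way merge; a sketch (the tuple layout is illustrative):

    import heapq

    def merge_buckets(*buckets):
        """Merge several partial indexes, each an iterable of
        (word_id, doc_id, positions) tuples already sorted by word_id."""
        # heapq.merge streams the combined index in sorted order without
        # loading whole buckets into memory.
        return heapq.merge(*buckets)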

Compression of the occurrence files is important for scalability. You should think about how to store occurrence information, carefully trading off speed and compression. In addition to compressing the occurrence file, you can reduce the size of the lexicon by stemming words. You may wish to incorporate Porter's stemming algorithm.
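
If you go the stemming route, an off-the-shelf Porter implementation is easiest; the sketch below assumes NLTK's PorterStemmer (our choice for illustration, not a course requirement):

    from nltk.stem.porter import PorterStemmer   # assumes the NLTK package is installed

    stemmer = PorterStemmer()

    def lexicon_key(token):
        """Fold a token onto the lexicon entry shared by its inflected forms."""
        return stemmer.stem(token.lower())

    # "index", "indexes", and "indexing" all map to the same key, shrinking
    # the lexicon at the cost of slightly coarser matching.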

For a variety of reasons, you will want to build a repository of the documents you have crawled.

Query Processing

As described above, the AltaVista and Google occurrence files result in very different mechanisms for query processing. The structure of Google's occurrence file is such that searching for a word simply requires stepping through that word's occurrence list. Each entry includes a document ID, so building a list of documents matching the query simply requires finding all the document IDs contained in all the query words' occurrence lists. (The default search is a conjunction (and) of all words.)
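
Under the Google model, then, a multi-word query boils down to intersecting the sorted document-ID lists drawn from each query word's occurrence list; a minimal sketch:

    def intersect_sorted(a, b):
        """Intersect two ascending lists of document IDs in one linear pass."""
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def conjunctive_match(doc_id_lists):
        """Documents containing every query word (the default 'and' semantics)."""
        result = doc_id_lists[0]
        for ids in doc_id_lists[1:]:
            result = intersect_sorted(result, ids)
        return result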

AltaVista's model requires a more complex query processing system. It uses the notion of an Index Stream Reader (ISR). An ISR is an object that reads through a particular word's occurrence list. It implements the following methods:

loc: Returns the word location for the current occurrence.
prev: Returns the word location for the previous occurrence.
next: Moves to the next occurrence.
seek(loc): Moves to the first occurrence after loc.

Given this set of operations, one can build Boolean queries by combining the results of multiple primitive ISRs. A simple disjunctive (or) query can be satisfied by merging the results of two primitive ISRs. An "and" query requires a somewhat more complex process whereby the ISRs for the constituent words are combined with the ISR for the special END_DOCUMENT keyword to guarantee that the locations which are returned all come from the same document. This is done with a constraint satisfaction engine which basically requires that the locations for all the words are between END_DOCUMENT.prev and END_DOCUMENT.loc.
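
A sketch of the idea, with the ISR realized over an in-memory sorted location list and an "and" query driven by the END_DOCUMENT reader (the method names follow the list above; everything else is illustrative):

    class ISR:
        """Index Stream Reader over one word's ascending location list."""
        def __init__(self, locations):
            self.locations = locations
            self.i = 0

        def at_end(self):
            return self.i >= len(self.locations)

        def loc(self):
            return self.locations[self.i]                      # current occurrence

        def prev(self):
            return self.locations[self.i - 1] if self.i > 0 else -1

        def next(self):
            self.i += 1

        def seek(self, loc):
            # advance to the first occurrence strictly after loc
            while not self.at_end() and self.locations[self.i] <= loc:
                self.i += 1

    def and_query(word_isrs, end_doc_isr):
        """Yield the END_DOCUMENT location of every document containing
        all of the query words."""
        while all(not isr.at_end() for isr in word_isrs):
            furthest = max(isr.loc() for isr in word_isrs)
            end_doc_isr.seek(furthest - 1)       # END_DOCUMENT at or after the furthest word
            if end_doc_isr.at_end():
                return
            doc_start = end_doc_isr.prev()       # previous document's end marker
            for isr in word_isrs:
                isr.seek(doc_start)              # pull every word into this document's range
            if all(not isr.at_end() and isr.loc() <= end_doc_isr.loc()
                   for isr in word_isrs):
                yield end_doc_isr.loc()          # all locations lie between prev and loc
                for isr in word_isrs:
                    isr.seek(end_doc_isr.loc())  # move every reader past this document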

Project Specifics

Your assignment is:

  • Continue to improve the robustness and scale of your crawler.
  • Create an indexer which will parse and index the pages visited by your crawler.
  • Build a simple web interface which processes simple one word queries.
  • * Add stemming to the indexer
  • * Snippets - Display some text (a title, first 20 words of document, whatever) in addition to URLs returned by search
  • * Extend to multiple word search
  • ** Extend to general boolean search
  • ** Advanced snippets - display text snippet containing search term(s)
  • ** Feel free to propose your own enhancements.
Items with *'s are extra credit; more stars mean more credit.

What to Hand In

Hand in the URL of a top-level web page that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and the way it is described. Make sure your explanations are clear and your experimental results are presented in a way that is easy to understand. At a minimum the web page(s) should explain:

  1. The roles of each team member.
  2. Any improvements to your crawler.
  3. Your indexing algorithm.
  4. A discussion of your approach to compression (of the repository, of the occurrence lists, and whatever else).
  5. Your implementation of single-keyword search.
  6. Your plan for handling multiword and Boolean queries (even if this is deferred until next time).
  7. Implementation of bells and whistles (if any).
  8. Statistics on the number of distinct pages and words found, plus timing information on the performance of your crawler, indexer, and simple query processor.
  9. A pointer to a log file listing all the URLs you have indexed, in ASCII format, one URL per line. (We'll use this for testing purposes.)

Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

  • Instructions, hints and troubleshooting ideas (to be added as we make progress)

Good luck!

