|
CSE Home | About Us | Search | Contact Info |
|
AdministriviaDue Date: (Part two) Tuesday, Oct 29, 12:00 noon. Due Date: (Part three) Tuesday, Nov 12, 12:00 noon. Objective: Add an indexing component to your crawler so that it can handle simple queries. Advanced search features, phrase search, and Boolean processing will be next steps - so think them through before starting. Also keep in mind that you will want to incorporate PageRank or hypertext analysis in addition to traditional information retrieval ranking methods such as TF/IDF analysis. This may mean that you need to store additional web statistics or data about the pages (e.g. font information, anchor text, etc.) you collect. Think this through ahead of time. Reading: Be sure to (re)read the Google paper before you start. Groups & Collaboration: Form teams of two for this part and the rest of the project (if there is an odd number, then there may be one group of three). Special NoteIf a group wants to do a different project than the search-engine suggested here, that's fine, but talk to Tal or Dan soon. And for the part 2 deliverable, hand in a description of your proposed project including the milestones that you will reach by the part 3, 4, 5 due dates.Inverted FilesThe inverted index consists of two parts. A lexicon file, which contains a list of words; and a (set of) occurrence file(s), which contain the actual occurrence information for the word. The lexicon file includes, for each word, the number of documents the word occurs in and a pointer to the word's occurrence list. The occurrence file tells where each occurrence of a word is. The data structure might include information such as how many times the word occurs in the document, where those occurrences are, and ranking information such as if an occurrence is part of a title or anchor text (so called, "fancy" text attributes.) The title/anchor information might also be stored in separate files. There are at least two models for the occurrence files. The "AltaVista" model treats the entire web as a single address space, with each word occupying a separate location (plus a few special "words" that mark important locations such as the beginning and end of documents, titles and links.) The occurrence file for a given word is simply a list of locations. For efficiency sake, these can be stored as location deltas and compressed. The "Google" model stores occurrence information differently. It records information about the "fanciness", size, capitalization and other features of the word along with the position. This makes for a more complex indexing process, but faster searching with quicker retrieval of high-ranking hits. In both cases, the crawler and the indexer are somewhat disconnected. This allows for either component to continue when the other is stopped. The construction of the index is also split up into smaller parts which may then be merged. Each "bucket" (or barrel) contains the index for a subset of the lexicon. This bucket method allows for parallelization of the indexing and search processes. In AltaVista, the indexing process is also split up along the time dimension into separate "tiers" which represent subsequent crawls. This allows for ongoing indexing and easy merging of new results into old. Compression of the occurrence files is important for scalability. You should think about how to store occurrence information, carefully trading off speed and compression. In addition to compressing the occurrence file, you can reduce the size of the lexicon by stemming words. You may wish to incorporate Porter's stemming algorithm. For a variety of reasons, you will want to build a repository of the documents you have crawled. Query ProcessingAs described above, the AltaVista and Google occurrence file result in very different mechanisms for query processing. The structure of Google's occurrence file is such that searching for a word simply requires stepping through that word's occurrence list. Each entry includes a document ID, so building a list of documents matching the query simply requires finding all the document IDs contained in all the query words' occurrence lists. (The default search is a conjunction (and) of all words.)AltaVista's model requires a more complex query processing system. It uses the notion of an Index Stream Reader (ISR). An ISR is an object that reads through a particular word's occurrence list. It implements the following methods:
Given this set of operations, one can build Boolean queries by combining the results of multiple primative ISRs. A simple disjunctive (or) query can be satisfied by merging the results of two primitive ISRs. An "and" query requires a somewhat more complex process whereby the ISRs for the constiuent words are combined with the ISR for the special END_DOCUMENT keyword to guarantee that locations which are returned all come from the same document. This is done with a constraint satisfaction engine which basically requires that the locations for all the words are between END_DOCUMENT.prev and END_DOCUMENT.loc. Project SpecificsYour assignment is:
What to Hand InPart Two (Due Oct 29) Part Three (Due Nov 12) Remember, you will be graded both on the quality of the artifact you have built and the way it is described. Make sure your explanations are clear and experimental results presented in a way that is easy to understand. At a minimum the following should be dexplained:
Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment. Additional Useful Pointers
Good luck! |
Department of Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX [comments to weld] |