Steam-powered Turing Machine University of Washington Department of Computer Science & Engineering
 CSE490i Project Part 3: Enhancing the Search Capabilities
  CSE Home   About Us    Search    Contact Info 

Administrivia
 Home
 Using course email
 Email archive
 Policies
Content
 Overview
 Resources
 Lecture slides
Assignments
 Reading
 Project
   

Administrivia

Due Date: Tuesday, Feb 26, 12:00 noon.

Project Specifics

Your assignment over the remainder of the course is to enhance your search engine. What you do is up to you, but some ideas are:

  • Complex Boolean queries
  • Sophisticated IR ranking
  • Laten Semantic Indexing
  • Anchor text analysis for improved ranking
  • PageRank calculations and integration into ranking
  • A sophisticated web front end (graphic visualization, personalization, or whatever)
  • Classification of search engine results
  • Clustering of results
  • Scaleup and / or improved speed / efficiency
  • Image, audio, or other specialized search
  • Whatever other cool ideas you can think of
The key deadline to consider is the last day of class (for the final writeup) and the final week (for your presentations), but we also want a checkpoint to be delivered on 2/26 so we can track progress.

In the meantime, as announced in class, I will be meeting with every group. I'd like all members to attend the meeting and I'd like to hear your thoughts on 1) what you'll do, and 2) the roles of the various team members.

What to Hand In

Hand in the URL of a top-level web page (don't overwrite pages already turned in) that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and the way it is described. At a minimum the web page(s) should explain:

  1. The overall plan for your final project deliverable
  2. The anticipated roles for each team member
  3. A brief (less than a page) status report
  4. A snapshot of your code

Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

  • Instructions, hints and troubleshooting ideas (to be added as we make progress)
A number of groups (most, hopefully) are implementing PageRank. Recall that this involves a number of matrix multiplications where the matricies are connectivity tables for the pages in question. Thus if you've crawled 100,000 pages, the naive matrix representation will have 10^10 entries - multiplying two such matrices might take a while :-)

Fortunately, most entries are zero since most pages point to ~10 other pages, not 100,000. Thus you'll probably want to use a sparse representation. Some papers on the topic:

This one describes fortran and c libraries for doing multiplication of sparse matricies. Since you'll probably want to run pagerank offline anyway, implementing in a different language than your crawler shouldn't be a problem. The code (SMMP) is available for download here.

This paper has a short explanation on sparse matrix multiplcation, but focusses on parallel algorithms.

You can probably find more stuff by searching the web.

Good luck!


CSE logo Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to weld]