Steam-powered Turing Machine University of Washington Department of Computer Science & Engineering
 CSE454 Project Part 4: Enhancing the Search Capabilities
  CSE Home   About Us    Search    Contact Info 

Administrivia
 Home
 Using course email
 Email archive
 Policies
Content
 Overview
 Resources
 Lecture slides
Assignments
 Reading
 Project
   

Administrivia

Due Date: Tuesday, Nov 26, 12:00 noon.

Project Specifics

By the next due date you should have a web interface that supports multiple word queries and has indexed a relatively large number of pages (100,000+ would be nice).

One relatively easy way to create a web interface is by using PHP as a front end that opens a socket to a process/server that you can write in Java that processes the queries and returns the results. Here is information on how to do this. Of course feel free to do this in another way.

Your assignment over the remainder of the course is to enhance your search engine. What you do is up to you, but some ideas are:

  • Complex Boolean queries
  • Sophisticated IR ranking
  • Laten Semantic Indexing
  • Anchor text analysis for improved ranking
  • PageRank calculations and integration into ranking
  • A sophisticated web front end (graphic visualization, personalization, or whatever)
  • Classification of search engine results
  • Clustering of results
  • Scaleup and / or improved speed / efficiency
  • Image, audio, or other specialized search
  • Search for webcams, home pages, or another specific class of pages
  • Extract email addresses, phone numbers, class times, or other info from pages (and specialized search)
  • Whatever other cool ideas you can think of
Some links that may be helpful for clustering/classification and learning:
http://www-2.cs.cmu.edu/~mccallum/bow/
http://www.cse.unsw.edu.au/~quinlan/

The key deadline to consider is the last day of class (for the final writeup) and the final week (for your presentations), but we also want a checkpoint to be delivered on 11/26 so we can track progress.

What to Hand In

Hand in the URL of a top-level web page (don't overwrite pages already turned in) that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and the way it is described. At a minimum the web page(s) should explain:

  1. Instructions on how to test your project. This should include how to start the server if necessary, and the URL of the web interface.
  2. The overall plan for your final project deliverable
  3. The anticipated roles for each team member
  4. A brief (less than a page) status report

Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

  • Instructions, hints and troubleshooting ideas (to be added as we make progress)
A number of groups (most, hopefully) are implementing PageRank. Recall that this involves a number of matrix multiplications where the matricies are connectivity tables for the pages in question. Thus if you've crawled 100,000 pages, the naive matrix representation will have 10^10 entries - multiplying two such matrices might take a while :-)

Fortunately, most entries are zero since most pages point to ~10 other pages, not 100,000. Thus you'll probably want to use a sparse representation. Some papers on the topic:

This one describes fortran and c libraries for doing multiplication of sparse matricies. Since you'll probably want to run pagerank offline anyway, implementing in a different language than your crawler shouldn't be a problem. The code (SMMP) is available for download here.

This paper has a short explanation on sparse matrix multiplcation, but focusses on parallel algorithms.

You can probably find more stuff by searching the web.

Good luck!


CSE logo Department of Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA  98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to weld]