Administrivia

Due Date: Tuesday, Nov 26, 12:00 noon.

Project Specifics

By the next due date you should have a web interface that supports multiple word queries and has indexed a relatively large number of pages (100,000+ would be nice). One relatively easy way to create a web interface is by using PHP as a front end that opens a socket to a process/server that you can write in Java that processes the queries and returns the results. Here is information on how to do this. Of course feel free to do this in another way.

Your assignment over the remainder of the course is to enhance your search engine. What you do is up to you, but some ideas are:
http://www-2.cs.cmu.edu/~mccallum/bow/
http://www.cse.unsw.edu.au/~quinlan/

The key deadline to consider is the last day of class (for the final writeup) and the final week (for your presentations), but we also want a checkpoint to be delivered on 11/26 so we can track progress.

What to Hand In

Hand in the URL of a top-level web page (don't overwrite pages already turned in) that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and the way it is described. At a minimum the web page(s) should explain:
Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers
Fortunately, most entries are zero, since most pages point to ~10 other pages, not 100,000. Thus you'll probably want to use a sparse representation. Some papers on the topic: This one describes Fortran and C libraries for multiplying sparse matrices. Since you'll probably want to run PageRank offline anyway, implementing it in a different language than your crawler shouldn't be a problem. The code (SMMP) is available for download here. This paper has a short explanation of sparse matrix multiplication, but focuses on parallel algorithms. You can probably find more by searching the web. Good luck!
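
To make the sparse-representation point concrete, here is a minimal sketch of PageRank by power iteration over an adjacency list, so that only the links that actually exist are stored and visited. It is one possible approach under stated assumptions, not the method required by the assignment: the function name `pagerank`, the dictionary-of-lists layout, the damping factor of 0.85, and the fixed iteration count are all illustrative choices that do not come from this page.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a sparse (adjacency-list) link graph.

    links maps each page id to the list of page ids it points to.
    The damping value and iteration count are assumed defaults.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Only the nonzero entries (the actual links) are visited here.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page graph, for illustration only.
    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 4))
```

Each iteration costs time proportional to the number of links rather than the square of the number of pages, which is what makes an offline run over 100,000+ pages practical. A dense 100,000 x 100,000 matrix would need 10^10 entries, while roughly 10 links per page gives about 10^6 stored entries.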
Department of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350. (206) 543-1695 voice, (206) 543-2969 FAX [comments to weld]