Project Part 1 - Crawler Speedup

Due: October 22, 12:00 noon. See below for hand-in details.

Overview

To get you started, we are providing a simple and functional crawler. Your assignment is to improve its performance by 1) modifying the queue structure to allow multiple threads to crawl in parallel (while maintaining politeness!), 2) implementing finger-printing so that duplicate pages (reached via different URLs) are ignored, and 3) implementing checkpointing so the crawler can resume a crawl after a crash. You'll conduct some simple experiments to demonstrate the effectiveness of your modifications, and you'll turn in a writeup explaining what you did. We grade on the quality of your code, the attention to the experimental demonstration, and the quality of your written exposition. Since there is room for many other improvements, we will award extra credit to people who come up with additional improvements and explain them well. Despite our best efforts, the code we are handing out is likely buggy. People who find and report bugs to the class mailing list help everyone, and we will track and reward this behavior as well.

Before starting this assignment, be sure to completely read the Mercator paper. The crawler we are using is an enhanced and debugged version of Robert Miller's WebSphinx, implemented at CMU and originally reported in a paper at WWW7. Unfortunately, although WebSphinx is nicely designed, it was intended for a different purpose: mapping a small website and providing an attractive graphical display. This mapping and display functionality used a huge amount of memory and rendered WebSphinx unsuitable for large-scale crawling. Previous 454 students were very unhappy with WebSphinx. As a result, we have greatly modified the code, eliminating unnecessary functionality, fixing many memory problems, and making some other improvements in order to improve scalability.
A document describing the resulting crawler is the best place to start reading.

Politeness

The first step in building a successful web crawler is to learn the proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of a site in the space of a few seconds. Large numbers of hits from the same source can even look like a coordinated denial-of-service attack! So the foremost principle you should be thinking about in this part of the project is how to be a well-behaved net citizen. A few important guidelines to follow are to make sure that your (automated) crawlers:
In addition, many content providers wish to prevent access by automated spiders to certain parts of their sites. To this end, the robots exclusion protocol was created; it allows site owners to specify which parts of the site are accessible to which web robots. A good place to start reading about this is at the Web Robots Pages. There you will find links to the exclusion standard, a list of active robots, and a FAQ. To view a quick example, the CSE department has its own robots.txt file. Another page at the robots site (slightly outdated, but still useful) is the Guidelines page. In addition, you may want to take a look at David Eichmann's Ethical Web Agents paper. Failure to observe these principles could have serious consequences:
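As a concrete illustration of the politeness principle, here is a minimal sketch of how a crawler might enforce a fixed delay between successive hits on the same host. The class name, method name, and the 2-second gap are all illustrative assumptions; nothing here is part of the WebSphinx code you are given.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks the last planned access time per host and
// tells a thread how long to sleep before hitting that host again.
public class HostDelay {
    private final long minGapMillis;                 // e.g. 2000 ms between hits to one host
    private final Map<String, Long> lastAccess = new HashMap<>();

    public HostDelay(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    // Returns how many milliseconds the caller should wait before fetching
    // from this host, and records the resulting (planned) access time.
    public synchronized long requiredWait(String host, long nowMillis) {
        Long last = lastAccess.get(host);
        long wait = (last == null) ? 0 : Math.max(0, last + minGapMillis - nowMillis);
        lastAccess.put(host, nowMillis + wait);
        return wait;
    }
}
```

A real crawler would of course also honor robots.txt before fetching at all; this sketch covers only the rate-limiting half of politeness.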
Getting Started

Your work areas are in either:

/projects/instr/cse454-1

Please use the following machines to test your crawler (especially for long runs and experiments). Log in as normal and then, at the prompt, ssh into one of these two machines:

rockhopper

Find your directory and copy proj1 into it. The code should provide a simple and functional crawler. Get very familiar with
websphinx/Crawler.java. You will have to modify (and probably add) methods in this class. Here is an API created by javadoc that will show the other classes that you will use, but most likely just at an interface level.
Compile in the directory containing the source:

javac PoliteCrawler.java

To run the crawler:

java -Xmx256m PoliteCrawler

The -Xmx256m option allows the use of more memory, which may be useful later, but isn't necessary to begin. If you want to brush up on Java, there are many books, but my favorite is Arnold and Gosling's.
For this assignment, please make sure not to let your crawler touch any servers outside the CS department's domain. Most (or all) code modifications should take place within websphinx/Crawler.java.
What to Hand In

Hand in the URL of a top-level web page that lists your name and contact information as well as the information described below. Remember, you will be graded both on the quality of the artifact you have built and the way it is described. Make sure your explanations are clear and your experimental results are presented in a way that is easy to understand. Separate the answers to the parts below in a clear manner.

Part One: Politeness

Is the crawler polite? Explain.

To Hand In: A brief English answer (covering all aspects of politeness) and explanation. If changes are required to make the crawler polite, show them and explain.

Part Two: Focus

Modify the code so that it is possible to search from a given root URL and only visit pages in the root's directory or in a subdirectory of the directory containing the root. I.e., limit search to URLs whose directory stem is a superset of the root's. The current implementation starts at the given root URL (set in PoliteCrawler.java), but then only verifies that a link's host matches the root host (also set in PoliteCrawler.java). Being able to limit search in this fashion may help you debug and understand the crawling behavior, but you will eventually (only when confident) want to let your crawler roam more freely within the CS domain.
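One way the directory-stem check could be sketched is below. The class and method names are hypothetical (this is not the required design, just an illustration of the scope test under the assumption that "directory stem" means the root URL's path up to its final slash).

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: decides whether a link stays within the root
// URL's directory or a subdirectory of it, on the same host.
public class FocusFilter {
    private final String rootHost;
    private final String rootDir;   // root's path up to and including the last '/'

    public FocusFilter(String rootUrl) {
        URL u;
        try {
            u = new URL(rootUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad root URL", e);
        }
        rootHost = u.getHost();
        String path = u.getPath();
        rootDir = path.substring(0, path.lastIndexOf('/') + 1);
    }

    // A link is in scope when its host matches the root's host and its
    // path begins with the root's directory stem.
    public boolean inScope(String linkUrl) {
        try {
            URL u = new URL(linkUrl);
            return u.getHost().equals(rootHost) && u.getPath().startsWith(rootDir);
        } catch (MalformedURLException e) {
            return false;   // unparsable links are simply skipped
        }
    }
}
```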
To Hand In: A listing of the code you wrote or modified for this part of the assignment. Make sure that your additions are clearly marked.

Part Three: Finger-printing

Implement finger-printing so that pages which are reached via different URLs, but which contain the same content, are ignored. Remember, you don't need to write the detailed hashing code from scratch if you can find it on the web (just be clear about where you got the code if you didn't write it yourself).

To Hand In: A listing of the code you wrote, downloaded, or modified for this part of the assignment. Also include a brief description of your approach. What algorithm did you use for fingerprinting, and why was it a good choice?

Part Four: Parallel Crawling

Reorganize the queue data structure so that the crawler has improved parallelism while still ensuring politeness. Currently a simple, single queue (implemented as a Vector) is used; it does not take advantage of any multithreading (which is seen since each thread locks the Crawler class while it fetches a link, downloads the page, examines it, and then does any updates) and may have problems with memory.
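The finger-printing from Part Three, for reference, might be sketched as follows: hash each page body and keep the digests in a set, skipping any page whose digest has been seen. MD5 is used here only as an example of a readily available strong hash; the class name is hypothetical and this is not the assignment's required design.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers an MD5 fingerprint of every page body
// seen so far; seenBefore() returns true when an identical body was
// already crawled under some other URL.
public class PageFingerprints {
    private final Set<String> seen = new HashSet<>();

    public synchronized boolean seenBefore(String pageBody) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(pageBody.getBytes(StandardCharsets.UTF_8));
            String fp = new BigInteger(1, digest).toString(16);
            return !seen.add(fp);   // Set.add() is false when already present
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);  // MD5 is guaranteed on every JVM
        }
    }
}
```

Storing the fixed-size digest rather than the page text keeps memory bounded, which is why fingerprinting beats naive content comparison at crawl scale.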
Create an interface for a more sophisticated queuing system and implement a queue that takes advantage of the multiple threads. Perhaps use a separate sub-queue for each thread, using a hash function in order to ensure that no host is ever added to more than one queue (facilitating politeness). Note that each thread (Worm) has a unique ID. This may suggest an interface such as:

void BetterQueue.init(int numWorms);
void BetterQueue.add(Link link, int wormID);
Link BetterQueue.remove(int wormID);

Here are three other issues to consider when working on this part of the project:
To complete this part of the project, you will need to use multiple threads and synchronization. The existing code has examples of how the threads are created and how synchronization is used. Remember that data members of the Crawler class are shared among all of the threads.
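To make the hash-partitioned idea concrete, here is one possible sketch of the BetterQueue interface suggested above. The Link type is reduced to a (host, url) pair for illustration, and the coarse synchronized methods are a simplifying assumption; a real implementation would want finer-grained locking per sub-queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: one sub-queue per Worm thread, with links routed by a hash of
// their host so that any given host only ever lands in one sub-queue
// (and is therefore only ever fetched by one thread -- politeness).
public class BetterQueue {
    public static final class Link {
        final String host;
        final String url;
        public Link(String host, String url) { this.host = host; this.url = url; }
    }

    private Deque<Link>[] queues;

    @SuppressWarnings("unchecked")
    public synchronized void init(int numWorms) {
        queues = new Deque[numWorms];
        for (int i = 0; i < numWorms; i++) {
            queues[i] = new ArrayDeque<>();
        }
    }

    // Routing deliberately ignores which worm discovered the link: the
    // host's hash alone picks the sub-queue, so every URL on a host is
    // handled by the same thread regardless of who found it.
    public synchronized void add(Link link, int wormID) {
        int q = Math.abs(link.host.hashCode() % queues.length);
        queues[q].addLast(link);
    }

    // Each worm removes only from its own sub-queue; null means "empty".
    public synchronized Link remove(int wormID) {
        return queues[wormID].pollFirst();
    }
}
```

Because each worm drains only its own sub-queue, worms never contend for the same host, which is exactly the property that lets per-host politeness survive parallelism.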
To Hand In: A listing of the code you wrote or modified for this part of the assignment. Also include a brief description of your approach. Besides an overall explanation, please include answers to the following questions: 1) What did you synchronize, and why? 2) How does your design improve crawling throughput? 3) Why are you sure that the code is deadlock-free? 4) How is politeness maintained?

Part Five: Checkpointing

Implement check-pointing so that a crawler can be restarted and continue at the point where it left off or crashed. You'll need to figure out how to get the threads to pause until they all reach a consistent state, and then serialize all of the appropriate data structures. Provide some mechanism (such as an optional command-line argument) so that the user can set the frequency with which checkpoints occur.

To Hand In: A listing of the code you wrote or modified for this part of the assignment. Also include a brief description of your approach. Besides an overall explanation, please include answers to the following questions: 1) Did you have to do more synchronization? What and why? 2) If so, why are you sure that the code is deadlock-free? 3) How did you test whether checkpointing worked correctly?

Part Six: Optional

List any other innovations (if any) which you pursued.

Part Seven: Experiments

Once you have your crawler enhanced as above, start it on a long crawl over the CS domain, but still don't let it loose outside the CS department. Compare its performance to that of the raw crawler we gave you.

To Hand In: A set of metrics comparing the performance of the two crawlers. This data should be presented in tabular format as follows: there should be a row for each statistic in the list below, and a column for the basic crawler as well as one for the crawler with extensions activated.
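Looking back at the serialization step in Part Five, the save-and-restore half of checkpointing might be sketched as below, under two simplifying assumptions: the worker threads have already been paused in a consistent state, and the frontier is reduced to a plain list of URL strings. The class name and temp-file-then-rename scheme are illustrative choices, not a required design.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: persists the (quiescent) crawl frontier so a
// restarted crawler can pick up where the previous run stopped.
public class Checkpoint {
    // Serialize to a temp file first, then rename over the old checkpoint,
    // so a crash mid-write never corrupts the last good checkpoint.
    public static void save(List<String> frontier, File file) {
        File tmp = new File(file.getPath() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject(new ArrayList<>(frontier));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!tmp.renameTo(file)) {
            throw new UncheckedIOException(new IOException("rename failed: " + tmp));
        }
    }

    @SuppressWarnings("unchecked")
    public static List<String> restore(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A full implementation would also checkpoint the visited-URL set and fingerprints, and would need the barrier logic that quiesces the threads before save() runs.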
Groups & Collaboration

You will work alone on this first part of the project. Discussions among different students are allowed, subject to the Gilligan's Island rule and other important directives listed in the class policies. Directly copying another person's work (architectures, diagrams, or code) is cheating and is considered a very serious offense.

Important Note

If you get stuck or can't complete every part of the assignment, do as much as you can. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.
Department of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, (206) 543-1695 voice, (206) 543-2969 FAX