Administrivia

Due Date: Tuesday, January 29, 12:00 noon.

Caution: Be sure to read the entire assignment from start to finish before you start working on it; there are numerous cases where a bad design decision made early on could lead to serious problems at later stages.

Objective: Build an effective Web search site, one whose precision rivals Google, for the subset of pages hosted by servers at the University of Washington.

Groups & Collaboration: You will work in groups on this project. You need to e-mail your TA with information about who will work in which group by Tuesday, January 15. (One e-mail per group is requested.) Groups help brainstorm ideas more effectively, reduce workload, and increase efficiency. However, no group may have more than 3 members. Every group member is responsible for knowing, and being able to explain, what the other group members are doing or have done. Discussions among different groups are allowed, subject to the Gilligan's Island rule and other important directives listed in the class policies. Directly copying another group's work (architectures, diagrams, or code) is cheating and is considered a very serious offense.

Politeness

In this project you're going to construct a web crawler that automatically traverses the web, extracting and storing information about HTML files. For now, do not concern yourself with fancy user interfaces and/or storing data in databases; you will be asked to add some of that functionality in follow-on projects.

The first step in building a successful web crawler is to learn proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of their sites in the space of a few seconds. Large numbers of hits coming from the same source can even look like a coordinated denial-of-service attack! So the foremost principle you should be thinking about in this part of the project is how to be a well-behaved net citizen. A few important guidelines to follow are to make sure that your (automated) crawlers:
To help you and your crawler learn good manners, we will provide a web server (details to follow) on which you can run limited experiments until you are confident that your code behaves unobtrusively. Only then should you proceed to hitting real web servers, confident that they will not come after you later. Correctness and safety come first; efficiency only after that!

In addition, many content providers wish to prevent access by automated spiders to certain parts of their sites. To this end, the robots exclusion protocol was created; it allows site owners to specify which parts of a site are accessible to which web robots. A good place to start reading about this is the Web Robots Pages. There you will find links to the exclusion standard, a list of active robots, and a FAQ. For a quick example, the CSE department has its own robots.txt file. Another page at the robots site (slightly outdated, but still useful) is the Guidelines page. You may also want to take a look at David Eichmann's Ethical Web Agents paper.

We recommend that you write an API layer on top of HTTP which ensures that you never violate frequency-of-connection limits or other robots.txt principles. Failure to observe these principles could have serious consequences:
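As one illustration of such a layer, here is a minimal sketch of a politeness wrapper in Java. The class name, the five-second delay, and the simplified robots.txt handling are our own illustrative assumptions, not part of the provided base code; a real implementation would need more complete robots.txt parsing, caching, and error handling.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical politeness layer: all page fetches go through fetch(). */
public class PoliteFetcher {
    private static final long MIN_DELAY_MS = 5000;         // illustrative: at least 5 s between hits to one host
    private final Map<String, Long> lastAccess = new HashMap<>();          // host -> time of last request
    private final Map<String, List<String>> disallowed = new HashMap<>();  // host -> Disallow prefixes

    /** Download a page, but only if robots.txt allows it and the host has not been hit recently. */
    public synchronized String fetch(URL url) throws Exception {
        String host = url.getHost();
        if (!disallowed.containsKey(host)) {
            loadRobotsTxt(host);                            // fetch robots.txt the first time we see a host
        }
        for (String prefix : disallowed.get(host)) {
            if (url.getPath().startsWith(prefix)) {
                return null;                                // excluded by the robots exclusion protocol
            }
        }
        Long last = lastAccess.get(host);
        if (last != null) {
            long wait = MIN_DELAY_MS - (System.currentTimeMillis() - last);
            if (wait > 0) {
                Thread.sleep(wait);                         // throttle: never hammer the same server
            }
        }
        lastAccess.put(host, System.currentTimeMillis());
        return readAll(url);
    }

    /** Record the Disallow entries that apply to all robots (User-agent: *). Simplified parsing. */
    private void loadRobotsTxt(String host) {
        List<String> rules = new ArrayList<>();
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream()));
            String line;
            boolean applies = false;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    applies = line.substring(11).trim().equals("*");
                } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) rules.add(path);
                }
            }
            in.close();
        } catch (Exception e) {
            // no robots.txt (or unreadable): fall through with no extra restrictions
        }
        disallowed.put(host, rules);
    }

    private String readAll(URL url) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) page.append(line).append('\n');
        in.close();
        return page.toString();
    }
}
```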
Crawling the Web

Once you have read up on how to write a well-behaved robot, you can start thinking about how you are going to search the web. Where will you start? How do you extract the links from a web page? What about relative links? Once you have a set of links, in which order should you follow them? Given the latency of HTTP, should your spider be following multiple links in parallel? And especially, how will you ensure that your spider is responsible? (Other issues, such as indexing and query processing, will be handled in subsequent parts of the project, so you can ignore them for the most part in this first phase.)

Since the dominant language on the web today is HTML (and, to a much smaller extent, XML), you will need to be able to parse HTML files. HTML parsers exist for practically every language. Java has (at least) one built in: see javax.swing.text.html.parser.DocumentParser for details. This parser allows you to register callbacks that are invoked when a tag is opened or closed; you can then build up a tree representation of the document, or process only certain tags, as necessary. The base code we will be providing also has functionality for parsing web pages. Of course, you are welcome to use any HTML parser you wish (provided you give proper credit to its authors) or build one yourself.

In its simplest form, a web crawler consists of a queue of URLs, a component to download a URL, and components to parse and analyze the downloaded page (which can include extracting links to add to the URL queue and indexing the text content for later search); a minimal sketch appears below, under Project Specifics. Of course, there are many details, such as how to select URLs off the queue to maximize performance without violating politeness, how to implement multiple parallel downloads and analyses, and how to avoid duplicating work (like revisiting URLs or visiting mirrored pages). There are many solutions to these problems, and different ones may be appropriate for different tasks and situations. An excellent example of a web crawler in which each of these design decisions is addressed is Mercator, a scalable and extensible web crawler developed at Compaq.

Plan for extensibility

Future projects might include extensions such as improved indexing of web pages, using a database back-end to store crawler state (such as indexed pages and the URL queue), or analysis of the way pages are linked together or the type of content they contain. Be prepared for these types of extensions. In addition to the links it provides, each web page you parse could also contain relevant information you might wish to store. For this assignment you will only need to analyze the HTTP header and extract links, but for future assignments you will need to index the text on a page.

Scaling up the Crawler

Let us now imagine you already have a functional crawler. To build up a large collection of data quickly, a good idea is to "parallelize" your crawler so that it visits multiple sites simultaneously. You can do this by using threads. Caveat: with multiple threads running at the same time, it is much easier to forget the much-needed politeness policies and start sending too many requests (the total from all your threads) to a given site in a short time interval. Be careful not to fall victim to this mistake!

Project Specifics

We provide a working crawler, which you are expected to extend. You may also choose to write your own "core" crawler (in either Java or C++), but be aware that there is plenty of challenge in this project already, so you might be creating too much work for yourself.
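To make the queue-plus-download-plus-parse picture from the Crawling the Web section concrete, here is a minimal, single-threaded sketch in Java. The class names (SimpleCrawler, and the PoliteFetcher from the politeness sketch above) are our own illustrative assumptions and do not correspond to the provided base code; the link extraction uses the built-in Swing HTML parser callbacks mentioned earlier.

```java
import java.io.StringReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical crawler core: a frontier queue, a fetch step, and a link-extraction step. */
public class SimpleCrawler {
    private final Queue<URL> frontier = new LinkedList<>();    // URLs waiting to be visited
    private final Set<String> seen = new HashSet<>();          // avoid revisiting the same URL
    private final PoliteFetcher fetcher = new PoliteFetcher(); // politeness layer from the sketch above

    public void crawl(URL seed, int maxPages) throws Exception {
        frontier.add(seed);
        seen.add(seed.toString());
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxPages) {
            URL url = frontier.remove();
            String html;
            try {
                html = fetcher.fetch(url);                  // null if robots.txt forbids it
            } catch (Exception e) {
                continue;                                   // unreachable page: skip and move on
            }
            if (html == null) continue;
            visited++;
            for (URL link : extractLinks(url, html)) {      // enqueue new, unseen links
                if (seen.add(link.toString())) {
                    frontier.add(link);
                }
            }
        }
    }

    /** Use the built-in Swing HTML parser's callbacks to pull href attributes out of <a> tags. */
    private List<URL> extractLinks(final URL base, String html) throws Exception {
        final List<URL> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        try {
                            links.add(new URL(base, href.toString())); // resolves relative links
                        } catch (Exception ignored) { /* malformed URL: skip it */ }
                    }
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }
}
```

A scaled-up version would run several such workers sharing a synchronized frontier and seen set, which is exactly where the politeness caveats above become critical.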
You have a choice of two systems upon which to base your crawler. Both are primarily designed for crawling a single website; part of your job will be extending them to be both polite and scalable enough to crawl out on the web. The first, WebSphinx, is a Java-based system. It is a research and teaching system, so it is fairly simple and written with extensibility in mind. The second, ht://dig, is a C++-based crawler that more closely resembles a production system. This means that it is a little more powerful and also more complex. Your assignment is:
What to Hand In

Hand in the URL of a top-level web page that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and on the way it is described. Make sure your explanations are clear and your experimental results are presented in a way that is easy to understand. At a minimum the web page(s) should explain:
Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand that you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround), but keep working on other parts of the assignment.

Additional Useful Pointers
Good luck!