University of Washington, Department of Computer Science & Engineering
CSE490i Project Part 1: Crawling the Web


Administrivia

Due Date: Tuesday, January 29, 12:00 noon.

Caution: Be sure to read the entire assignment from start to finish before you start working on it; there are numerous cases where a bad design decision early could lead to serious problems at later stages.

Objective: Build an effective Web search site, one whose precision rivals Google, for the subset of pages hosted by servers at the University of Washington.

Groups & Collaboration: You will work in groups on this project. You need to e-mail your TA with information about who will work in which group by Tuesday, January 15. (One e-mail per group is requested.) Groups help brainstorm ideas more effectively, reduce workload, and increase efficiency. However, no group may have more than 3 members. Every group member is responsible for knowing, and being able to explain, what the other group members are doing or have done. Discussions among different groups are allowed, subject to the Gilligan's Island rule and other important directives listed in the class policies. Directly copying another group's work (architectures, diagrams, or code) is cheating and is considered a very serious offense.

Politeness

In this project you're going to construct a web crawler that automatically traverses the web, extracting and storing information about HTML files. For now, do not concern yourself with fancy user interfaces and/or storing data in databases - you will be asked to add some of that functionality as part of follow-on projects.

The first step in building a successful web crawler is to learn proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download large chunks of their sites in the space of a few seconds. Large numbers of hits from the same source can even look like a coordinated denial-of-service attack! So the foremost principle you should be thinking about in this part of the project is how to be a well-behaved net citizen. A few important guidelines to follow are to make sure that your (automated) crawlers:

  • Space out requests to any given server and do not request the same content over and over again!
  • Never hit any given site more than a few times a minute (10 times per minute seems like an absolute maximum here and even that may be too much)!
  • Avoid parts of a server that are restricted (see below).
  • Don't request multiple copies of the same information from a server.
  • Don't request information you don't need (e.g., image files you don't process).

To help you and your crawler learn good manners, we will provide a web server (details to follow) on which you can do limited experiments until you are confident that your code works unobtrusively. Only then should you proceed to hitting real web servers, confident that they will not go after you later. Correctness and safety considerations come first; efficiency only after that!

In addition, many content providers wish to prevent access by automated spiders to certain parts of their sites. To this end, the robots exclusion protocol was created; it allows site owners to specify which parts of the site are accessible to which web robots. A good place to start reading about this is at the Web Robots Pages. There you will find links to the exclusion standard, a list of active robots, and a FAQ. To view a quick example, the CSE department has its own robots.txt file. Another page at the robots site (slightly outdated, but still useful) is the Guidelines page. In addition, you may want to take a look at David Eichmann's Ethical Web Agents paper.

We recommend that you write an API layer on top of HTTP which ensures that you never violate connection-frequency limits or other robots.txt rules (a minimal sketch of such a layer appears after the list below). Failure to observe these principles could have serious consequences:

  • Your account could get turned off by the CSE lab (this has happened in the past for exactly the same reason: complaint by an outside webmaster).
  • The class professor and TA could get reprimanded by the Department Chair.
  • If your account is disabled, it will be very difficult to do work, so you'll likely turn in a poor project and get a bad grade. Don't expect an extension.
  • If the professor and TA get in trouble because of your carelessness, how do you think they will grade your project when end of quarter comes?
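
For concreteness, here is a minimal sketch of such an API layer in Java. The class name PoliteHttpClient and the constant MIN_DELAY_MS are our own inventions (not part of the provided base code), and the sketch only enforces a per-host minimum delay; a real layer would also fetch and obey each site's robots.txt before connecting.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: a thin layer over HTTP that enforces a minimum delay per host.
    // PoliteHttpClient and MIN_DELAY_MS are illustrative names, not provided code.
    public class PoliteHttpClient {
        private static final long MIN_DELAY_MS = 10000;   // at most 6 hits/minute per host
        private final Map<String, Long> lastHit = new HashMap<String, Long>();

        public synchronized InputStream fetch(URL url) throws Exception {
            String host = url.getHost().toLowerCase();
            Long last = lastHit.get(host);
            if (last != null) {
                long wait = MIN_DELAY_MS - (System.currentTimeMillis() - last);
                if (wait > 0) Thread.sleep(wait);          // space out requests to this host
            }
            lastHit.put(host, System.currentTimeMillis());

            // A real layer would also check the host's robots.txt rules here.
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "cse490i-crawler");
            return conn.getInputStream();
        }
    }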

Crawling the Web

Once you have read up on how to write a well-behaved robot, you can start thinking about how you are going to search the web. Where will you start? How do you extract the links from a web page? What about relative links? Once you have a set of links, in which order should you follow them? Given the latency of HTTP, should your spider be following multiple links in parallel? And especially, how will you ensure that your spider is responsible? (Other issues, such as indexing and query processing, will be handled in subsequent parts of the project, so you can ignore them for the most part in this first phase.)

Since the dominant language on the web today is HTML (and XML to a much smaller extent), you will need to be able to parse HTML files. Various HTML parsers exist for practically every language. Java has (at least) one built-in: see javax.swing.text.html.parser.DocumentParser for more details. This interface allows you to register callbacks that are called when a tag is opened or closed; you can then build up a tree-node representation of the document, or process only certain tags, as necessary. The base code we will be providing also has functionality for parsing web pages. Of course, you are welcome to use any HTML parser you wish (provided you give proper credit to its authors) or build one yourself.
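
As one concrete (and purely illustrative) way to use that built-in parser, the sketch below drives javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback, collects the href attribute of each <a> tag, and resolves relative links against the page's own URL; it is not the provided base code.

    import java.io.Reader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Sketch only: extract absolute link URLs from a single HTML page.
    public class LinkExtractor {
        public static List<URL> extractLinks(final URL base, Reader html) throws Exception {
            final List<URL> links = new ArrayList<URL>();
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            try {
                                links.add(new URL(base, href.toString()));  // resolves relative links
                            } catch (Exception e) { /* ignore malformed links */ }
                        }
                    }
                }
            };
            new ParserDelegator().parse(html, callback, true);
            return links;
        }
    }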

In its simplest form, a web crawler consists of a queue of URLs, a component to download a URL, and components to parse and analyze the downloaded page (which can include extracting links to add to the URL queue and indexing the text content for later search). Of course, there are many details, such as how to select URLs off the queue to maximize performance without violating politeness, how to implement multiple parallel downloads and analyses, and how to avoid duplicating work (like revisiting URLs or visiting mirrored pages). There are many solutions to these problems, and different ones may be appropriate for different tasks and situations. An excellent example of a web crawler where each of these design decisions is addressed is Mercator, a scalable and extensible web crawler developed at Compaq.
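
Put together, that simplest form might look roughly like the following sketch, which reuses the hypothetical PoliteHttpClient and LinkExtractor classes sketched above and keeps a set of visited URLs; it is a starting point, not a recommended final design.

    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;

    // Sketch only: a single-threaded crawl loop built from a URL queue,
    // a download component, and a link-extraction component.
    public class SimpleCrawler {
        public static void crawl(URL seed, int maxPages) throws Exception {
            PoliteHttpClient http = new PoliteHttpClient();     // hypothetical, sketched earlier
            Queue<URL> frontier = new LinkedList<URL>();
            Set<String> visited = new HashSet<String>();

            frontier.add(seed);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                URL url = frontier.remove();
                if (!visited.add(url.toString())) continue;     // skip already-visited URLs

                InputStreamReader page = new InputStreamReader(http.fetch(url));
                for (URL link : LinkExtractor.extractLinks(url, page)) {
                    // Stay inside the allowed domain (adjust for the test site, then cse/uw).
                    if (link.getHost().endsWith(".washington.edu")) frontier.add(link);
                }
                page.close();
                // Indexing of page text is deferred to later parts of the project.
            }
        }
    }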

Plan for extensibility

Future projects might include extensions such as improved indexing of web pages, a database back-end for storing crawler state (such as indexed pages and the URL queue), or analysis of the way pages are linked together and the type of content they contain. Be prepared for these types of extensions.

In addition to the links it provides, each web page you parse could also contain relevant information you might wish to store. For this assignment you will only need to analyze the HTTP header and extract links, but for future assignments you will need to index the text on a page.
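
A plain java.net.HttpURLConnection already exposes the header fields you need; the short sketch below (our own illustration) issues a HEAD request and records the Content-Type and Server headers, which also feed the web-server statistics requested later in this assignment.

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch only: read a few HTTP response headers without downloading the body.
    public class HeaderInspector {
        public static void inspect(URL url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");                    // headers only, no body
            String contentType = conn.getContentType();       // e.g. "text/html"
            String server = conn.getHeaderField("Server");    // e.g. "Apache/1.3.x"
            int length = conn.getContentLength();             // -1 if unknown
            System.out.println(url + " -> " + contentType + ", " + server + ", " + length + " bytes");
        }
    }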

Scaling up the Crawler

Let us now imagine you already have a functional crawler. To build up a large collection of data quickly, a good idea is to "parallelize" your crawler so that it visits multiple sites simultaneously. You can do this by using threads.

Caveat: With multiple threads running at the same time, it is much easier to forget the politeness policies and start sending too many requests (the total from all your threads) to a given site in a short time interval. Be careful not to fall victim to this mistake!
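
One simple way to avoid that mistake is to funnel every worker thread through a single, shared rate limiter, so the per-host delay counts requests from all threads combined. The sketch below is our own illustration (the class and method names are invented); each worker reserves the next polite time slot for a host before downloading, and the download/parse step is left as a placeholder.

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch only: several download threads sharing one per-host rate limiter,
    // so the combined request rate to any one site stays polite.
    public class ParallelCrawler {
        private static final long MIN_DELAY_MS = 10000;
        private final Map<String, Long> nextAllowedHit = new HashMap<String, Long>();
        private final BlockingQueue<URL> frontier = new LinkedBlockingQueue<URL>();

        // Reserve the next polite time slot for this host; returns how long to sleep first.
        private synchronized long reserveSlot(String host) {
            long now = System.currentTimeMillis();
            Long next = nextAllowedHit.get(host);
            long allowed = (next == null || next < now) ? now : next;
            nextAllowedHit.put(host, allowed + MIN_DELAY_MS);
            return allowed - now;
        }

        public void start(int numThreads) {
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            for (int i = 0; i < numThreads; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                URL url = frontier.take();                 // blocks until work arrives
                                Thread.sleep(reserveSlot(url.getHost()));  // sleep outside the lock
                                // Download and parse here (e.g. with the PoliteHttpClient and
                                // LinkExtractor sketches above); new links go back onto frontier.
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();            // allow clean shutdown
                        }
                    }
                });
            }
        }
    }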

Project Specifics

We provide a working crawler, which you are expected to extend. You may also choose to write your own "core" crawler (in either Java or C++), but be aware that there is plenty of challenge in this project already, so you might be taking on too much work.

You have a choice of two systems upon which to base your crawler. Both are primarily designed for crawling a single website. Part of your job will be extending them to be both polite and scalable enough to crawl out on the web. The first, WebSphinx, is a Java-based system. It is a research and teaching system, so it's fairly simple and written with extensibility in mind. The second, ht://dig, is a C++-based crawler which more closely resembles a production system. This means that it's a little more powerful and also more complex.

Your assignment is:

  • Extend the crawler to do parallel downloads, while maintaining politeness.
  • Use your crawler to crawl a test web site, which we will provide. Make sure that it never leaves this domain until you have verified that it is working correctly.
  • Once you are sure it is OK, set it loose on the CSE and later the UW domain. Keep it limited to hosts in these domains. Your objective is to produce a variety of statistics, both about the web sites and about your crawler. Some of these statistics involve timing, so you will have to instrument your crawler with timing code.
  • Implement one or more enhancements to your system:
    • Fingerprinting - a method for identifying duplicate content (a sketch appears after this list).
    • Smarter search control - can you improve on the standard breadth- or depth-first search order?
    • Spam detection.
    • Any other extension you can think of, just OK it with us first.
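
To make the fingerprinting option concrete: one common approach, sketched below purely as an illustration (the class name and normalization choices are ours), is to hash each page's text after collapsing whitespace and treat any page whose digest has been seen before as a duplicate.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch only: detect exact-duplicate page content via an MD5 fingerprint.
    public class Fingerprinter {
        private final Set<String> seen = new HashSet<String>();

        // Returns true if this content was already crawled under another URL.
        public boolean isDuplicate(String pageText) throws Exception {
            String normalized = pageText.replaceAll("\\s+", " ").trim().toLowerCase();
            byte[] digest = MessageDigest.getInstance("MD5").digest(normalized.getBytes("UTF-8"));
            String fingerprint = new BigInteger(1, digest).toString(16);
            return !seen.add(fingerprint);
        }
    }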

What to Hand In

Hand in the URL of a top-level web page that lists your team name and contact information for each member. Remember, you will be graded both on the quality of the artifact you have built and on the way it is described. Make sure your explanations are clear and that experimental results are presented in a way that is easy to understand. At a minimum, the web page(s) should include:

  1. A description of your crawler. How does it work? Discuss the method for ensuring politeness (i.e. safety of target web sites), the web search strategy, the approach to extraction of content, as well as the kinds of problems you faced and how you attacked them.
  2. The full source code of your crawler, along with instructions (in a README file) about building and running it.
  3. A set of metrics on your crawler's performance, both with and without your extension. This data should be presented in tabular format as follows: there should be a row for each statistic in the list below, and a column for the basic crawler as well as one (or more) for the crawler with extensions activated.
    • CPU Time consumed
    • Number of pages visited
    • CPU time per page
    • Min., Max. and Avg. links per page
    • Number of unique servers visited
    • How many of each type of web server (e.g., Apache, IIS) were encountered
    • Min., Max. and Avg. delays between hits to a given server
    • Some set of statistics to demonstrate the benefit of your extension. This could be number of duplicate pages found if you did fingerprinting, number of unique pages crawled if you did search control, number of spamming sites identified, etc.
  4. Optional: record linkage information for the UW domain. For each host server found, how many pages on other servers point to this server? Return the top 25 linked sites in sorted order. (A simple tallying sketch follows this list.)
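
If you attempt the optional linkage statistics, the bookkeeping can stay simple; the sketch below (our own illustration, which counts cross-server links rather than deduplicating by source page) tallies in-links per target host and prints the top 25.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: tally cross-server links per target host and report the top 25.
    public class LinkageStats {
        private final Map<String, Integer> inLinks = new HashMap<String, Integer>();

        public void recordLink(URL from, URL to) {
            if (from.getHost().equalsIgnoreCase(to.getHost())) return;   // same server: ignore
            String host = to.getHost().toLowerCase();
            Integer n = inLinks.get(host);
            inLinks.put(host, n == null ? 1 : n + 1);
        }

        public void printTop25() {
            List<Map.Entry<String, Integer>> entries =
                    new ArrayList<Map.Entry<String, Integer>>(inLinks.entrySet());
            Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();                  // descending by count
                }
            });
            for (int i = 0; i < Math.min(25, entries.size()); i++) {
                System.out.println(entries.get(i).getKey() + " " + entries.get(i).getValue());
            }
        }
    }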

Note: If you get stuck or can't complete every part of the assignment, do as much as you can. If you try an ambitious method for information extraction, we understand you may not have as much time for other parts and will take this into account. Partial credit (and extra credit!) will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more credit for finding a solution or workaround) - but keep working on other parts of the assignment.

Additional Useful Pointers

Good luck!

