Polite Crawler
Paul Han & Jae-Hoon Sul
April 23rd, 2002
Abstract
This paper describes Polite Crawler, a web crawler designed to behave politely. It is based on WebSPHINX, a simple crawler framework written in Java. The paper describes the major components of Polite Crawler, our approach to implementing "politeness", and our solutions to various problems we encountered, such as memory leaks in the crawler.
1. Introduction
The first thing we learned when implementing our crawler was the proper etiquette of a web crawler. Since almost all webmasters do not want web crawlers to overuse their resources (e.g., downloading all of their pages in seconds), it is crucial that our crawler behaves politely. The meaning of "politeness" is discussed in section 2.
Since WebSPHINX is written in Java, our crawler is also written entirely in Java. Although WebSPHINX provides a number of useful features, such as support for multiple threads and the robots exclusion standard, it also has a serious problem: memory leaks. The original version of WebSPHINX can visit only a few hundred pages because of this problem. We used various techniques to address it, which are described in section 3.4.
In this paper, we assume that the reader is familiar with Java and knows how web crawlers generally work. Please read the Mercator paper if you want to learn more about web crawlers.
2. Politeness
The specific meaning of "politeness" includes the following aspects:
- Space out requests to any given server, and do not request the same content over and over again.
- Never hit any given site more than a few times a minute (our crawler visits the same server at most 10 times per minute).
- Avoid parts of a server that are restricted (robots exclusion protocol).
- Don't request multiple copies of the same information from a server.
- Don't request information we don't need.
- Only visit URLs that specify the HTTP protocol.
- Don't visit dynamic web pages (cgi, asp, php, etc.).
- Use our email address as the user agent field of the crawler.
The complete description of "politeness", however, should not be restricted to the list above, as there may well be other behaviors that cause webmasters to be unsatisfied with how crawlers behave. In fact, we added the last three rules only after we had run our crawler a number of times. Thus it is important to know how your crawler is working before actually running it.
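The first two rules amount to spacing out requests per host. The sketch below shows one simple way to enforce such a delay; the HostThrottle class and its fields are hypothetical illustrations and are not part of WebSPHINX or Polite Crawler.

import java.util.HashMap;
import java.util.Map;

public class HostThrottle {
    // 6 seconds between requests -> at most 10 hits per minute on one server
    private static final long MIN_DELAY_MS = 6000;
    private final Map lastRequest = new HashMap();   // host -> time of last request (Long)

    // Block until it is polite to contact the given host again.
    public synchronized void waitForTurn(String host) throws InterruptedException {
        Long last = (Long) lastRequest.get(host);
        long now = System.currentTimeMillis();
        if (last != null && now - last.longValue() < MIN_DELAY_MS) {
            Thread.sleep(MIN_DELAY_MS - (now - last.longValue()));
            now = System.currentTimeMillis();
        }
        lastRequest.put(host, new Long(now));
    }
}

This simplified version holds one lock for all hosts while sleeping; a real crawler would typically keep a per-host delay so worms fetching from different servers do not block each other.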
3. Architecture of Polite Crawler
3.1 Overview
Below is a diagram that shows how Polite Crawler works. There are a number of components in Polite Crawler, and each is described in detail below.
3.2 Crawler (run())
When the run method is called, the crawler creates and initializes all threads (or worms) and fetch queues, and submits a root URL to start crawling from. It then enters an infinite loop that checks whether crawling is finished. Each time the threads visit 1,000 more pages, the crawler saves the current statistics to disk. It stores 10 statistics snapshots in one file, so statistics for the first 10,000 pages are saved in stats0.txt, statistics for the next 10,000 pages in stats1.txt, and so on. Since the crawler can be stopped by problems such as out-of-memory errors and unexpected exceptions, it also periodically saves its state to disk, so that the next time we run the crawler it can continue from the saved information rather than starting from the beginning (you must do this!).
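As a rough illustration, the main loop might be structured as in the following sketch; the method names (initWorms, crawlFinished, saveStatistics, saveCheckpoint, and so on) are placeholders, not the actual WebSPHINX or Polite Crawler API.

// Hypothetical outline of Crawler.run(); all helper names are illustrative.
public void run() {
    initWorms();                        // create and start the worker threads (worms)
    submit(getRoot());                  // seed the fetch queues with the root URL
    int lastSaved = 0;
    while (!crawlFinished()) {          // loop until the queues are empty and all worms are idle
        int visited = getPagesVisited();
        if (visited - lastSaved >= 1000) {
            saveStatistics(visited);    // 10 snapshots per file: stats0.txt, stats1.txt, ...
            saveCheckpoint();           // crash recovery: resume instead of restarting
            lastSaved = visited;
        }
        try { Thread.sleep(500); } catch (InterruptedException e) { return; }
    }
}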
3.3 Worm
3.3.1 Fetch method
Worm is a subclass of the Java Thread class. The crawler creates the worms and calls their run methods. Each worm calls the fetch method, which is an infinite loop: it gets a link to crawl from the fetch queue, saves the downloaded page to disk, and expands the links found on the page.
Once the worm gets a link to crawl, a timer is set and the worm tries to download the page. If the timeout period passes before the page is downloaded, the worm dies and a new worm is started to continue crawling.
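The following sketch shows one way such a watchdog timer could be arranged around the download; the Worm reference, TIMEOUT_MS, download, and replaceWorm are illustrative assumptions, and the real mechanism in the code may differ.

// Rough sketch of the per-download timeout inside a worm (names are placeholders).
final Worm worm = this;
java.util.Timer watchdog = new java.util.Timer();
watchdog.schedule(new java.util.TimerTask() {
    public void run() {
        worm.stop();                  // give up on this worm ...
        crawler.replaceWorm(worm);    // ... and start a fresh one to keep crawling
    }
}, TIMEOUT_MS);
try {
    Page page = download(link);       // may block on a slow or unresponsive server
    // ... save and process the page ...
} finally {
    watchdog.cancel();                // finished in time: disarm the watchdog
}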
3.3.2 Downloading pages
A number of exceptions can occur while downloading pages. One of them is robots exclusion: there are pages that robots should not crawl, so such a page is not downloaded and an exception is thrown instead. In addition, various network-related exceptions can occur as well. All of them are caught, and the worm simply gets the next link to crawl. When the worm successfully downloads a page, the page is parsed and stored in the Link class. Then the process method is called to process the page. When processing is done, the content of the page is discarded, and the worm goes back to the beginning of the loop to fetch a new link to crawl.
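In outline, the download step of the worm loop behaves roughly like the sketch below; the Page and Link methods shown are placeholders rather than the exact WebSPHINX calls.

try {
    Page page = download(link);      // throws on robots exclusion or network errors
    link.setPage(page);              // the parsed page is stored in its Link
    process(page);                   // e.g. save the page to disk
    page.discardContent();           // free the content before fetching the next link
} catch (IOException e) {
    // robots exclusion, timeouts, DNS failures, etc.: skip and take the next link
}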
3.3.3 Process and Expand methods
The expand method retrieves the links from the page, and several tests are performed on each link before it is submitted to a fetch queue. The worm first checks whether it has already seen the link, then whether it should visit the link at all, and finally whether the depth of the link is too great. When a link passes all the tests, it is submitted to the appropriate queue.
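A simplified sketch of these tests is shown below; visited, shouldVisit, getMaxDepth, and submit stand in for the actual methods.

// Sketch of the tests run on each extracted link before it is queued.
void expand(Page page) {
    Link[] links = page.getLinks();
    for (int i = 0; i < links.length; i++) {
        Link link = links[i];
        if (visited(link)) continue;                    // already seen this URL
        if (!shouldVisit(link)) continue;               // wrong host, protocol, or file type
        if (link.getDepth() > getMaxDepth()) continue;  // too far from the root
        submit(link);                                   // put it on the appropriate fetch queue
    }
}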
3.3.4 Visited and ShouldVisit methods
We keep track of the URLs of visited pages using the Java TreeSet class. Storing the URL strings themselves would require much more memory, so we store the CRC32 value of each URL instead. Testing whether a link should be visited is not complicated. Since the crawler is designed to visit only UW and its family of web sites, it first checks whether the link's host ends with washington.edu. The crawler can be pointed at a different host family using the setHostRoot method, so if the crawler is intended to visit Stanford University and its family of websites, call setHostRoot("stanford.edu") before calling the run method. Also, we are interested only in HTTP, so we check that the link uses the HTTP protocol and the standard HTTP port 80. We check port numbers because web pages served on ports other than 80 seem to have special purposes, and it is better not to visit them. The crawler then checks the file extension of the link. Since we do not process images, we do not visit image files. If the file extension is one used by dynamic pages, such as php, asp, or jsp, the link is not visited. URLs that contain the word "query" are also treated as dynamic pages and are not visited (e.g. www.washington.edu/something.html?query='something'). There are also many other file extensions that indicate a file is not text, and URLs with those extensions should not be visited.
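The sketch below illustrates both ideas, the CRC32-based visited set and the shouldVisit checks; the UrlFilter class, its method names, and the exact extension list are illustrative, not the project's actual code.

import java.net.URL;
import java.util.TreeSet;
import java.util.zip.CRC32;

public class UrlFilter {
    private final TreeSet visited = new TreeSet();   // stores CRC32 values, not URL strings
    private String hostRoot = "washington.edu";

    // Record a URL as visited; returns false if it was already seen.
    public synchronized boolean markVisited(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes());
        return visited.add(new Long(crc.getValue()));
    }

    public boolean shouldVisit(URL url) {
        if (!url.getHost().endsWith(hostRoot)) return false;          // stay inside UW sites
        if (!"http".equals(url.getProtocol())) return false;          // HTTP only
        int port = url.getPort();
        if (port != -1 && port != 80) return false;                   // standard port only
        String path = url.getPath().toLowerCase();
        if (path.endsWith(".gif") || path.endsWith(".jpg")) return false;   // skip images
        if (path.endsWith(".php") || path.endsWith(".asp") || path.endsWith(".jsp"))
            return false;                                             // skip dynamic pages
        if (url.toString().indexOf("query") != -1) return false;      // query URLs look dynamic
        return true;
    }
}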
4. How to build and use Polite Crawler
4.1 Files
We mainly modified the WebSPHINX crawler, i.e. the websphinx/Crawler.java file. PoliteCrawler.java shows how to use and run the crawler. There are a number of other files that WebSPHINX needs in order to run the crawler; we have eliminated all WebSPHINX files except these. We recommend reading the WebSPHINX API documentation to understand what each file is for (most of them are related to the HTML parser).
4.2 Setting parameters of the crawler
Before running the crawler, you should set the DownloadParameters and other parameters of the crawler. For instance, you must set the root URL of the crawler, which can be done with the crawler.setRoot method. You must also change the user agent field of the crawler by setting DownloadParameters (dp.changeUserAgent(your email address)). It is important that you use your email address as the user agent field: if your crawler offends some webmasters, their complaints can then come directly to you rather than to a public forum. If you do not change the user agent field, the crawler does not work. If you want to see how your crawler is working, use the EventLog class. EventLog reports the current activities of your crawler: the pages it downloaded, the I/O exceptions it threw, and so on. To use EventLog, add the following code before calling run():
EventLog stats = new EventLog();
stats.monitor(crawler);
Read PoliteCrawler.java to see other useful parameters of the crawler.
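Putting the pieces together, a typical setup might look like the following sketch. setRoot, setHostRoot, changeUserAgent, and EventLog are the names used in this paper, but the exact signatures (for example, whether setRoot takes a Link or a String, and how DownloadParameters is obtained and stored) are assumptions; check PoliteCrawler.java for the calls actually used.

// Hypothetical end-to-end setup; getter/setter names are assumed, not verified.
public static void main(String[] args) throws Exception {
    Crawler crawler = new Crawler();
    crawler.setRoot(new Link("http://www.washington.edu/"));  // root URL (example)
    crawler.setHostRoot("washington.edu");                    // stay inside UW sites

    DownloadParameters dp = crawler.getDownloadParameters();
    dp = dp.changeUserAgent("you@example.com");               // your email, or the crawler refuses to run
    crawler.setDownloadParameters(dp);

    EventLog stats = new EventLog();                          // optional: monitor crawler activity
    stats.monitor(crawler);

    crawler.run();
}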
4.3 Building and running Polite Crawler
You may use PoliteCrawler.java or create your own class that uses the Crawler class. To build Polite Crawler, type
javac PoliteCrawler.java
If you modify websphinx/Crawler.java, typing the above line will rebuild your crawler. To run Polite Crawler, type
java PoliteCrawler
Although we have tried several techniques to solve the out-of-memory problem in WebSPHINX, the current version of Polite Crawler still suffers from it. Thus you should increase the maximum heap size to at least 256MB (using the -Xmx option). For example,
java -Xmx256m PoliteCrawler
sets the maximum heap size to 256MB. With a 256MB heap and no checkpointing, Polite Crawler visits approximately 250,000 pages.
We have built and tested Polite Crawler with Java 1.3.1, so it is important that you have Java 1.3.1 or higher installed. Note: you cannot build Polite Crawler with Visual J++.
4.4 Debugging the Crawler
There are numerous tools for debugging Java, and we used a tool called OptimizeIt. This tool reports memory usage by class, which helped us understand the causes of the out-of-memory problem. It has other features that are very useful when debugging the crawler. One of them, Code Coverage, shows the number of times each part of the code is executed; this helped us find a deadlock in our code.