Polite Crawler

 

Paul Han & Jae-Hoon Sul

April 23rd, 2002

 

Abstract

 

This paper describes Polite Crawler, a web crawler designed to behave politely. It is based on WebSPHINX, a simple crawler written in Java. The paper also describes the major components of Polite Crawler. Finally, we discuss our approach to implementing "politeness" and to solving the various problems we encountered, such as memory leaks in the crawler.

 

1. Introduction

            The first thing we learned when implementing our crawler was the proper etiquette of a web crawler. Since almost all webmasters do not like web crawlers to overuse their resources (e.g., downloading all of their pages in seconds), it is crucial that our crawler behaves politely. The meaning of "politeness" is discussed in section 2.

            Since WebSPHINX is written in Java, our crawler is also written entirely in Java. Although WebSPHINX provides a number of useful features, such as support for multiple threads and the robots exclusion standard, it also has a serious problem: memory leaks. The original version of WebSPHINX can only visit a few hundred pages because of this problem. We used various techniques to address it, which will be described in section 3.4.

            In this paper, we assume that the reader is familiar with Java and knows how web crawlers generally work. Please read the Mercator paper if you want to learn more about web crawlers.

 

2. Politeness

The specific meaning of "politeness" includes the following aspects:

-        Space out requests to any given server and do not request the same content over and over again.

-        Never hit any given site more than a few times a minute (our crawler visits the same server at most 10 times per minute).

-        Avoid parts of a server that are restricted (robots exclusion protocol).

-        Don't request multiple copies of the same information from a server.

-        Don't request information we don't need.

-        Only visit URLs that specify the HTTP protocol.

-        Don't visit dynamic web pages (cgi, asp, php, etc.).

-        Use our email address as the user-agent field of the crawler.

The complete description of "politeness", however, should not be restricted to the above list, as there may be (and probably are) other aspects that cause webmasters to be unsatisfied with how crawlers behave. In fact, we added the last three rules only after we had run our crawler a number of times. Thus it is important to know how your crawler behaves before running it for real.
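
To make the first two rules concrete, the sketch below shows one way to space out requests to a host (at most 10 requests per minute, i.e. at least 6 seconds apart). The class and method names are illustrative only and are not part of the actual Polite Crawler code; the sketch is also written against a modern JDK rather than the Java 1.3.1 environment used for the crawler itself.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch, not actual Polite Crawler code: enforce at least
    // MIN_DELAY_MS between two requests to the same host.
    public class HostRateLimiter {
        private static final long MIN_DELAY_MS = 6000;  // 10 requests per minute
        private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

        // Returns how many milliseconds the caller should wait before
        // contacting this host, and reserves that slot.
        public synchronized long reserveSlot(String host) {
            long now = System.currentTimeMillis();
            Long earliest = nextAllowed.get(host);
            long start = (earliest == null) ? now : Math.max(now, earliest);
            nextAllowed.put(host, start + MIN_DELAY_MS);
            return start - now;   // 0 means the host may be contacted immediately
        }
    }

A crawler thread would then call Thread.sleep(limiter.reserveSlot(host)) before issuing its request.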

3. Architecture of Polite Crawler

 

3.1 Overview

            The diagram below shows how Polite Crawler works.

            [Diagram: how Polite Crawler works]

            There are a number of components in Polite Crawler, and each is described in detail below.

 

3.2 Crawler (run())

When the run method is called, the crawler creates and initializes all threads (or worms) and fetch queues, and submits a root URL to start crawling from.  It then enters an infinite loop that checks whether crawling is finished.  Each time the threads visit 1,000 more pages, the crawler saves the current statistics to disk. It stores 10 statistics blocks per file, so the statistics for the first 10,000 pages are saved in stats0.txt, those for the next 10,000 pages in stats1.txt, and so on.  Since the crawler can be stopped by problems such as out-of-memory errors and unexpected exceptions, it also periodically saves its state to disk, so that the next time the crawler is run it can continue crawling from the saved information rather than starting from the beginning (you must do this!).
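
The sketch below illustrates this control loop. The field and helper names are placeholders rather than the real WebSPHINX/Polite Crawler API; the fields are assumed to be updated by the worm threads.

    import java.io.FileWriter;
    import java.io.IOException;

    // Illustrative sketch of the run() control loop described above.
    public class CrawlLoopSketch {
        private volatile boolean done = false;   // set when the fetch queues drain
        private volatile int pagesVisited = 0;   // incremented by the worms

        public void run() throws IOException, InterruptedException {
            int lastSaved = 0;
            while (!done) {
                if (pagesVisited - lastSaved >= 1000) {
                    // 10 blocks of 1,000 pages per file: stats0.txt covers the
                    // first 10,000 pages, stats1.txt the next 10,000, and so on.
                    FileWriter out = new FileWriter("stats" + (pagesVisited / 10000) + ".txt", true);
                    out.write(pagesVisited + " pages visited\n");
                    out.close();
                    lastSaved = pagesVisited;
                    // Checkpoint so a crash (out-of-memory error, unexpected
                    // exception) does not force the crawl to restart from scratch.
                    saveCheckpoint();
                }
                Thread.sleep(1000);
            }
        }

        private void saveCheckpoint() { /* serialize queues and visited set to disk */ }
    }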

 

3.3 Worm

            3.3.1 Fetch method

            Worm is a class that extends the Java Thread class.  The Crawler creates the worms and calls their run methods.  Each worm calls the fetch method, which contains an infinite loop.  In fetch(), each worm gets a link to crawl from the fetch queue, saves the downloaded page to disk, and expands the links found on the page.

Once the worm gets a link to crawl, a timer is set and the worm tries to download the page.  If the timeout period passes before the page has been downloaded, the worm dies and a new worm is started to continue crawling.
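
The skeleton below sketches that loop. The Worm name matches the description above, but the queue type and helper methods are stand-ins for the corresponding WebSPHINX code, and the sketch targets a modern JDK.

    import java.util.concurrent.BlockingQueue;

    // Hypothetical skeleton of a worm's fetch loop; the helpers are stand-ins.
    class WormSketch extends Thread {
        private static final long TIMEOUT_MS = 30000;   // assumed timeout value
        private final BlockingQueue<String> fetchQueue;

        WormSketch(BlockingQueue<String> fetchQueue) { this.fetchQueue = fetchQueue; }

        public void run() {
            while (true) {
                try {
                    String url = fetchQueue.take();            // next link to crawl
                    String page = download(url, TIMEOUT_MS);   // may throw or time out
                    savePageToDisk(url, page);
                    expandLinks(url, page);                    // submit new links (section 3.3.3)
                } catch (Exception e) {
                    // Timeouts, robot exclusion, network errors: skip to the next link.
                    // In the real crawler a timed-out worm dies and a new worm replaces it.
                }
            }
        }

        private String download(String url, long timeoutMs) throws Exception { return ""; }
        private void savePageToDisk(String url, String page) { /* write to disk */ }
        private void expandLinks(String url, String page) { /* see section 3.3.3 */ }
    }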

            3.3.2 Downloading pages

            There are a number of exceptions that can occur while downloading pages.  One of them is robot exclusion: some pages must not be crawled by robots, so such a page is not crawled and an exception is thrown.  In addition, various network-related exceptions can occur as well.  All of them are caught, and the worm simply gets the next link to crawl.  When the worm successfully downloads a page, the page is parsed and stored in the Link class. Then the process method is called to process the page.  When processing is done, the content of the page is discarded and the worm goes back to the beginning of the loop to fetch a new link to crawl.
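
As a rough illustration of the failures a worm has to absorb, the snippet below downloads a page with plain java.net and swallows the usual error cases. This is not the actual WebSPHINX downloader; in WebSPHINX, a page disallowed by robots.txt also surfaces as an exception, as described above.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.SocketTimeoutException;
    import java.net.URL;

    // Rough illustration only: plain java.net instead of the WebSPHINX downloader.
    public class DownloadSketch {
        public static String tryDownload(String url) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(10000);
                conn.setReadTimeout(10000);
                InputStream in = conn.getInputStream();
                byte[] body = in.readAllBytes();
                in.close();
                return new String(body, "UTF-8");
            } catch (SocketTimeoutException e) {
                return null;   // slow server: give up and move on to the next link
            } catch (IOException e) {
                return null;   // DNS failures, dropped connections, HTTP errors, ...
            }
        }
    }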

            3.3.3 Process and Expand methods

The Expand method retrieves the links from the page, and several tests are performed on each link before it is submitted to the fetch queues.  The worm first checks whether it has already seen the link.  Then it checks whether it should visit the link.  Finally, it checks whether the link is too deep.  When a link passes all the tests, it is submitted to the appropriate queue.
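
A compact, modern-JDK sketch of these tests is shown below; the helper names and the depth limit are hypothetical stand-ins for the corresponding Polite Crawler logic.

    import java.net.URL;
    import java.util.List;
    import java.util.Queue;

    // Sketch of the link-filtering steps described above.
    class ExpandSketch {
        static final int MAX_DEPTH = 10;   // assumed depth limit

        void expand(List<URL> linksOnPage, int pageDepth, Queue<URL> fetchQueue) {
            for (URL link : linksOnPage) {
                if (visited(link)) continue;              // already seen this URL
                if (!shouldVisit(link)) continue;         // fails the scope/politeness tests
                if (pageDepth + 1 > MAX_DEPTH) continue;  // too deep in the site
                fetchQueue.add(link);                     // passed all tests: enqueue it
            }
        }

        boolean visited(URL link) { return false; /* see section 3.3.4 */ }
        boolean shouldVisit(URL link) { return true; /* see section 3.3.4 */ }
    }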

            3.3.4 Visited and ShouldVisit methods

            We keep information about the URLs of visited pages in a Java TreeSet.  Storing the URL strings themselves would require much more memory, so we store a CRC32 value of each URL instead.  Testing whether a link should be visited is not complicated.  Since the crawler is designed to visit only UW and its family of web sites, it first checks whether the link's host ends with washington.edu.  The crawler can visit a different host by using the setHostRoot method; for example, if the crawler is intended to visit Stanford University and its family of web sites, call setHostRoot("stanford.edu") before calling the run method. Also, we are interested only in HTTP, so we check that the link uses the HTTP protocol and the standard HTTP port 80.  The reason we check port numbers is that web pages served on ports other than 80 seem to have special purposes, and it is better not to visit them.  The crawler then checks the file extension of the link.  Since we do not process images, we do not visit image files.  If the file extension is that of a dynamic page, such as php, asp, or jsp, the link is not visited.  If a URL contains the word "query", it is also considered a dynamic page and is not visited (e.g., www.washington.edu/something.html?query='something').  There are also many other file extensions that indicate a file is not text, and URLs with those extensions should not be visited.
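
The sketch below pulls the visited-set and shouldVisit logic together. The class and method names are illustrative (the real code lives in websphinx/Crawler.java), the file-extension list is abbreviated, and the sketch is written against a modern JDK.

    import java.net.URL;
    import java.util.TreeSet;
    import java.util.zip.CRC32;

    // Illustrative sketch of the Visited/ShouldVisit logic described above.
    public class VisitPolicySketch {
        private final TreeSet<Long> visited = new TreeSet<Long>();
        private String hostRoot = "washington.edu";   // changeable via setHostRoot

        public void setHostRoot(String hostRoot) { this.hostRoot = hostRoot; }

        // Store a CRC32 checksum of the URL instead of the full string to save memory.
        public synchronized boolean markVisited(URL url) {
            CRC32 crc = new CRC32();
            crc.update(url.toString().getBytes());
            return visited.add(crc.getValue());   // false if already seen
        }

        public boolean shouldVisit(URL url) {
            if (!url.getHost().endsWith(hostRoot)) return false;            // stay inside UW
            if (!"http".equalsIgnoreCase(url.getProtocol())) return false;  // HTTP only
            if (url.getPort() != -1 && url.getPort() != 80) return false;   // standard port only
            String path = url.getPath().toLowerCase();
            // Skip images and dynamic pages (abbreviated extension list).
            if (path.endsWith(".gif") || path.endsWith(".jpg") || path.endsWith(".png")) return false;
            if (path.endsWith(".php") || path.endsWith(".asp") || path.endsWith(".jsp")) return false;
            // Query strings, or URLs containing the word "query", count as dynamic pages.
            if (url.getQuery() != null || url.toString().contains("query")) return false;
            return true;
        }
    }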

 

4. How to build and use Polite Crawler

            4.1 files

            We mainly modified the WebSPHINX crawler, which is the websphinx/Crawler.java file. PoliteCrawler.java shows how to use and run the crawler. There are a number of other files that WebSPHINX needs in order to run the crawler; we have eliminated all WebSPHINX files except these. We recommend reading the WebSPHINX API documentation to understand what each file is for (most of them are related to the HTML parser).

            4.2 Setting parameters of the crawler

            Before running the crawler, you should set DownloadParameters and the other parameters of the crawler. For instance, you must set the root URL of the crawler, which can be done with the crawler.setRoot method. You must also change the user-agent field of the crawler by setting DownloadParameters (dp.changeUserAgent(your email address)). It is important that you use your email address as the user-agent field: if your crawler offends some webmasters, their complaints can then come directly to you rather than to a public forum. If you do not change the user-agent field, the crawler does not work. If you want to see how your crawler is doing, use the EventLog class. EventLog reports the crawler's current activity: the pages it has downloaded, the I/O exceptions it has thrown, and so on. To use EventLog, add the following code before calling run():

                        EventLog stats = new EventLog();

                        stats.monitor(crawler);

            Read PoliteCrawler.java to see other useful parameters of the crawler.
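
Putting the steps in this subsection together, a minimal driver might look like the sketch below. It follows the method names used above (setRoot, changeUserAgent, EventLog.monitor, run), but treat the exact signatures as approximate; the authoritative versions are in PoliteCrawler.java and the WebSPHINX API documentation.

    import websphinx.Crawler;
    import websphinx.DownloadParameters;
    import websphinx.EventLog;
    import websphinx.Link;

    // Hypothetical minimal driver, assembled from the steps described above.
    public class PoliteCrawlerMain {
        public static void main(String[] args) throws Exception {
            Crawler crawler = new Crawler();

            // Crawl the UW web; use setHostRoot to point the crawler elsewhere.
            crawler.setRoot(new Link("http://www.washington.edu/"));

            // Identify yourself: complaints should reach you, not a public forum.
            DownloadParameters dp = crawler.getDownloadParameters();
            dp = dp.changeUserAgent("your-email@example.com");
            crawler.setDownloadParameters(dp);

            // Optional: log downloads, exceptions, and other crawler activity.
            EventLog stats = new EventLog();
            stats.monitor(crawler);

            crawler.run();
        }
    }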

            4.3 Building and running Polite Crawler

             You may use PoliteCrawler.java or create your own class that uses the Crawler class. To build Polite Crawler, type

                        javac PoliteCrawler.java

            If you modify websphinx/Crawler.java, typing the line above will rebuild your crawler. To run Polite Crawler, type

                        java PoliteCrawler

            Although we have tried several techniques to solve the out-of-memory problem in WebSPHINX, the current version of Polite Crawler still has this problem. Thus you should increase the maximum heap size to at least 256 MB (using the -Xmx option). For example,

                        java -Xmx256m PoliteCrawler

sets the maximum heap size to 256 MB. Without checkpointing, Polite Crawler visits approximately 250,000 pages when the maximum heap size is 256 MB.

            We have built and tested Polite Crawler with Java 1.3.1, so it is important that you have installed Java 1.3.1 or higher. Note: you cannot build Polite Crawler with Visual J++.

            4.4 Debugging the Crawler

            There are numerous tools for debugging Java, and we used a tool called OptimizeIt. This tool reports memory usage by class, which helped us understand the causes of the out-of-memory problem. It has other features that are very useful when debugging a crawler. One of them, called Code Coverage, shows the number of times each part of the code is executed. This tool helped us find a deadlock in our code.