websphinx
Class Crawler

java.lang.Object
  |
  +--websphinx.Crawler

public class Crawler
extends java.lang.Object

Web crawler.

To write a crawler, extend this class and override shouldVisit() to create your own crawler. You can also modify this file directly.

To use a crawler:

  1. Initialize the crawler by calling setRoot() (or one of its variants) and setting other crawl parameters.
  2. Connect event listeners to monitor the crawler, such as websphinx.EventLog, websphinx.workbench.WebGraph, or websphinx.workbench.Statistics.
  3. Call run() to start the crawler.
A running crawler consists of queues of Links waiting to be visited and a set of threads retrieving pages in parallel. When a page is downloaded, it is passed to the crawler's expand() method to be expanded.
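The steps above can be sketched as follows. This is a minimal sketch: the Link(String) and EventLog(OutputStream) constructors are assumptions from elsewhere in the websphinx package, not documented on this page.

```java
import websphinx.Crawler;
import websphinx.EventLog;
import websphinx.Link;

public class CrawlExample {
    public static void main(String[] args) throws Exception {
        Crawler crawler = new Crawler();
        // 1. Initialize the crawler with a root link and crawl parameters.
        crawler.setRoot(new Link("http://www.washington.edu/")); // Link(String) assumed
        crawler.setMaxDepth(2);
        // 2. Connect an event listener to monitor the crawler.
        crawler.addCrawlListener(new EventLog(System.out)); // EventLog(OutputStream) assumed
        // 3. Start the crawler; run() returns when the crawl is done
        //    or when stop() is called.
        crawler.run();
        crawler.printStatus(System.out);
    }
}
```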


Constructor Summary
Crawler()
          Make a new Crawler.
 
Method Summary
 void addCrawlListener(websphinx.CrawlListener listen)
          Adds a listener to the set of CrawlListeners for this crawler.
 void addLinkListener(websphinx.LinkListener listen)
          Adds a listener to the set of LinkListeners for this crawler.
protected  void clearVisited()
          Clear the set of visited links.
 void expand(websphinx.Page page)
          Expand the crawl from a page.
 int getActiveThreads()
          Get number of threads currently working.
 websphinx.DownloadParameters getDownloadParameters()
          Get download parameters (such as number of threads, timeouts, maximum page size, etc.)
 boolean getIgnoreVisitedLinks()
          Get ignore-visited-links flag.
 int getMaxDepth()
          Get maximum depth.
 java.lang.String getName()
          Get human-readable name of crawler.
 int getPagesLeft()
          Get number of pages left to be visited.
 int getPagesVisited()
          Get number of pages visited.
 websphinx.Link getRoot()
          Get the starting point of the crawl.
protected  void markVisited(websphinx.Link link)
          Register the CRC32 value of the link's URL as visited.
 void printStatus(java.io.PrintStream out)
          Print the current status of the crawl.
 void run()
          Start crawling.
protected  void sendLinkEvent(websphinx.Link l, int id)
          Send a LinkEvent to all LinkListeners registered with this crawler.
protected  void sendLinkEvent(websphinx.Link l, int id, java.lang.Throwable exception)
          Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
 void setDownloadParameters(websphinx.DownloadParameters dp)
          Set download parameters (such as number of threads, timeouts, maximum page size, etc.)
 void setHostRoot(java.lang.String name)
          Set the host name of the root so that the crawler visits only web sites in the root's host family.
 void setIgnoreVisitedLinks(boolean f)
          Set ignore-visited-links flag.
 void setMaxDepth(int maxDepth)
          Set maximum depth.
 void setName(java.lang.String name)
          Set human-readable name of crawler.
 void setRoot(websphinx.Link link)
          Set starting point of crawl as a single Link.
 boolean shouldVisit(websphinx.Link l)
          Callback for testing whether a link should be traversed.
 void stop()
          Stop crawling.
 void submit(websphinx.Link link)
          Puts a link into the crawling queue.
 java.lang.String toString()
          Convert the crawler to a String.
 boolean visited(websphinx.Link link)
          Test whether the page corresponding to a link has been visited (or queued for visiting).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Crawler

public Crawler()
Make a new Crawler.

Method Detail

printStatus

public void printStatus(java.io.PrintStream out)
Print the current status of the crawl.

Parameters:
out - PrintStream to which the status is printed

run

public void run()
         throws java.lang.Exception
Start crawling. Returns either when the crawl is done, or when stop() is called.

Throws:
java.lang.Exception

stop

public void stop()
Stop crawling.


shouldVisit

public boolean shouldVisit(websphinx.Link l)
Callback for testing whether a link should be traversed. This implementation is modified so that the crawler visits only Washington web sites and HTTP pages; it also ignores various file types that are not HTML.

Parameters:
l - Link encountered by the crawler
Returns:
true if link should be followed, false if it should be ignored.
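A sketch of a shouldVisit() override in the spirit described above. The Link.getURL() accessor is an assumption from the websphinx.Link class, not documented on this page.

```java
import websphinx.Crawler;
import websphinx.Link;

public class WashingtonCrawler extends Crawler {
    // Follow only HTTP links on washington.edu hosts,
    // skipping some common non-HTML file types.
    public boolean shouldVisit(Link l) {
        java.net.URL url = l.getURL(); // accessor assumed
        if (!url.getProtocol().equals("http"))
            return false;
        if (!url.getHost().endsWith("washington.edu"))
            return false;
        String file = url.getFile().toLowerCase();
        return !(file.endsWith(".pdf") || file.endsWith(".jpg")
                 || file.endsWith(".gif") || file.endsWith(".zip"));
    }
}
```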

expand

public void expand(websphinx.Page page)
Expand the crawl from a page. The default implementation of this method tests every link on the page using shouldVisit(), and submit()s the links that are approved. A subclass may want to override this method if it is inconvenient to consider the links individually with shouldVisit().

Parameters:
page - Page to expand
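For a subclass that prefers to consider a page's links in bulk, an expand() override might look like the following sketch. Page.getLinks() is an assumption from the websphinx.Page class, not documented on this page.

```java
import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class BulkFilterCrawler extends Crawler {
    // Submit at most the first 10 approved links per page,
    // instead of letting the default expand() submit them all.
    public void expand(Page page) {
        Link[] links = page.getLinks(); // accessor assumed
        if (links == null)
            return;
        int submitted = 0;
        for (int i = 0; i < links.length && submitted < 10; i++) {
            if (shouldVisit(links[i])) {
                submit(links[i]);
                submitted++;
            }
        }
    }
}
```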

getPagesVisited

public int getPagesVisited()
Get number of pages visited.

Returns:
number of pages processed so far in this crawl

getPagesLeft

public int getPagesLeft()
Get number of pages left to be visited.

Returns:
number of links approved by shouldVisit() but not yet visited

getActiveThreads

public int getActiveThreads()
Get number of threads currently working.

Returns:
number of threads downloading pages

getName

public java.lang.String getName()
Get human-readable name of crawler. Default value is the class name, e.g., "Crawler". Useful for identifying the crawler in a user interface; also used as the default User-agent for identifying the crawler to a remote Web server. (The User-agent can be changed independently of the crawler name with setDownloadParameters().)

Returns:
human-readable name of crawler

setName

public void setName(java.lang.String name)
Set human-readable name of crawler.

Parameters:
name - new name for crawler

toString

public java.lang.String toString()
Convert the crawler to a String.

Overrides:
toString in class java.lang.Object
Returns:
Human-readable name of crawler.

getRoot

public websphinx.Link getRoot()
Get the starting point of the crawl.

setRoot

public void setRoot(websphinx.Link link)
Set starting point of crawl as a single Link.

Parameters:
link - starting point

setHostRoot

public void setHostRoot(java.lang.String name)
Set the host name of the root so that the crawler visits only web sites in the root's host family.

Parameters:
name - the root's host name (e.g., washington.edu)

getIgnoreVisitedLinks

public boolean getIgnoreVisitedLinks()
Get ignore-visited-links flag. Default value is true.

Returns:
true if search skips links whose URLs have already been visited (or queued for visiting).

setIgnoreVisitedLinks

public void setIgnoreVisitedLinks(boolean f)
Set ignore-visited-links flag.

Parameters:
f - true if search skips links whose URLs have already been visited (or queued for visiting).

getMaxDepth

public int getMaxDepth()
Get maximum depth. Default value is integer maximum.

Returns:
maximum depth of crawl, in hops from starting point.

setMaxDepth

public void setMaxDepth(int maxDepth)
Set maximum depth.

Parameters:
maxDepth - maximum depth of crawl, in hops from starting point

getDownloadParameters

public websphinx.DownloadParameters getDownloadParameters()
Get download parameters (such as number of threads, timeouts, maximum page size, etc.)

Returns:
download parameters for this crawler


setDownloadParameters

public void setDownloadParameters(websphinx.DownloadParameters dp)
Set download parameters (such as number of threads, timeouts, maximum page size, etc.)

Parameters:
dp - Download parameters

addCrawlListener

public void addCrawlListener(websphinx.CrawlListener listen)
Adds a listener to the set of CrawlListeners for this crawler. If the listener is already found in the set, does nothing.

Parameters:
listen - a listener

addLinkListener

public void addLinkListener(websphinx.LinkListener listen)
Adds a listener to the set of LinkListeners for this crawler. If the listener is already found in the set, does nothing.

Parameters:
listen - a listener

submit

public void submit(websphinx.Link link)
Puts a link into the crawling queue. If the crawler is running, the link will eventually be retrieved.

Parameters:
link - Link to put in queue

sendLinkEvent

protected void sendLinkEvent(websphinx.Link l,
                             int id)
Send a LinkEvent to all LinkListeners registered with this crawler.

Parameters:
l - Link related to event
id - Event id

sendLinkEvent

protected void sendLinkEvent(websphinx.Link l,
                             int id,
                             java.lang.Throwable exception)
Send an exceptional LinkEvent to all LinkListeners registered with this crawler.

Parameters:
l - Link related to event
id - Event id
exception - Exception associated with event

visited

public boolean visited(websphinx.Link link)
Test whether the page corresponding to a link has been visited (or queued for visiting). The test is based on the CRC32 value of the link's URL.

Parameters:
link - Link to test
Returns:
true if the link's URL has been visited (or queued for visiting) during this crawl
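The CRC32-based bookkeeping behind visited() and markVisited() can be illustrated with java.util.zip.CRC32 from the standard library. This is a minimal stand-alone sketch of the idea, not the actual websphinx implementation; note that hashing URLs to 32-bit values means distinct URLs can, rarely, collide.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

class VisitedSet {
    // Stores CRC32 values of visited URLs, not the URL strings themselves.
    private final Set<Long> visited = new HashSet<Long>();

    // Hash a URL string to its CRC32 value.
    static long urlCrc(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes());
        return crc.getValue();
    }

    // Register the URL's CRC32 value as visited (cf. markVisited).
    void markVisited(String url) {
        visited.add(urlCrc(url));
    }

    // Test membership by CRC32 value (cf. visited); distinct URLs
    // can collide, so a rare false positive is possible.
    boolean visited(String url) {
        return visited.contains(urlCrc(url));
    }
}
```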

markVisited

protected void markVisited(websphinx.Link link)
Register the CRC32 value of the link's URL as visited.

Parameters:
link - Link that has been visited

clearVisited

protected void clearVisited()
Clear the set of visited links.