|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: INNER | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object | +--websphinx.Crawler
Web crawler.
To write a crawler, extend this class and override shouldVisit () and visit() to create your own crawler.
To use a crawler:
Field Summary | |
private Action |
action
|
static java.lang.String[] |
ALL_LINKS
Specify ALL_LINKS as the link type to allow the crawler to visit any kind of link. |
private java.util.Vector |
classifiers
|
private Link[] |
crawledRoots
|
private java.util.Vector |
crawlListeners
|
private PriorityQueue |
crawlQueue
|
private boolean |
depthFirst
|
private java.lang.String[] |
domain
|
private DownloadParameters |
dp
|
private PriorityQueue |
fetchQueue
|
static java.lang.String[] |
HYPERLINKS
Specify HYPERLINKS as the link type to allow the crawler to visit only hyperlinks (A, AREA, and FRAME tags which point to http:, ftp:, file:, or gopher: URLs). |
static java.lang.String[] |
HYPERLINKS_AND_IMAGES
Specify HYPERLINKS_AND_IMAGES as the link type to allow the crawler to visit only hyperlinks and inline images. |
private boolean |
ignoreVisitedLinks
|
private java.util.Vector |
linkListeners
|
private LinkPredicate |
linkPredicate
|
private int |
maxDepth
|
private java.lang.String |
name
|
private int |
numLinksTested
|
private int |
numPagesLeft
|
private int |
numPagesVisited
|
private PagePredicate |
pagePredicate
|
private RobotExclusion |
robotExclusion
|
private java.lang.String[] |
rootHrefs
|
private Link[] |
roots
|
private static long |
serialVersionUID
|
static java.lang.String[] |
SERVER
Specify SERVER as the crawl domain to limit the crawler to visit only pages on the same Web server (hostname and port number) as the root link from which it started. |
private int |
state
|
static java.lang.String[] |
SUBTREE
Specify SUBTREE as the crawl domain to limit the crawler to visit only pages which are descendants of the root link from which it started. |
private boolean |
synchronous
|
private java.lang.String[] |
type
|
private java.util.Hashtable |
visitedPages
|
static java.lang.String[] |
WEB
Specify WEB as the crawl domain to allow the crawler to visit any page on the World Wide Web. |
private Worm[] |
worms
|
Constructor Summary | |
Crawler()
Make a new Crawler. |
Method Summary | |
void |
addClassifier(Classifier c)
Adds a classifier to this crawler. |
void |
addCrawlListener(CrawlListener listen)
Adds a listener to the set of CrawlListeners for this crawler. |
void |
addLinkListener(LinkListener listen)
Adds a listener to the set of LinkListeners for this crawler. |
void |
addRoot(Link link)
Add a root to the existing set of roots. |
void |
clear()
Initialize the crawler for a fresh crawl. |
protected void |
clearVisited()
Clear the set of visited links. |
java.util.Enumeration |
enumerateClassifiers()
Enumerates the set of classifiers. |
java.util.Enumeration |
enumerateQueue()
Enumerate crawling queue. |
void |
expand(Page page)
Expand the crawl from a page. |
(package private) void |
fetch(Worm w)
|
(package private) void |
fetchTimedOut(Worm w,
int interval)
|
Action |
getAction()
Get action. |
int |
getActiveThreads()
Get number of threads currently working. |
Classifier[] |
getClassifiers()
Get the set of classifiers. |
Link[] |
getCrawledRoots()
Get roots of last crawl. |
boolean |
getDepthFirst()
Get depth-first search flag. |
java.lang.String[] |
getDomain()
Get crawl domain. |
DownloadParameters |
getDownloadParameters()
Get download parameters (such as number of threads, timeouts, maximum page size, etc.) |
boolean |
getIgnoreVisitedLinks()
Get ignore-visited-links flag. |
LinkPredicate |
getLinkPredicate()
Get link predicate. |
int |
getLinksTested()
Get number of links tested. |
java.lang.String[] |
getLinkType()
Get legal link types to crawl. |
int |
getMaxDepth()
Get maximum depth. |
java.lang.String |
getName()
Get human-readable name of crawler. |
PagePredicate |
getPagePredicate()
Get page predicate. |
int |
getPagesLeft()
Get number of pages left to be visited. |
int |
getPagesVisited()
Get number of pages visited. |
java.lang.String |
getRootHrefs()
Get starting points of crawl as a String of newline-delimited URLs. |
Link[] |
getRoots()
Get starting points of crawl as an array of Link objects. |
int |
getState()
Get state of crawler. |
boolean |
getSynchronous()
Get synchronous flag. |
private void |
init()
|
static void |
main(java.lang.String[] args)
|
protected void |
markVisited(Link link)
Register that a link has been visited. |
void |
pause()
Pause the crawl in progress. |
(package private) void |
process(Link link)
|
private void |
readObject(java.io.ObjectInputStream in)
|
void |
removeAllClassifiers()
Clears the set of classifiers. |
void |
removeClassifier(Classifier c)
Removes a classifier from the set of classifiers. |
void |
removeCrawlListener(CrawlListener listen)
Removes a listener from the set of CrawlListeners. |
void |
removeLinkListener(LinkListener listen)
Removes a listener from the set of LinkListeners. |
void |
run()
Start crawling. |
protected void |
sendCrawlEvent(int id)
Send a CrawlEvent to all CrawlListeners registered with this crawler. |
protected void |
sendLinkEvent(Link l,
int id)
Send a LinkEvent to all LinkListeners registered with this crawler. |
protected void |
sendLinkEvent(Link l,
int id,
java.lang.Throwable exception)
Send an exceptional LinkEvent to all LinkListeners registered with this crawler. |
void |
setAction(Action act)
Set the action. |
void |
setDepthFirst(boolean useDFS)
Set depth-first search flag. |
void |
setDomain(java.lang.String[] domain)
Set crawl domain. |
void |
setDownloadParameters(DownloadParameters dp)
Set download parameters (such as number of threads, timeouts, maximum page size, etc.) |
void |
setIgnoreVisitedLinks(boolean f)
Set ignore-visited-links flag. |
void |
setLinkPredicate(LinkPredicate pred)
Set link predicate. |
void |
setLinkType(java.lang.String[] type)
Set legal link types to crawl. |
void |
setMaxDepth(int maxDepth)
Set maximum depth. |
void |
setName(java.lang.String name)
Set human-readable name of crawler. |
void |
setPagePredicate(PagePredicate pred)
Set page predicate. |
void |
setRoot(Link link)
Set starting point of crawl as a single Link. |
void |
setRootHrefs(java.lang.String hrefs)
Set starting points of crawl as a string of whitespace-delimited URLs. |
void |
setRoots(Link[] links)
Set starting points of crawl as an array of Links. |
void |
setSynchronous(boolean f)
Set synchronous flag. |
boolean |
shouldVisit(Link l)
Callback for testing whether a link should be traversed. |
void |
stop()
Stop the crawl in progress. |
void |
submit(Link link)
Puts a link into the crawling queue. |
void |
submit(Link[] links)
Submit an array of Links for crawling. |
(package private) void |
timedOut()
|
java.lang.String |
toString()
Convert the crawler to a String. |
private static java.lang.String[] |
useStandard(java.lang.String[] standard,
java.lang.String[] s)
|
void |
visit(Page page)
Callback for visiting a page. |
boolean |
visited(Link link)
Test whether the page corresponding to a link has been visited (or queued for visiting). |
private void |
writeObject(java.io.ObjectOutputStream out)
|
Methods inherited from class java.lang.Object |
|
Field Detail |
private static final long serialVersionUID
public static final java.lang.String[] WEB
public static final java.lang.String[] SERVER
public static final java.lang.String[] SUBTREE
public static final java.lang.String[] HYPERLINKS
public static final java.lang.String[] HYPERLINKS_AND_IMAGES
public static final java.lang.String[] ALL_LINKS
private java.lang.String name
private transient Link[] roots
private java.lang.String[] rootHrefs
private java.lang.String[] domain
private boolean synchronous
private boolean depthFirst
private java.lang.String[] type
private boolean ignoreVisitedLinks
private int maxDepth
private DownloadParameters dp
private java.util.Vector classifiers
private LinkPredicate linkPredicate
private PagePredicate pagePredicate
private Action action
private transient Link[] crawledRoots
private transient int state
private transient Worm[] worms
private transient PriorityQueue fetchQueue
private transient PriorityQueue crawlQueue
private transient int numLinksTested
private transient int numPagesVisited
private transient int numPagesLeft
private transient java.util.Vector crawlListeners
private transient java.util.Vector linkListeners
private transient java.util.Hashtable visitedPages
private transient RobotExclusion robotExclusion
Constructor Detail |
public Crawler()
Method Detail |
private void init()
private void writeObject(java.io.ObjectOutputStream out) throws java.io.IOException
private void readObject(java.io.ObjectInputStream in) throws java.io.IOException, java.lang.ClassNotFoundException
private static java.lang.String[] useStandard(java.lang.String[] standard, java.lang.String[] s)
public void run()
run
in interface java.lang.Runnable
public void clear()
public void pause()
public void stop()
void timedOut()
public int getState()
public void visit(Page page)
page
- Page retrieved by the crawlerpublic boolean shouldVisit(Link l)
l
- Link encountered by the crawlerpublic void expand(Page page)
page
- Page to expandpublic int getPagesVisited()
public int getLinksTested()
public int getPagesLeft()
public int getActiveThreads()
public java.lang.String getName()
public void setName(java.lang.String name)
name
- new name for crawlerpublic java.lang.String toString()
toString
in class java.lang.Object
public Link[] getRoots()
public Link[] getCrawledRoots()
public java.lang.String getRootHrefs()
public void setRootHrefs(java.lang.String hrefs) throws java.net.MalformedURLException
hrefs
- URLs of starting point, separated by space, \t, or \njava.net.MalformedURLException
- if any of the URLs is invalid,
leaving starting points unchangedpublic void setRoot(Link link)
link
- starting pointpublic void setRoots(Link[] links)
links
- starting pointspublic void addRoot(Link link)
link
- starting point to addpublic java.lang.String[] getDomain()
public void setDomain(java.lang.String[] domain)
domain
- one of WEB, SERVER, or SUBTREE.public java.lang.String[] getLinkType()
public void setLinkType(java.lang.String[] type)
type
- one of HYPERLINKS, HYPERLINKS_AND_IMAGES, or ALL_LINKS.public boolean getDepthFirst()
public void setDepthFirst(boolean useDFS)
useDFS
- true if search should be depth-first, false if search should be breadth-first.public boolean getSynchronous()
public void setSynchronous(boolean f)
f
- true if crawler must visit the pages in priority order; false if crawler can visit
pages in any order.public boolean getIgnoreVisitedLinks()
public void setIgnoreVisitedLinks(boolean f)
f
- true if search skips links whose URLs have already been visited
(or queued for visiting).public int getMaxDepth()
public void setMaxDepth(int maxDepth)
maxDepth
- maximum depth of crawl, in hops from starting pointpublic DownloadParameters getDownloadParameters()
public void setDownloadParameters(DownloadParameters dp)
dp
- Download parameterspublic void setLinkPredicate(LinkPredicate pred)
pred
- Link predicatepublic LinkPredicate getLinkPredicate()
public void setPagePredicate(PagePredicate pred)
pred
- Page predicatepublic PagePredicate getPagePredicate()
public void setAction(Action act)
act
- Actionpublic Action getAction()
public void submit(Link link)
link
- Link to put in queuepublic void submit(Link[] links)
links
- Links to put in queuepublic java.util.Enumeration enumerateQueue()
public void addClassifier(Classifier c)
c
- a classifierpublic void removeClassifier(Classifier c)
c
- a classifierpublic void removeAllClassifiers()
public java.util.Enumeration enumerateClassifiers()
public Classifier[] getClassifiers()
public void addCrawlListener(CrawlListener listen)
listen
- a listenerpublic void removeCrawlListener(CrawlListener listen)
listen
- a listenerpublic void addLinkListener(LinkListener listen)
listen
- a listenerpublic void removeLinkListener(LinkListener listen)
listen
- a listenerprotected void sendCrawlEvent(int id)
id
- Event idprotected void sendLinkEvent(Link l, int id)
l
- Link related to eventid
- Event idprotected void sendLinkEvent(Link l, int id, java.lang.Throwable exception)
l
- Link related to eventid
- Event idexception
- Exception associated with eventpublic boolean visited(Link link)
link
- Link to testprotected void markVisited(Link link)
link
- Link that has been visitedprotected void clearVisited()
void fetch(Worm w)
void process(Link link)
void fetchTimedOut(Worm w, int interval)
public static void main(java.lang.String[] args) throws java.lang.Exception
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: INNER | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |