java.lang.Object
  |
  +--websphinx.Crawler
Web crawler.
To write a crawler, extend this class and override shouldVisit() to create your own crawler. You can also modify this file.
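As a rough illustration of overriding shouldVisit(), the sketch below uses small stand-in classes so that it compiles without the websphinx library. The nested Link and Crawler classes, the url field, the NoPdfCrawler subclass, and the accepted() helper are all hypothetical; only the shouldVisit signature is taken from the method summary on this page. The stand-in's default of returning true is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ShouldVisitSketch {

    // Stand-in for websphinx.Link (assumption: reduced to a bare URL string).
    static class Link {
        final String url;
        Link(String url) { this.url = url; }
    }

    // Stand-in for websphinx.Crawler, exposing only the shouldVisit callback.
    // Returning true by default is an assumption made for this sketch.
    static class Crawler {
        public boolean shouldVisit(Link l) { return true; }
    }

    // Hypothetical subclass: refuse links that point at PDF files.
    static class NoPdfCrawler extends Crawler {
        @Override
        public boolean shouldVisit(Link l) {
            return !l.url.toLowerCase(Locale.ROOT).endsWith(".pdf");
        }
    }

    // Returns the URLs that the given crawler agrees to traverse.
    static List<String> accepted(Crawler crawler, List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String u : urls) {
            if (crawler.shouldVisit(new Link(u))) {
                out.add(u);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/index.html",
                "http://example.org/paper.PDF");
        System.out.println(accepted(new NoPdfCrawler(), urls));
    }
}
```

With the real library, the subclass would extend websphinx.Crawler directly and inspect the websphinx.Link it receives.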
To use a crawler:
Constructor Summary
Crawler() | Make a new Crawler.
Method Summary
void | addCrawlListener(websphinx.CrawlListener listen) | Adds a listener to the set of CrawlListeners for this crawler.
void | addLinkListener(websphinx.LinkListener listen) | Adds a listener to the set of LinkListeners for this crawler.
protected void | clearVisited() | Clear the set of visited links.
void | expand(websphinx.Page page) | Expand the crawl from a page.
int | getActiveThreads() | Get number of threads currently working.
websphinx.DownloadParameters | getDownloadParameters() | Get download parameters (such as number of threads, timeouts, and maximum page size).
boolean | getIgnoreVisitedLinks() | Get ignore-visited-links flag.
int | getMaxDepth() | Get maximum depth.
java.lang.String | getName() | Get human-readable name of crawler.
int | getPagesLeft() | Get number of pages left to be visited.
int | getPagesVisited() | Get number of pages visited.
websphinx.Link | getRoot() |
protected void | markVisited(websphinx.Link link) | Register the CRC32 value of the link's URL as visited.
void | printStatus(java.io.PrintStream out) | Print current status.
void | run() | Start crawling.
protected void | sendLinkEvent(websphinx.Link l, int id) | Send a LinkEvent to all LinkListeners registered with this crawler.
protected void | sendLinkEvent(websphinx.Link l, int id, java.lang.Throwable exception) | Send an exceptional LinkEvent to all LinkListeners registered with this crawler.
void | setDownloadParameters(websphinx.DownloadParameters dp) | Set download parameters (such as number of threads, timeouts, and maximum page size).
void | setHostRoot(java.lang.String name) | Set the host name of the root so that the crawler visits only web sites in the root's family.
void | setIgnoreVisitedLinks(boolean f) | Set ignore-visited-links flag.
void | setMaxDepth(int maxDepth) | Set maximum depth.
void | setName(java.lang.String name) | Set human-readable name of crawler.
void | setRoot(websphinx.Link link) | Set starting point of crawl as a single Link.
boolean | shouldVisit(websphinx.Link l) | Callback for testing whether a link should be traversed.
void | stop() | Stop crawling.
void | submit(websphinx.Link link) | Puts a link into the crawling queue.
java.lang.String | toString() | Convert the crawler to a String.
boolean | visited(websphinx.Link link) | Test whether the page corresponding to a link has been visited (or queued for visiting).
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Constructor Detail
public Crawler()
Method Detail
public void printStatus(java.io.PrintStream out)
out
- PrintStream to which the status is printed
public void run() throws java.lang.Exception
java.lang.Exception
public void stop()
public boolean shouldVisit(websphinx.Link l)
l
- Link encountered by the crawler
public void expand(websphinx.Page page)
page
- Page to expand
public int getPagesVisited()
public int getPagesLeft()
public int getActiveThreads()
public java.lang.String getName()
public void setName(java.lang.String name)
name
- new name for crawler
public java.lang.String toString()
Overrides toString in class java.lang.Object
public websphinx.Link getRoot()
public void setRoot(websphinx.Link link)
link
- starting point
public void setHostRoot(java.lang.String name)
name
- root's host name (e.g. washington.edu)
public boolean getIgnoreVisitedLinks()
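For setHostRoot above, this page does not say how a link is matched against the root's host name. One plausible reading, sketched here purely as an assumption, is a host-suffix test: a URL belongs to the root's family when its host equals the root host or is a subdomain of it. The class HostFamilySketch and the method inFamily are hypothetical names and are not part of websphinx.

```java
import java.net.URI;
import java.util.Locale;

public class HostFamilySketch {

    // True when the URL's host equals rootHost or is a subdomain of it.
    // This matching rule is an assumption, not the documented behaviour.
    static boolean inFamily(String url, String rootHost) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI
        }
        if (host == null) {
            return false; // e.g. a relative or opaque URI
        }
        host = host.toLowerCase(Locale.ROOT);
        String root = rootHost.toLowerCase(Locale.ROOT);
        // The leading dot keeps "notwashington.edu" from matching "washington.edu".
        return host.equals(root) || host.endsWith("." + root);
    }

    public static void main(String[] args) {
        System.out.println(inFamily("http://www.cs.washington.edu/index.html", "washington.edu"));
        System.out.println(inFamily("http://notwashington.edu/", "washington.edu"));
    }
}
```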
public void setIgnoreVisitedLinks(boolean f)
f
- true if the search skips links whose URLs have already been visited (or queued for visiting)
public int getMaxDepth()
public void setMaxDepth(int maxDepth)
maxDepth
- maximum depth of crawl, in hops from starting point
public websphinx.DownloadParameters getDownloadParameters()
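To make "depth in hops from starting point" (the setMaxDepth parameter above) concrete, here is a minimal breadth-first sketch over an in-memory link graph. It is not the crawler's actual traversal code. The class MaxDepthSketch, the crawl method, and the map-based graph are hypothetical, and the sketch assumes the starting point counts as depth 0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxDepthSketch {

    // Visits pages breadth-first and never expands a page that is already
    // maxDepth hops away from the root (the root itself is at depth 0).
    static List<String> crawl(Map<String, List<String>> links, String root, int maxDepth) {
        List<String> order = new ArrayList<>();
        Map<String, Integer> depth = new HashMap<>(); // doubles as the "seen" set
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            order.add(page);
            int d = depth.get(page);
            if (d >= maxDepth) {
                continue; // at the limit: visit, but do not follow links
            }
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!depth.containsKey(next)) {
                    depth.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return order;
    }
}
```

Under these assumptions, a maximum depth of 0 visits only the starting point, and a maximum depth of 1 adds the pages it links to directly.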
public void setDownloadParameters(websphinx.DownloadParameters dp)
dp
- Download parameters
public void addCrawlListener(websphinx.CrawlListener listen)
listen
- a listener
public void addLinkListener(websphinx.LinkListener listen)
listen
- a listener
public void submit(websphinx.Link link)
link
- Link to put in queue
protected void sendLinkEvent(websphinx.Link l, int id)
l
- Link related to event
id
- Event id
protected void sendLinkEvent(websphinx.Link l, int id, java.lang.Throwable exception)
l
- Link related to event
id
- Event id
exception
- Exception associated with event
public boolean visited(websphinx.Link link)
link
- Link to test
protected void markVisited(websphinx.Link link)
link
- Link that has been visited
protected void clearVisited()
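The method summary describes markVisited as registering a CRC32 value of the link's URL. A minimal sketch of that idea, using java.util.zip.CRC32 from the standard library, might look like the following. The class VisitedSetSketch, the String-based method signatures, the checksum helper, and the choice of UTF-8 encoding are assumptions made for illustration; the real methods take a websphinx.Link.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

public class VisitedSetSketch {

    // Stores one 32-bit checksum per URL instead of the URL string itself.
    private final Set<Long> visited = new HashSet<>();

    // CRC32 of the URL's UTF-8 bytes (the encoding is an assumption).
    static long checksum(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Record that the URL has been visited (or queued for visiting).
    void markVisited(String url) {
        visited.add(checksum(url));
    }

    // Test whether a URL with the same checksum has been recorded.
    boolean visited(String url) {
        return visited.contains(checksum(url));
    }

    // Forget all recorded checksums.
    void clearVisited() {
        visited.clear();
    }
}
```

Storing checksums keeps the set small, but two different URLs can share a CRC32 value, in which case the second one would be treated as already visited.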