websphinx
Class Link

java.lang.Object
  |
  +--websphinx.Region
        |
        +--websphinx.Element
              |
              +--websphinx.Link
All Implemented Interfaces:
Prioritized
Direct Known Subclasses:
Form, FormButton

public class Link
extends Element
implements Prioritized

Link to a Web page.

See Also:
Page

Field Summary
private  int depth
           
private  java.lang.String directory
           
private  DownloadParameters dp
           
private  java.lang.String filename
           
static int GET
          Use the HTTP GET method to download this link.
private  Page page
           
static int POST
          Use the HTTP POST method to access this link.
private  float priority
           
private  java.lang.String query
           
private  java.lang.String ref
           
private  int status
           
private  java.lang.String text
           
protected  java.net.URL url
           
 
Fields inherited from class websphinx.Element
child, endTag, parent, sibling, startTag
 
Fields inherited from class websphinx.Region
end, INITIAL_SIZE, names, source, start, TRUE
 
Constructor Summary
Link(java.io.File file)
          Make a Link from a File.
Link(java.lang.String href)
          Make a Link from a string URL.
Link(Tag startTag, Tag endTag, java.net.URL base)
          Make a Link from a start tag and end tag and a base URL (for relative references).
Link(java.net.URL url)
          Make a Link from a URL.
 
Method Summary
 void discardContent()
          Eliminate all references to page content.
 void disconnect()
          Disconnect this link from its downloaded page (throwing away the page).
static java.net.URL FileToURL(java.io.File file)
          Convert a local filename to a URL.
 int getDepth()
          Get the depth of the link in the crawl.
 java.lang.String getDirectory()
          Get the directory part of the link, like "/home/dir/".
 java.net.URL getDirectoryURL()
          Get the URL of a page's directory.
static java.net.URL getDirectoryURL(java.net.URL url)
          Get the URL of a page's directory.
 DownloadParameters getDownloadParameters()
          Get the download parameters used for this link.
 java.lang.String getFile()
          Get the information part of the link, like "/home/dir/index.html?query".
 java.lang.String getFilename()
          Get the filename part of the link, like "index.html".
 java.lang.String getHost()
          Get the hostname of the link, like "www.cs.cmu.edu".
private static java.lang.String getHrefAttributeName(Tag tag)
           
 int getMethod()
          Get the method used to access this link.
 Page getPage()
          Get the downloaded page to which the link points.
 java.net.URL getPageURL()
          Get the URL of a page, omitting any anchor reference (like #ref).
static java.net.URL getPageURL(java.net.URL url)
          Get the URL of a page, omitting any anchor reference (like #ref).
 java.net.URL getParentURL()
          Get the URL of a page's parent directory.
static java.net.URL getParentURL(java.net.URL url)
          Get the URL of a page's parent directory.
 int getPort()
          Get the port number of the link.
 float getPriority()
          Get the priority of the link in the crawl.
 java.lang.String getProtocol()
          Get the network protocol of the link, like "ftp" or "http".
 java.lang.String getQuery()
          Get the query part of the link, like "?query".
 java.lang.String getRef()
          Get the anchor reference of the link, like "#ref".
 java.net.URL getServiceURL()
          Get the URL of a Web service, omitting any query or anchor reference.
static java.net.URL getServiceURL(java.net.URL url)
          Get the URL of a Web service, omitting any query or anchor reference.
 int getStatus()
          Get the status of the link.
 java.net.URL getURL()
          Get the URL.
private  void parseURL()
           
private static java.lang.String relativeTo(java.lang.String here, java.lang.String there)
           
static java.lang.String relativeTo(java.net.URL here, java.lang.String there)
           
static java.lang.String relativeTo(java.net.URL here, java.net.URL there)
           
 Tag replaceHref(java.lang.String newHref)
          Copy the link's start tag, replacing the URL.
 void setDownloadParameters(DownloadParameters dp)
          Set the download parameters used for this link.
 void setPage(Page page)
          Set the page corresponding to this link.
 void setPriority(float priority)
          Set the priority of the link in the crawl.
 void setStatus(int event)
          Set the status of the link.
 void setText(java.lang.String text)
          Set the tagless-text representation of this region.
 java.lang.String toDescription()
          Generate a human-readable description of the link.
 java.lang.String toText()
          Convert the region to tagless text.
 java.lang.String toURL()
          Convert the link's URL to a String.
static java.lang.String toURLDelimiters(java.lang.String path)
           
protected  java.net.URL urlFromHref(Tag tag, java.net.URL base)
          Construct the URL for a link element, from its start tag and a base URL (for relative references).
static java.io.File URLToFile(java.net.URL url)
          Convert a file: URL to a filename appropriate to the current system platform.
 
Methods inherited from class websphinx.Element
enumerateHTMLAttributes, getChild, getEndTag, getHTMLAttribute, getHTMLAttribute, getNext, getParent, getSibling, getStartTag, getTagName, hasHTMLAttribute
 
Methods inherited from class websphinx.Region
enumerateObjectLabels, findEnd, findStart, getEnd, getField, getFields, getLabel, getLabel, getLength, getNumericLabel, getObjectLabel, getObjectLabels, getRootElement, getSource, getStart, hasAllLabels, hasAllLabels, hasAnyLabels, hasAnyLabels, hasLabel, removeLabel, setField, setFields, setLabel, setLabel, setObjectLabel, span, toHTML, toString, toTags
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, wait, wait, wait
 

Field Detail

url

protected java.net.URL url

directory

private java.lang.String directory

filename

private java.lang.String filename

query

private java.lang.String query

ref

private java.lang.String ref

page

private Page page

depth

private int depth

text

private java.lang.String text

status

private int status

priority

private float priority

dp

private DownloadParameters dp

GET

public static final int GET
Use the HTTP GET method to download this link.

POST

public static final int POST
Use the HTTP POST method to access this link.

Constructor Detail

Link

public Link(Tag startTag,
            Tag endTag,
            java.net.URL base)
     throws java.net.MalformedURLException
Make a Link from a start tag and end tag and a base URL (for relative references). The tags must be on the same page.
Parameters:
startTag - Start tag of element
endTag - End tag of element
base - Base URL used for relative references

Link

public Link(java.net.URL url)
Make a Link from a URL.

Link

public Link(java.io.File file)
     throws java.net.MalformedURLException
Make a Link from a File.

Link

public Link(java.lang.String href)
     throws java.net.MalformedURLException
Make a Link from a string URL.
Throws:
java.net.MalformedURLException - if the URL is invalid

Method Detail

discardContent

public void discardContent()
Eliminate all references to page content.

disconnect

public void disconnect()
Disconnect this link from its downloaded page (throwing away the page).

getDepth

public int getDepth()
Get the depth of the link in the crawl.
Returns:
depth of link from root (depth of roots is 0)

getURL

public java.net.URL getURL()
Get the URL.
Returns:
the URL of the link

getProtocol

public java.lang.String getProtocol()
Get the network protocol of the link, like "ftp" or "http".
Returns:
the protocol portion of the link's URL

getHost

public java.lang.String getHost()
Get the hostname of the link, like "www.cs.cmu.edu".
Returns:
the hostname portion of the link's URL

getPort

public int getPort()
Get the port number of the link.
Returns:
the port number of the link's URL, or -1 if no port number is explicitly specified in the URL

getFile

public java.lang.String getFile()
Get the information part of the link, like "/home/dir/index.html?query". Equivalent to getURL().getFile().
Returns:
the file portion of the link's URL (path plus query), as returned by getURL().getFile()

getDirectory

public java.lang.String getDirectory()
Get the directory part of the link, like "/home/dir/". Always starts and ends with '/'.
Returns:
the directory portion of the link's URL

getFilename

public java.lang.String getFilename()
Get the filename part of the link, like "index.html". Never contains '/'; may be the empty string.
Returns:
the filename portion of the link's URL

getQuery

public java.lang.String getQuery()
Get the query part of the link, like "?query". Either starts with a '?', or is empty.
Returns:
the query portion of the link's URL

getRef

public java.lang.String getRef()
Get the anchor reference of the link, like "#ref". Either starts with '#', or is empty.
Returns:
the anchor reference portion of the link's URL
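
The conventions described for getDirectory, getFilename, getQuery and getRef can be illustrated with a small standalone sketch. It is not the websphinx implementation: the class UrlParts and its parse method are hypothetical, it uses java.net.URI from the JDK instead of Link, and it assumes an absolute URL. Treating a URL with no path as the root directory "/" is also an assumption.

```java
import java.net.URI;

// Hypothetical helper, not part of websphinx: splits a URL following the
// conventions documented for getDirectory/getFilename/getQuery/getRef.
final class UrlParts {
    final String directory; // starts and ends with '/'
    final String filename;  // never contains '/', may be empty
    final String query;     // starts with '?', or is empty
    final String ref;       // starts with '#', or is empty

    private UrlParts(String directory, String filename, String query, String ref) {
        this.directory = directory;
        this.filename = filename;
        this.query = query;
        this.ref = ref;
    }

    static UrlParts parse(String href) {
        URI uri = URI.create(href);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // assumption: no path means the root directory
        }
        // Everything up to and including the last '/' is the directory.
        int lastSlash = path.lastIndexOf('/');
        String directory = path.substring(0, lastSlash + 1);
        String filename = path.substring(lastSlash + 1);
        // Keep the leading delimiter, or use the empty string when absent.
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        String ref = uri.getRawFragment() == null ? "" : "#" + uri.getRawFragment();
        return new UrlParts(directory, filename, query, ref);
    }

    public static void main(String[] args) {
        UrlParts p = parse("http://www.cs.cmu.edu/home/dir/index.html?query#ref");
        System.out.println(p.directory + " | " + p.filename + " | " + p.query + " | " + p.ref);
    }
}
```

In this sketch, concatenating directory, filename and query gives the same string that the getFile description shows, "/home/dir/index.html?query".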

getPageURL

public java.net.URL getPageURL()
Get the URL of a page, omitting any anchor reference (like #ref).
Returns:
the URL sans anchor reference

getPageURL

public static java.net.URL getPageURL(java.net.URL url)
Get the URL of a page, omitting any anchor reference (like #ref).
Returns:
the URL sans anchor reference

getServiceURL

public java.net.URL getServiceURL()
Get the URL of a Web service, omitting any query or anchor reference.
Returns:
the URL sans query and anchor reference

getServiceURL

public static java.net.URL getServiceURL(java.net.URL url)
Get the URL of a Web service, omitting any query or anchor reference.
Returns:
the URL sans query and anchor reference
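
The difference between a page URL and a service URL can be shown with plain string handling. This is only a sketch of the idea, not the code behind getPageURL or getServiceURL: the class UrlTrimming and both method names are hypothetical, and URLs are passed as strings rather than java.net.URL objects to keep the example free of checked exceptions.

```java
// Hypothetical helper, not part of websphinx.
final class UrlTrimming {
    // Page URL: drop the anchor reference, which starts at the first '#'.
    static String pageUrl(String href) {
        int hash = href.indexOf('#');
        return hash < 0 ? href : href.substring(0, hash);
    }

    // Service URL: drop the anchor reference and then the query,
    // which starts at the first '?' before any '#'.
    static String serviceUrl(String href) {
        String page = pageUrl(href);
        int question = page.indexOf('?');
        return question < 0 ? page : page.substring(0, question);
    }

    public static void main(String[] args) {
        String href = "http://www.cs.cmu.edu/home/dir/index.html?query#ref";
        System.out.println(pageUrl(href));
        System.out.println(serviceUrl(href));
    }
}
```

Stripping the reference first matters: a '?' that appears inside the anchor reference is not the start of a query.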

getDirectoryURL

public java.net.URL getDirectoryURL()
Get the URL of a page's directory.
Returns:
the URL sans filename, query and anchor reference

getDirectoryURL

public static java.net.URL getDirectoryURL(java.net.URL url)
Get the URL of a page's directory.
Returns:
the URL sans filename, query and anchor reference

getParentURL

public java.net.URL getParentURL()
Get the URL of a page's parent directory.
Returns:
the URL of the parent directory, sans filename, query and anchor reference

getParentURL

public static java.net.URL getParentURL(java.net.URL url)
Get the URL of a page's parent directory.
Returns:
the URL of the parent directory, sans filename, query and anchor reference
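
A directory URL and a parent URL differ by one path segment. The following sketch shows one way to compute both with java.net.URI. It is an illustration under stated assumptions, not the websphinx code: the class DirectoryUrls and its methods are hypothetical, the URL is assumed to have a host (as http URLs do), and the parent of the root directory is assumed to be the root itself. How Link treats a URL that already names a directory is not stated in this page; here its parent is simply the directory one level up.

```java
import java.net.URI;

// Hypothetical helper, not part of websphinx.
final class DirectoryUrls {
    // Directory part of the path: everything up to and including the last '/'.
    private static String directoryPath(URI uri) {
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        return dir.isEmpty() ? "/" : dir; // assumption: no path means "/"
    }

    // Scheme and authority, e.g. "http://www.cs.cmu.edu".
    private static String prefix(URI uri) {
        return uri.getScheme() + "://" + uri.getRawAuthority();
    }

    // The URL without filename, query and anchor reference.
    static String directoryUrl(String href) {
        URI uri = URI.create(href);
        return prefix(uri) + directoryPath(uri);
    }

    // The directory URL with its last directory segment removed.
    static String parentUrl(String href) {
        URI uri = URI.create(href);
        String dir = directoryPath(uri);
        if (dir.length() > 1) {
            // "/home/dir/" becomes "/home/"; "/" is left unchanged.
            dir = dir.substring(0, dir.lastIndexOf('/', dir.length() - 2) + 1);
        }
        return prefix(uri) + dir;
    }

    public static void main(String[] args) {
        String href = "http://www.cs.cmu.edu/home/dir/index.html?query#ref";
        System.out.println(directoryUrl(href));
        System.out.println(parentUrl(href));
    }
}
```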

relativeTo

public static java.lang.String relativeTo(java.net.URL here,
                                          java.net.URL there)

relativeTo

public static java.lang.String relativeTo(java.net.URL here,
                                          java.lang.String there)

relativeTo

private static java.lang.String relativeTo(java.lang.String here,
                                           java.lang.String there)

FileToURL

public static java.net.URL FileToURL(java.io.File file)
                              throws java.net.MalformedURLException
Convert a local filename to a URL. For example, if the filename is "C:\FOO\BAR\BAZ", the resulting URL is "file:/C:/FOO/BAR/BAZ".
Parameters:
file - File to convert
Returns:
URL corresponding to file
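
A comparable conversion is available in the JDK itself. The sketch below uses File.toURI() and is not the FileToURL implementation; the class FileToUrlSketch and the method toFileUrl are hypothetical names. The result depends on the platform and the current directory, so no fixed output is shown.

```java
import java.io.File;

// Hypothetical helper, not part of websphinx.
final class FileToUrlSketch {
    static String toFileUrl(String path) {
        // File.toURI() resolves the path to an absolute one and replaces
        // the platform separator with '/', yielding a "file:/..." URI.
        return new File(path).toURI().toString();
    }

    public static void main(String[] args) {
        System.out.println(toFileUrl("/FOO/BAR/BAZ"));
    }
}
```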

URLToFile

public static java.io.File URLToFile(java.net.URL url)
                              throws java.net.MalformedURLException
Convert a file: URL to a filename appropriate to the current system platform. For example, on MS Windows, if the URL is "file:/FOO/BAR/BAZ", the resulting filename is "\FOO\BAR\BAZ".
Parameters:
url - URL to convert
Returns:
File corresponding to url
Throws:
java.net.MalformedURLException - if url is not a file: URL.
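
The reverse direction can be sketched with the JDK's File(URI) constructor. This is an illustration, not the URLToFile implementation: the class UrlToFileSketch and the method toFile are hypothetical, and the sketch signals a non-file URL with an unchecked IllegalArgumentException, whereas URLToFile is documented to throw java.net.MalformedURLException.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper, not part of websphinx.
final class UrlToFileSketch {
    static File toFile(String fileUrl) {
        URI uri = URI.create(fileUrl);
        // Reject anything that is not a file: URL before converting.
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("not a file: URL: " + fileUrl);
        }
        // File(URI) maps the URI path to the platform's filename syntax.
        return new File(uri);
    }

    public static void main(String[] args) {
        System.out.println(toFile("file:/FOO/BAR/BAZ").getPath());
    }
}
```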

toURLDelimiters

public static java.lang.String toURLDelimiters(java.lang.String path)

getPage

public Page getPage()
Get the downloaded page to which the link points.
Returns:
the Page object, or null if the page hasn't been downloaded.

setPage

public void setPage(Page page)
Set the page corresponding to this link.
Parameters:
page - Page to which this link points

getMethod

public int getMethod()
Get the method used to access this link.
Returns:
GET or POST.

toURL

public java.lang.String toURL()
Convert the link's URL to a String.
Returns:
the URL represented as a string

toDescription

public java.lang.String toDescription()
Generate a human-readable description of the link.
Returns:
a description of the link, in the form "[url]".

toText

public java.lang.String toText()
Convert the region to tagless text.
Overrides:
toText in class Region
Returns:
a string consisting of the text in the page contained by this region

setText

public void setText(java.lang.String text)
Set the tagless-text representation of this region.
Parameters:
text - a string consisting of the text in the page contained by this region

parseURL

private void parseURL()

urlFromHref

protected java.net.URL urlFromHref(Tag tag,
                                   java.net.URL base)
                            throws java.net.MalformedURLException
Construct the URL for a link element, from its start tag and a base URL (for relative references).
Parameters:
tag - Start tag of link, such as <A HREF="/foo/index.html">.
base - Base URL used for relative references
Returns:
URL to which the link points
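
Resolving a relative reference against a base URL can be illustrated with java.net.URI.resolve. This is a sketch of the general technique, not the body of urlFromHref: the class HrefResolver and its resolve method are hypothetical, and the href is passed as a plain string instead of being read from a Tag. It assumes the href is already a syntactically valid URI reference; URI.create throws IllegalArgumentException otherwise.

```java
import java.net.URI;

// Hypothetical helper, not part of websphinx.
final class HrefResolver {
    static String resolve(String base, String href) {
        // An absolute href is returned as is; a relative one is merged
        // with the base path, and "." / ".." segments are normalized.
        return URI.create(base).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.cs.cmu.edu/home/dir/index.html";
        System.out.println(resolve(base, "/foo/index.html"));
        System.out.println(resolve(base, "other.html"));
        System.out.println(resolve(base, "../up.html"));
    }
}
```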

replaceHref

public Tag replaceHref(java.lang.String newHref)
Copy the link's start tag, replacing the URL. Note that the name of the attribute containing the URL varies from tag to tag: sometimes it is called HREF, sometimes SRC, sometimes CODE, etc. This method changes the appropriate attribute for this tag.
Parameters:
newHref - New URL or relative reference; e.g. "http://www.cs.cmu.edu/" or "/foo/index.html".
Returns:
copy of this link's start tag with its URL attribute replaced. The copy is a region of a fresh page containing only the tag.
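
The point that the URL attribute depends on the tag can be made concrete with a lookup from tag name to attribute name. The table below is illustrative only: it lists a few standard HTML pairings and is not the mapping websphinx uses internally. The class UrlAttributeNames and the method urlAttributeFor are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of websphinx.
final class UrlAttributeNames {
    // Returns the attribute that holds the URL for a tag, or null when
    // the tag is not in this (deliberately incomplete) table.
    static String urlAttributeFor(String tagName) {
        switch (tagName.toLowerCase(Locale.ROOT)) {
            case "a":
            case "area":
            case "link":
            case "base":
                return "href";
            case "img":
            case "frame":
            case "iframe":
            case "script":
                return "src";
            case "applet":
                return "code";
            case "form":
                return "action";
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(urlAttributeFor("A"));
        System.out.println(urlAttributeFor("IMG"));
        System.out.println(urlAttributeFor("APPLET"));
    }
}
```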

getHrefAttributeName

private static java.lang.String getHrefAttributeName(Tag tag)

getStatus

public int getStatus()
Get the status of the link. Possible values are defined in LinkEvent.
Returns:
last event that happened to this link

setStatus

public void setStatus(int event)
Set the status of the link. Possible values are defined in LinkEvent.
Parameters:
event - the event that just happened to this link

getPriority

public float getPriority()
Get the priority of the link in the crawl.
Specified by:
getPriority in interface Prioritized

setPriority

public void setPriority(float priority)
Set the priority of the link in the crawl.

getDownloadParameters

public DownloadParameters getDownloadParameters()
Get the download parameters used for this link. Default is null.

setDownloadParameters

public void setDownloadParameters(DownloadParameters dp)
Set the download parameters used for this link.