java.lang.Object
  |
  +--websphinx.Region
        |
        +--websphinx.Page
A Web page. Although a Page can represent any MIME type, it mainly supports HTML pages, which are automatically parsed. The parsing produces a list of tags, a list of words, an HTML parse tree, and a list of links.
Field Summary | |
(package private) java.net.URL |
base
|
(package private) java.lang.String |
canonicalTags
|
(package private) java.lang.String |
content
|
(package private) java.lang.String |
contentEncoding
|
(package private) int |
contentLock
|
(package private) java.lang.String |
contentType
|
(package private) Element[] |
elements
|
(package private) long |
expiration
|
private static java.lang.String |
GIF_CODE
Test whether page is a GIF or JPEG image. |
private static java.lang.String |
JPG_CODE
|
(package private) long |
lastModified
|
(package private) Link[] |
links
|
(package private) Link |
origin
|
(package private) int |
responseCode
|
(package private) java.lang.String |
responseMessage
|
(package private) Element |
root
|
(package private) Tag[] |
tags
|
(package private) java.lang.String |
title
|
(package private) Region[] |
tokens
|
(package private) Text[] |
words
|
Fields inherited from class websphinx.Region |
end, INITIAL_SIZE, names, source, start, TRUE |
Constructor Summary | |
Page(Link link)
Make a Page by downloading and parsing a Link. |
|
Page(Link link,
HTMLParser parser)
Make a Page by downloading a Link. |
|
Page(java.lang.String content)
Make a Page from a string of content. |
|
Page(java.net.URL url,
java.lang.String html)
Make a Page from a URL and a string of HTML. |
|
Page(java.net.URL url,
java.lang.String html,
HTMLParser parser)
Make a Page from a URL and a string of HTML. |
Method Summary | |
void |
discardContent()
Unlock the page's content (allowing it to be garbage-collected, to save space during a Web crawl). |
void |
download(HTMLParser parser)
|
(package private) void |
downloadSafely()
|
java.net.URL |
getBase()
Get the base URL, relative to which the page's links were interpreted. |
java.lang.String |
getContent()
Get the content of the page. |
java.lang.String |
getContentEncoding()
Get content encoding of page. |
java.lang.String |
getContentType()
Get MIME type of page. |
int |
getDepth()
Get depth of page in crawl. |
Element[] |
getElements()
Get the HTML elements in the page. |
long |
getExpiration()
Get expiration date of page. |
long |
getLastModified()
Get last-modified date of page. |
Link[] |
getLinks()
Get the links found in the page. |
Link |
getOrigin()
Get the Link that points to this page. |
int |
getResponseCode()
Get response code returned by the Web server. |
java.lang.String |
getResponseMessage()
Get response message returned by the Web server. |
Element |
getRootElement()
Get the root HTML element of the page. |
Tag[] |
getTags()
Get the tag sequence of the page. |
java.lang.String |
getTitle()
Get the title of the page. |
Region[] |
getTokens()
Get the token sequence of the page. |
java.net.URL |
getURL()
Get the URL. |
Text[] |
getWords()
Get the words in the page. |
boolean |
hasContent()
Test if page content is available. |
boolean |
isHTML()
Test whether page is HTML. |
boolean |
isImage()
|
boolean |
isParsed()
Test whether page has been parsed. |
void |
keepContent()
Lock the page's content (to prevent it from being discarded). |
static void |
main(java.lang.String[] args)
|
void |
parse(HTMLParser parser)
Parse the page. |
void |
setContentEncoding(java.lang.String encoding)
Set content encoding of page. |
void |
setContentType(java.lang.String type)
Set MIME type of page. |
void |
setExpiration(long expire)
Set expiration date of page. |
void |
setLastModified(long last)
Set last-modified date of page. |
java.lang.String |
substringCanonicalTags(int start,
int end)
Get canonicalized HTML tags found in a region. |
java.lang.String |
substringContent(int start,
int end)
Get raw content found in a region. |
java.lang.String |
substringHTML(int start,
int end)
Get HTML found in a region. |
java.lang.String |
substringTags(int start,
int end)
Get HTML tags found in a region. |
java.lang.String |
substringText(int start,
int end)
Get tagless text found in a region. |
java.lang.String |
toDescription()
Generate a human-readable description of the page. |
java.lang.String |
toString()
Get page containing the region. |
java.lang.String |
toURL()
Convert the link's URL to a String. |
Methods inherited from class websphinx.Region |
enumerateObjectLabels, findEnd, findStart, getEnd, getField, getFields, getLabel, getLabel, getLength, getNumericLabel, getObjectLabel, getObjectLabels, getSource, getStart, hasAllLabels, hasAllLabels, hasAnyLabels, hasAnyLabels, hasLabel, removeLabel, setField, setFields, setLabel, setLabel, setObjectLabel, span, toHTML, toTags, toText |
Methods inherited from class java.lang.Object |
|
Field Detail |
Link origin
long lastModified
long expiration
java.lang.String contentType
java.lang.String contentEncoding
int responseCode
java.lang.String responseMessage
java.net.URL base
java.lang.String title
Link[] links
int contentLock
java.lang.String content
Region[] tokens
Text[] words
Tag[] tags
Element[] elements
Element root
java.lang.String canonicalTags
private static final java.lang.String GIF_CODE
private static final java.lang.String JPG_CODE
Constructor Detail |
public Page(Link link) throws java.io.IOException
link - Link to download
public Page(Link link, HTMLParser parser) throws java.io.IOException
link - Link to download
parser - HTML parser to use
public Page(java.net.URL url, java.lang.String html)
url - URL to use as a base for relative links on the page
html - the HTML content of the page
public Page(java.net.URL url, java.lang.String html, HTMLParser parser)
url - URL to use as a base for relative links on the page
html - the HTML content of the page
parser - HTML parser to use
public Page(java.lang.String content)
content - HTML content of the page
Method Detail |
public void download(HTMLParser parser) throws java.io.IOException
void downloadSafely()
public void parse(HTMLParser parser)
parser - HTML parser to use
java.io.IOException - if an error occurs in downloading the page
public boolean isParsed()
public boolean isHTML()
public boolean isImage()
public void keepContent()
public void discardContent()
Links are not considered part of the content, and are not subject to discarding by this method. Also, if the page was created from a string (rather than by downloading), its content is not subject to discarding (since there would be no way to recover it).
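The locking behaviour described for keepContent and discardContent can be pictured as a simple lock counter. The sketch below is not the websphinx implementation: the class ContentLockSketch, its recoverable flag, and the choice to start the counter at 1 are assumptions made for illustration. Only the contentLock field name and the keep/discard semantics come from this page.

```java
public class ContentLockSketch {
    private String content;             // may be dropped to save space
    private final String[] links;       // never discarded
    private final boolean recoverable;  // false when built from a string (hypothetical flag)
    private int contentLock = 1;        // assumption: the creator holds the first lock

    public ContentLockSketch(String content, String[] links, boolean recoverable) {
        this.content = content;
        this.links = links;
        this.recoverable = recoverable;
    }

    // Lock the content: one more caller wants it kept.
    public void keepContent() {
        if (contentLock > 0) {
            contentLock++;
        }
    }

    // Unlock the content; drop it once nobody holds a lock.
    public void discardContent() {
        if (!recoverable || contentLock == 0) {
            return; // nothing to re-download from, or already discarded
        }
        if (--contentLock == 0) {
            content = null; // now eligible for garbage collection
        }
    }

    public boolean hasContent() {
        return content != null;
    }

    public String[] getLinks() {
        return links;
    }

    public static void main(String[] args) {
        ContentLockSketch page = new ContentLockSketch(
                "<html><a href=\"http://example.com/\">x</a></html>",
                new String[] {"http://example.com/"}, true);
        page.keepContent();     // a second holder locks the content
        page.discardContent();  // first unlock: content is still held
        System.out.println(page.hasContent());
        page.discardContent();  // last unlock: content is dropped
        System.out.println(page.hasContent());
        System.out.println(page.getLinks().length);
    }
}
```

Under these assumptions, every keepContent call must be balanced by a discardContent call before the content is released, while the links stay available throughout.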
public final boolean hasContent()
public int getDepth()
public Link getOrigin()
public java.net.URL getBase()
public java.net.URL getURL()
public java.lang.String getTitle()
public java.lang.String getContent()
public Region[] getTokens()
public Tag[] getTags()
public Text[] getWords()
public Element[] getElements()
public Element getRootElement()
Overrides: getRootElement in class Region
public Link[] getLinks()
public java.lang.String toURL()
public java.lang.String toDescription()
public java.lang.String toString()
Overrides: toString in class Region
public long getLastModified()
public void setLastModified(long last)
last - the date when the page was last modified, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
public long getExpiration()
public void setExpiration(long expire)
expire - the expiration date of the page, or 0 if not known. The value is the number of seconds since January 1, 1970 GMT.
public java.lang.String getContentType()
public void setContentType(java.lang.String type)
type - the MIME type of page, such as "text/html", or null if not known.
public java.lang.String getContentEncoding()
public void setContentEncoding(java.lang.String encoding)
encoding - the encoding type of page, such as "base-64", or null if not known.
public int getResponseCode()
HttpURLConnection
public java.lang.String getResponseMessage()
public java.lang.String substringContent(int start, int end)
start - starting offset of region
end - ending offset of region
public java.lang.String substringHTML(int start, int end)
start - starting offset of region
end - ending offset of region
public java.lang.String substringText(int start, int end)
start - starting offset of region
end - ending offset of region
public java.lang.String substringTags(int start, int end)
start - starting offset of region
end - ending offset of region
public java.lang.String substringCanonicalTags(int start, int end)
A canonicalized tag has the form:
<tagname#index attr=value attr=value attr=value ...>
where tagname and attr are all lowercase, and index is the tag's index in the page's tokens array. Attributes are sorted in increasing order by attribute name. Attributes without values omit the entire "=value" portion. Values are delimited by a space. All occurrences of <, >, space, and % characters in a value are URL-encoded (e.g., space is converted to %20). Thus the only occurrences of these characters in the canonical tag are the tag delimiters.
For example, raw HTML that looks like:
<IMG SRC="http://foo.com/map<>.gif" ISMAP>Image</IMG>
would be canonicalized to:
<img ismap src=http://foo.com/map%3C%3E.gif></img>
Comment and declaration tags (whose tag name is !) are omitted from the canonicalization.
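The canonicalization rules above can be sketched for a single start tag as follows. This is an illustrative reimplementation, not the websphinx code: the class CanonicalTagSketch and its methods encodeValue and canonicalTag are hypothetical, and the attributes are assumed to be already parsed into a map in which a null value marks an attribute without a value. The sketch emits the #index part from the format line, which the worked example in the text leaves out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CanonicalTagSketch {

    // Percent-encode only the four characters named in the text: <, >, space, %.
    public static String encodeValue(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("%3C"); break;
                case '>': sb.append("%3E"); break;
                case ' ': sb.append("%20"); break;
                case '%': sb.append("%25"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build <tagname#index attr=value ...> with lowercase names and sorted attributes.
    public static String canonicalTag(String name, int index, Map<String, String> attrs) {
        // TreeMap keeps the lowercased attribute names in increasing order.
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder sb = new StringBuilder("<");
        sb.append(name.toLowerCase(Locale.ROOT)).append('#').append(index);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                // Attributes without a value omit the whole "=value" portion.
                sb.append('=').append(encodeValue(e.getValue()));
            }
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("SRC", "http://foo.com/map<>.gif");
        attrs.put("ISMAP", null);
        // prints <img#0 ismap src=http://foo.com/map%3C%3E.gif>
        System.out.println(canonicalTag("IMG", 0, attrs));
    }
}
```

Because the four special characters are encoded inside values, a consumer of the canonical string can split tags on < and > and attributes on spaces without any quoting logic.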
start - starting offset of region
end - ending offset of region
public static void main(java.lang.String[] args) throws java.lang.Exception