Package org.htmlparser.lexer
Class Page
java.lang.Object
org.htmlparser.lexer.Page
- All Implemented Interfaces:
Serializable
Represents the contents of an HTML page.
Contains the source of characters and an index of positions of line
separators (actually the first character position on the next line).
-
Field Summary
Fields
Modifier and Type | Field | Description
static final String | DEFAULT_CHARSET | The default charset.
static final String | DEFAULT_CONTENT_TYPE | The default content type.
static final char | EOF | Character value when the page is exhausted.
protected String | mBaseUrl | The base URL for this page.
protected URLConnection | mConnection | The connection this page is coming from or null.
protected static ConnectionManager | mConnectionManager | Connection control (proxy, cookies, authorization).
protected PageIndex | mIndex | Character positions of the first character in each line.
protected Source | mSource | The source of characters.
protected String | mUrl | The URL this page is coming from.
-
Constructor Summary
Constructors
Constructor | Description
Page() | Construct an empty page.
Page(InputStream stream, String charset) | Construct a page from a stream encoded with the given charset.
Page(String text) | Construct a page from the given string.
Page(String text, String charset) | Construct a page from the given string.
Page(URLConnection connection) | Construct a page reading from a URL connection.
Page(Source source) | Construct a page from a source.
-
Method Summary
Modifier and Type | Method | Description
void | close() | Close the page by destroying the source of characters.
int | column(int position) | Get the column number for a cursor.
int | column(Cursor cursor) | Get the column number for a cursor.
 | constructUrl(String link, String base) | Build a URL from the link and base provided using non-strict rules.
 | constructUrl(String link, String base, boolean strict) | Build a URL from the link and base provided.
protected void | finalize() | Clean up this page, releasing resources.
static String | findCharset(String name, String fallback) | Look up a character set name.
 | getAbsoluteURL(String link) | Create an absolute URL from a relative link.
 | getAbsoluteURL(String link, boolean strict) | Create an absolute URL from a relative link.
 | getBaseUrl() | Gets the baseUrl.
char | getCharacter(Cursor cursor) | Read the character at the given cursor position.
 | getCharset(String content) | Get a CharacterSet name corresponding to a charset parameter.
 | getConnection() | Get the connection, if any.
static ConnectionManager | getConnectionManager() | Get the connection manager all Parsers use.
 | getContentType() | Try and extract the content type from the HTTP header.
 | getEncoding() | Get the current encoding being used.
 | getLine(int position) | Get the text line the position of the cursor lies on.
 | getLine(Cursor cursor) | Get the text line the position of the cursor lies on.
 | getSource() | Get the source this page is reading from.
 | getText() | Get all text read so far from the source.
void | getText(char[] array, int offset, int start, int end) | Put the text identified by the given limits into the given array at the specified offset.
 | getText(int start, int end) | Get the text identified by the given limits.
void | getText(StringBuffer buffer) | Put all text read so far from the source into the given buffer.
void | getText(StringBuffer buffer, int start, int end) | Put the text identified by the given limits into the given buffer.
 | getUrl() | Get the URL for this page.
void | reset() | Reset the page by resetting the source of characters.
int | row(int position) | Get the line number for a cursor.
int | row(Cursor cursor) | Get the line number for a cursor.
void | setBaseUrl(String url) | Sets the baseUrl.
void | setConnection(URLConnection connection) | Set the URLConnection to be used by this page.
static void | setConnectionManager(ConnectionManager manager) | Set the connection manager to use.
void | setEncoding(String character_set) | Begins reading from the source with the given character set.
void | setUrl(String url) | Set the URL for this page.
 | toString() | Display some of this page as a string.
void | ungetCharacter(Cursor cursor) | Return a character.
-
Field Details
-
DEFAULT_CHARSET
The default charset. This should be "ISO-8859-1", see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) section 3.7.1. Another alias is "8859_1".
-
DEFAULT_CONTENT_TYPE
The default content type. In the absence of alternate information, assume html content ("text/html").
-
EOF
public static final char EOF
Character value when the page is exhausted. Has a value of '\uffff'.
-
mUrl
The URL this page is coming from. Cached value of getConnection().getURL().toExternalForm() or setUrl().
-
mBaseUrl
The base URL for this page.
-
mSource
The source of characters.
-
mIndex
Character positions of the first character in each line.
-
mConnection
The connection this page is coming from or null.
-
mConnectionManager
Connection control (proxy, cookies, authorization).
-
-
Constructor Details
-
Page
public Page()
Construct an empty page.
-
Page
Construct a page reading from a URL connection.
- Parameters:
connection - A fully conditioned connection. The connect() method will be called so it need not be connected yet.
- Throws:
ParserException - An exception object wrapping a number of possible error conditions, some of which are outlined below.
- IOException if an I/O exception occurs creating the source.
- UnsupportedEncodingException if the character set specified in the HTTP header is not supported.
-
Page
Construct a page from a stream encoded with the given charset.
- Parameters:
stream - The source of bytes.
charset - The encoding used. If null, defaults to the DEFAULT_CHARSET.
- Throws:
UnsupportedEncodingException - If the given charset is not supported.
-
Page
Construct a page from the given string.
- Parameters:
text - The HTML text.
charset - Optional. The character set encoding that will be reported by getEncoding(). If charset is null the default character set is used.
-
Page
Construct a page from the given string. The page will report that it is using an encoding of DEFAULT_CHARSET.
- Parameters:
text - The HTML text.
-
Page
Construct a page from a source.
- Parameters:
source - The source of characters.
-
-
Method Details
-
getConnectionManager
Get the connection manager all Parsers use.
- Returns:
- The connection manager.
-
setConnectionManager
Set the connection manager to use.
- Parameters:
manager - The new connection manager.
-
getCharset
Get a CharacterSet name corresponding to a charset parameter.
- Parameters:
content - A text line of the form:
text/html; charset=Shift_JIS
which is applicable both to the HTTP header field Content-Type and the meta tag http-equiv="Content-Type". Note this method also handles non-compliant quoted charset directives such as:
text/html; charset="UTF-8"
and
text/html; charset='UTF-8'
- Returns:
- The character set name to use when reading the input stream. For JDKs that have the Charset class this is qualified by passing the name to findCharset() to render it into canonical form. If the charset parameter is not found in the given string, the default character set is returned.
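The parsing behavior described for getCharset can be illustrated with a small stand-alone sketch. This is not the library's implementation: the class CharsetParamSketch, the method charsetOf and the FALLBACK constant are hypothetical names, and the fallback value is assumed to match the documented DEFAULT_CHARSET. It accepts plain, double-quoted and single-quoted charset values.

```java
import java.util.Locale;

class CharsetParamSketch {
    // Assumed fallback, mirroring the documented DEFAULT_CHARSET value.
    static final String FALLBACK = "ISO-8859-1";

    /** Extracts the charset parameter from a Content-Type style string. */
    static String charsetOf(String content) {
        if (content == null) return FALLBACK;
        // Case-insensitive search; assumes ASCII input so indices line up.
        int idx = content.toLowerCase(Locale.ROOT).indexOf("charset");
        if (idx < 0) return FALLBACK;
        String rest = content.substring(idx + "charset".length()).trim();
        if (!rest.startsWith("=")) return FALLBACK;
        rest = rest.substring(1).trim();
        // Stop at the next parameter, if any.
        int semi = rest.indexOf(';');
        if (semi >= 0) rest = rest.substring(0, semi).trim();
        // Tolerate non-compliant quoting with either quote character.
        if (rest.length() >= 2) {
            char first = rest.charAt(0);
            char last = rest.charAt(rest.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                rest = rest.substring(1, rest.length() - 1).trim();
            }
        }
        return rest.isEmpty() ? FALLBACK : rest;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=Shift_JIS"));
        System.out.println(charsetOf("text/html; charset='UTF-8'"));
    }
}
```

The sketch returns the name as written; rendering it into canonical form is a separate step, which the documentation assigns to findCharset().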
-
findCharset
Look up a character set name. Vacuous for JVMs without java.nio.charset. This uses reflection so the code will still run under prior JDKs, but in that case the default is always returned.
- Parameters:
name - The name to look up. One of the aliases for a character set.
fallback - The name to return if the lookup fails.
- Returns:
- The character set name.
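On a JVM that has java.nio.charset, an alias-to-canonical-name lookup with a fallback might look like the sketch below. It calls Charset.forName directly instead of going through reflection as the documentation describes, and the class and method names (CharsetLookupSketch, canonicalName) are hypothetical.

```java
import java.nio.charset.Charset;

class CharsetLookupSketch {
    /**
     * Returns the canonical name for the given alias, or the fallback
     * when the name is null, illegal, or not supported by this JVM.
     */
    static String canonicalName(String name, String fallback) {
        try {
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException,
            // both of which extend IllegalArgumentException, plus a null name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("8859_1", "ISO-8859-1"));
        System.out.println(canonicalName("no-such-charset-xyz", "ISO-8859-1"));
    }
}
```

Catching the common superclass keeps the sketch short; code that needs to distinguish a malformed name from an unsupported one would catch the two subclasses separately.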
-
reset
public void reset()
Reset the page by resetting the source of characters.
-
close
Close the page by destroying the source of characters.
- Throws:
IOException - If destroying the source encounters an error.
-
finalize
Clean up this page, releasing resources. Calls close().
-
getConnection
Get the connection, if any.
- Returns:
- The connection object for this page, or null if this page is built from a stream or a string.
-
setConnection
Set the URLConnection to be used by this page. Starts reading from the given connection. This also resets the current url.
- Parameters:
connection - The connection to use. It will be connected by this method.
- Throws:
ParserException - If the connect() method fails, or an I/O error occurs opening the input stream or the character set designated in the HTTP header is unsupported.
-
getUrl
Get the URL for this page. This is only available if the page has a connection (getConnection() returns non-null), or the document base has been set via a call to setUrl().
- Returns:
- The url for the connection, or null if there is no connection or the document base has not been set.
-
setUrl
Set the URL for this page. This doesn't affect the contents of the page, just the interpretation of relative links from this point forward.
- Parameters:
url - The new URL.
-
getBaseUrl
Gets the baseUrl.
- Returns:
- The base URL for this page, or null if not set.
-
setBaseUrl
Sets the baseUrl.
- Parameters:
url - The base url for this page.
-
getSource
Get the source this page is reading from.
- Returns:
- The current source.
-
getContentType
Try and extract the content type from the HTTP header.
- Returns:
- The content type.
-
getCharacter
Read the character at the given cursor position. The cursor position can only be at or behind the current source position. Returns end of lines (EOL) as \n, by converting \r and \r\n to \n, and updates the end-of-line index accordingly. Advances the cursor position by one (or two in the \r\n case).
- Parameters:
cursor - The position to read at.
- Returns:
- The character at that position; the cursor is modified to prepare for the next read. If the source is exhausted, EOF is returned.
- Throws:
ParserException - If an IOException on the underlying source occurs, or an attempt is made to read characters in the future (the cursor position is ahead of the underlying stream).
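The end-of-line handling described above can be sketched over a plain String. This is an illustration only: EolReaderSketch, Pos and read are hypothetical stand-ins for the page, the cursor and getCharacter, the EOF sentinel is assumed to be the documented '\uffff', and the end-of-line index update is left out.

```java
class EolReaderSketch {
    // Assumed sentinel, matching the documented EOF value.
    static final char EOF = '\uffff';

    /** Hypothetical stand-in for a cursor: just a mutable position. */
    static final class Pos {
        int value;
    }

    /** Reads one logical character, folding "\r" and "\r\n" into "\n". */
    static char read(String text, Pos pos) {
        if (pos.value >= text.length()) {
            return EOF; // exhausted; the position is left unchanged
        }
        char c = text.charAt(pos.value++);
        if (c == '\r') {
            // Swallow the '\n' of a "\r\n" pair so the position advances by two.
            if (pos.value < text.length() && text.charAt(pos.value) == '\n') {
                pos.value++;
            }
            c = '\n';
        }
        return c;
    }

    /** Reads the whole text through read(), yielding '\n'-only line ends. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        Pos pos = new Pos();
        for (char c = read(text, pos); c != EOF; c = read(text, pos)) {
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(normalize("first\r\nsecond\rthird\n"));
    }
}
```

Using an in-band sentinel means text that itself contains '\uffff' would end the loop early; the sketch accepts that limitation for brevity.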
-
ungetCharacter
Return a character to the page (unread it). Handles end of lines (EOL) specially, retreating the cursor twice for the '\r\n' case, so the cursor position is moved back by one (or two in the \r\n case).
- Parameters:
cursor - The position to 'unread' at.
- Throws:
ParserException - If an IOException on the underlying source occurs.
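The retreat rule can be sketched the same way over a String. The names EolUnreadSketch, Pos and unread are hypothetical, and the sketch assumes the position never exceeds the text length.

```java
class EolUnreadSketch {
    /** Hypothetical stand-in for a cursor: just a mutable position. */
    static final class Pos {
        int value;
    }

    /** Moves the position back over one logical character. */
    static void unread(String text, Pos pos) {
        if (pos.value == 0) {
            return; // nothing to give back
        }
        pos.value--;
        // If we stepped back onto the '\n' of a "\r\n" pair, step over the '\r' too.
        if (text.charAt(pos.value) == '\n'
                && pos.value > 0
                && text.charAt(pos.value - 1) == '\r') {
            pos.value--;
        }
    }

    public static void main(String[] args) {
        Pos pos = new Pos();
        pos.value = 3; // just past the "\r\n" in "a\r\nb"
        unread("a\r\nb", pos);
        System.out.println(pos.value);
    }
}
```

Retreating by two for "\r\n" keeps unread symmetric with a reader that advances by two for the same pair.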
-
getEncoding
Get the current encoding being used.
- Returns:
- The encoding used to convert characters.
-
setEncoding
Begins reading from the source with the given character set. If the current encoding is the same as the requested encoding, this method is a no-op. Otherwise any subsequent characters read from this page will have been decoded using the given character set.
Some magic happens here to obtain this result if characters have already been consumed from this page. Since a Reader cannot be dynamically altered to use a different character set, the underlying stream is reset, a new Source is constructed and a comparison made of the characters read so far with the newly read characters up to the current position. If a difference is encountered, or some other problem occurs, an exception is thrown.
- Parameters:
character_set - The character set to use to convert bytes into characters.
- Throws:
ParserException - If a character mismatch occurs between characters already provided and those that would have been returned had the new character set been in effect from the beginning. An exception is also thrown if the underlying stream won't put up with these shenanigans.
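The re-decode-and-compare step can be sketched with an in-memory byte array standing in for the resettable stream. This is a simplification under stated assumptions: EncodingSwitchSketch and redecode are hypothetical names, the whole input is decoded at once rather than streamed, and IllegalStateException stands in for the ParserException the documentation names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSwitchSketch {
    /**
     * Decodes raw with newCharset and verifies that the characters already
     * handed out (consumed) are a prefix of the freshly decoded text.
     */
    static String redecode(byte[] raw, String consumed, Charset newCharset) {
        String fresh = new String(raw, newCharset);
        if (!fresh.startsWith(consumed)) {
            throw new IllegalStateException(
                "character mismatch when switching to " + newCharset.name());
        }
        return fresh;
    }

    public static void main(String[] args) {
        byte[] raw = "<html>caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Only the ASCII prefix was consumed under the old encoding,
        // so the two decodings agree up to that point.
        String consumed = new String(raw, StandardCharsets.ISO_8859_1).substring(0, 6);
        System.out.println(redecode(raw, consumed, StandardCharsets.UTF_8).length());
    }
}
```

The switch succeeds when only characters that decode identically in both encodings have been read, which is typically the case when a meta tag near the top of a page announces the real charset.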
-
constructUrl
Build a URL from the link and base provided using non-strict rules.
- Parameters:
link - The (relative) URI.
base - The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from.
- Returns:
- An absolute URL.
- Throws:
MalformedURLException - If creating the URL fails.
-
constructUrl
Build a URL from the link and base provided.
- Parameters:
link - The (relative) URI.
base - The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from.
strict - If true a link starting with '?' is handled according to RFC 2396, otherwise the common interpretation of a query appended to the base is used instead.
- Returns:
- An absolute URL.
- Throws:
MalformedURLException - If creating the URL fails.
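The difference between the two modes for a link that starts with '?' can be sketched as follows. This is an illustration, not the library's code: UrlResolveSketch and resolve are hypothetical names, java.net.URI is used for the RFC 2396 branch, strings are returned instead of URL objects, and a malformed input surfaces as IllegalArgumentException rather than MalformedURLException.

```java
import java.net.URI;

class UrlResolveSketch {
    /** Resolves link against base, with or without strict RFC 2396 rules. */
    static String resolve(String link, String base, boolean strict) {
        if (!strict && link.startsWith("?")) {
            // Common interpretation: keep the base path, replace its query.
            int q = base.lastIndexOf('?');
            String stem = (q >= 0) ? base.substring(0, q) : base;
            return stem + link;
        }
        // java.net.URI resolves references in the manner of RFC 2396 section 5.2,
        // which drops the last path segment of the base for a query-only link.
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/dir/page.html?x=1";
        System.out.println(resolve("?y=2", base, false));
        System.out.println(resolve("?y=2", base, true));
    }
}
```

For ordinary relative links such as other.html the two modes agree; they diverge only for query-only links.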
-
getAbsoluteURL
Create an absolute URL from a relative link.
- Parameters:
link - The relative portion of a URL.
- Returns:
- The fully qualified URL or the original link if it was absolute already or a failure occurred.
-
getAbsoluteURL
Create an absolute URL from a relative link.
- Parameters:
link - The relative portion of a URL.
strict - If true a link starting with '?' is handled according to RFC 2396, otherwise the common interpretation of a query appended to the base is used instead.
- Returns:
- The fully qualified URL or the original link if it was absolute already or a failure occurred.
-
row
Get the line number for a cursor.
- Parameters:
cursor - The character offset into the page.
- Returns:
- The line number the character is in.
-
row
public int row(int position)
Get the line number for a cursor.
- Parameters:
position - The character offset into the page.
- Returns:
- The line number the character is in.
-
column
Get the column number for a cursor.
- Parameters:
cursor - The character offset into the page.
- Returns:
- The character offset into the line this cursor is on.
-
column
public int column(int position)
Get the column number for a cursor.
- Parameters:
position - The character offset into the page.
- Returns:
- The character offset into the line this cursor is on.
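A row and column lookup over an index of line-start positions can be sketched as below. The class LineIndexSketch is hypothetical and is not the PageIndex class; it assumes line ends are already normalized to '\n' and that rows and columns are zero-based, which the text above does not state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LineIndexSketch {
    // Position of the first character after each '\n'; line 0 starts at 0 implicitly.
    private final List<Integer> lineStarts = new ArrayList<>();

    LineIndexSketch(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                lineStarts.add(i + 1);
            }
        }
    }

    /** Zero-based line number containing the given character offset. */
    int row(int position) {
        int found = Collections.binarySearch(lineStarts, position);
        // Exact hit at index k: position is the first character of line k + 1.
        // Miss: the result is -(insertionPoint) - 1, and the insertion point
        // counts the recorded line starts that lie before the position.
        return found >= 0 ? found + 1 : -found - 1;
    }

    /** Zero-based offset of the position within its line. */
    int column(int position) {
        int r = row(position);
        int start = (r == 0) ? 0 : lineStarts.get(r - 1);
        return position - start;
    }

    public static void main(String[] args) {
        LineIndexSketch index = new LineIndexSketch("ab\ncd\nef");
        System.out.println(index.row(4) + ":" + index.column(4));
    }
}
```

Storing the first position of the next line, rather than the separator position, lets column() subtract the stored value directly.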
-
getText
Get the text identified by the given limits.
- Parameters:
start - The starting position, zero based.
end - The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
- Returns:
- The text from start to end.
- Throws:
IllegalArgumentException - If an attempt is made to get characters ahead of the current source offset (character position).
-
getText
Put the text identified by the given limits into the given buffer.
- Parameters:
buffer - The accumulator for the characters.
start - The starting position, zero based.
end - The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
- Throws:
IllegalArgumentException - If an attempt is made to get characters ahead of the current source offset (character position).
-
getText
Get all text read so far from the source.
- Returns:
- The text from the source.
-
getText
Put all text read so far from the source into the given buffer.
- Parameters:
buffer - The accumulator for the characters.
-
getText
Put the text identified by the given limits into the given array at the specified offset.
- Parameters:
array - The array of characters.
offset - The starting position in the array where characters are to be placed.
start - The starting position, zero based.
end - The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
- Throws:
IllegalArgumentException - If an attempt is made to get characters ahead of the current source offset (character position).
-
getLine
Get the text line the position of the cursor lies on.
- Parameters:
cursor - The position to calculate for.
- Returns:
- The contents of the URL or file corresponding to the line number containing the cursor position.
-
getLine
Get the text line the position of the cursor lies on.
- Parameters:
position - The position to calculate for.
- Returns:
- The contents of the URL or file corresponding to the line number containing the cursor position.
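Extracting the line that contains a position can be sketched directly on a String. LineTextSketch and lineAt are hypothetical names; the sketch assumes '\n' line ends, treats a position on the '\n' itself as belonging to the line it terminates, and leaves the terminator out of the result, none of which is specified above.

```java
class LineTextSketch {
    /** Returns the line of text containing the given character offset. */
    static String lineAt(String text, int position) {
        if (position < 0 || position > text.length()) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        // The line starts just after the previous '\n', or at 0 if there is none.
        int start = text.lastIndexOf('\n', position - 1) + 1;
        // The line ends at the next '\n' at or after the position, or at end of text.
        int end = text.indexOf('\n', position);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(lineAt("ab\ncd\nef", 4));
    }
}
```

This scans from the position outward; with a precomputed index of line starts the same bounds could be found without scanning.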
-
toString
Display some of this page as a string.
-