Package org.htmlparser.filters
Class RegexFilter
java.lang.Object
org.htmlparser.filters.RegexFilter
- All Implemented Interfaces:
Serializable, Cloneable, NodeFilter
This filter accepts all string nodes matching a regular expression.
Because this searches Text nodes, it is only useful for finding small
fragments of text that are unlikely to be broken up by a tag. To find
large fragments of text, convert the page to plain text with something
like the StringBean and then apply the regular expression.
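The limitation described above can be illustrated without the parser at all. The following is a minimal sketch that uses only java.util.regex; the class name SplitTextDemo, the phrase pattern and the list of segments are hypothetical stand-ins for the Text nodes a parser might produce from markup such as "a <b>regular</b> expression filter". It is not taken from the library.

```java
import java.util.List;
import java.util.regex.Pattern;

public class SplitTextDemo {
    // Hypothetical phrase to search for; any multi-word pattern behaves the same way.
    static final Pattern PHRASE = Pattern.compile("regular\\s+expression");

    /** True if the phrase is found inside any single segment (one "text node" at a time). */
    static boolean anySegmentMatches(List<String> segments) {
        for (String segment : segments) {
            if (PHRASE.matcher(segment).find()) {
                return true;
            }
        }
        return false;
    }

    /** True if the phrase is found in the concatenated plain text of all segments. */
    static boolean joinedTextMatches(List<String> segments) {
        return PHRASE.matcher(String.join("", segments)).find();
    }

    public static void main(String[] args) {
        // Assumed split of "a <b>regular</b> expression filter" into text segments.
        List<String> segments = List.of("a ", "regular", " expression filter");
        System.out.println(anySegmentMatches(segments));  // false: the tag splits the phrase
        System.out.println(joinedTextMatches(segments));  // true: plain text keeps it whole
    }
}
```

Searching the joined text plays the role of the plain-text conversion suggested above: the phrase is only visible once the tag boundaries are gone.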
For example, to look for dates use:
(19|20)\d\d([- \\/.](0[1-9]|1[012])[- \\/.](0[1-9]|[12][0-9]|3[01]))?
as in:
Parser parser = new Parser ("http://cbc.ca");
RegexFilter filter = new RegexFilter ("(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");
NodeIterator iterator = parser.extractAllNodesThatMatch (filter).elements ();
which matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31,
with a choice of five separators: a dash, a space, a forward slash,
a backslash or a period.
The year is matched by (19|20)\d\d which uses alternation to allow
either 19 or 20 as the first two digits. The round brackets are mandatory:
without them the alternation would apply to the whole expression.
The month is matched by 0[1-9]|1[012], again enclosed by round brackets
to keep the two options together. By using character classes, the first
option matches a number between 01 and 09, and the second
matches 10, 11 or 12.
The day is matched by 0[1-9]|[12][0-9]|3[01], which consists of three options.
The first matches the numbers 01 through 09, the second 10 through 29, and
the third matches 30 or 31.
The day and month are optional, but must occur together because of the ()?
bracketing after the year.
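To see how the pieces described above behave on real input, the expression can be tried with java.util.regex directly, without fetching a page. This is a minimal sketch: the class DateRegexDemo and the helper findDates are hypothetical names introduced for illustration, and the sample strings are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRegexDemo {
    // The date expression from the example, written as a Java string literal.
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");

    /** Returns every substring of text that the date pattern finds, in order. */
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Full dates with different separators.
        System.out.println(findDates("Posted 2004-05-17, updated 2005/01/03."));
        // The month and day group is optional, so a bare year also matches.
        System.out.println(findDates("Copyright 1999"));
        // Month 13 is rejected, so only the year part matches.
        System.out.println(findDates("Due 2004-13-01"));
    }
}
```

Note that an invalid month or day does not prevent a match: because the trailing group is optional, the expression falls back to matching the year alone.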
-
Field Summary
Fields
Modifier and Type    Field             Description
static final int     FIND              Use find() match strategy.
static final int     LOOKINGAT         Use lookingAt() match strategy.
static final int     MATCH             Use matches() match strategy.
protected Pattern    mPattern          The compiled regular expression to search for.
protected String     mPatternString    The regular expression to search for.
protected int        mStrategy         The match strategy.
-
Constructor Summary
Constructors
Constructor                                   Description
RegexFilter()                                 Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
RegexFilter(String pattern)                   Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
RegexFilter(String pattern, int strategy)     Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
-
Method Summary
Modifier and Type    Method                        Description
boolean              accept(Node node)             Accept string nodes that match the regular expression.
String               getPattern()                  Get the search pattern.
int                  getStrategy()                 Get the search strategy.
void                 setPattern(String pattern)    Set the search pattern.
void                 setStrategy(int strategy)     Set the search strategy.
-
Field Details
-
MATCH
public static final int MATCH
Use matches() match strategy.
-
LOOKINGAT
public static final int LOOKINGAT
Use lookingAt() match strategy.
-
FIND
public static final int FIND
Use find() match strategy.
-
mPatternString
protected String mPatternString
The regular expression to search for.
-
mPattern
protected Pattern mPattern
The compiled regular expression to search for.
-
mStrategy
protected int mStrategy
The match strategy.
-
-
Constructor Details
-
RegexFilter
public RegexFilter()
Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
-
RegexFilter
public RegexFilter(String pattern)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
- Parameters:
pattern - The pattern to search for.
-
RegexFilter
public RegexFilter(String pattern, int strategy)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
- Parameters:
pattern - The pattern to search for.
strategy - The type of match:
- MATCH: use the matches() method, which attempts to match the entire input sequence against the pattern
- LOOKINGAT: use the lookingAt() method, which attempts to match the input sequence, starting at the beginning, against the pattern
- FIND: use the find() method, which scans the input sequence looking for the next subsequence that matches the pattern
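The difference between the three strategies is easiest to see on the underlying java.util.regex.Matcher methods that the descriptions above name. The following is a minimal sketch, assuming the filter delegates to these methods as described; the class MatchStrategyDemo, its helpers and the sample strings are hypothetical and not part of the library.

```java
import java.util.regex.Pattern;

public class MatchStrategyDemo {
    // A four-digit year starting with 19 or 20.
    static final Pattern YEAR = Pattern.compile("(19|20)\\d\\d");

    /** The whole input must match (the behaviour described for MATCH). */
    static boolean matches(String text) {
        return YEAR.matcher(text).matches();
    }

    /** The input must match at its start (the behaviour described for LOOKINGAT). */
    static boolean lookingAt(String text) {
        return YEAR.matcher(text).lookingAt();
    }

    /** Any subsequence may match (the behaviour described for FIND). */
    static boolean find(String text) {
        return YEAR.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { "1999", "1999 was a good year", "back in 1999", "no year here" };
        for (String sample : samples) {
            System.out.println("\"" + sample + "\""
                + " matches=" + matches(sample)
                + " lookingAt=" + lookingAt(sample)
                + " find=" + find(sample));
        }
    }
}
```

Each strategy is strictly more permissive than the previous one: any text accepted by matches() is also accepted by lookingAt(), and any text accepted by lookingAt() is also accepted by find().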
-
-
Method Details
-
getPattern
public String getPattern()
Get the search pattern.
- Returns:
The pattern.
-
setPattern
public void setPattern(String pattern)
Set the search pattern.
- Parameters:
pattern - The pattern to set.
-
getStrategy
public int getStrategy()
Get the search strategy.
- Returns:
The strategy.
-
setStrategy
public void setStrategy(int strategy)
Set the search strategy.
- Parameters:
strategy - The strategy to use. One of MATCH, LOOKINGAT or FIND.
-
accept
public boolean accept(Node node)
Accept string nodes that match the regular expression.
- Specified by:
accept in interface NodeFilter
- Parameters:
node - The node to check.
- Returns:
true if the regular expression matches the text of the node, false otherwise.
-