A
Regexp
object represents a regular expression and is used for pattern
matching in text.
The fields in a
Regexp
are:
| pattern |
A
String
that specifies the regular expression pattern.
Pattern syntax is discussed below.
| | type |
An
int
that contains zero or more of the following flags:
| CASE_INSENSITIVE |
Indicates that pattern matching should be case insensitive.
| | SHELL_PATTERN |
Indicates that the
pattern should be interpreted using shell pattern matching conventions.
Pattern syntax is discussed below.
| | RAWSHELL_PATTERN |
Like
SHELL_PATTERN,
this flag means the pattern should be interpreted using shell pattern matching conventions.
The difference is that a backslash character has no special meaning, whereas for
SHELL_PATTERN
a backslash can be used to escape the special meaning of the character that follows it.
| | TEXT_PATTERN |
Indicates that the
pattern should be interpreted literally: no character has a special
meaning, unlike in the other pattern types.
| | AUTO_PATTERN |
Indicates that the
pattern should be interpreted as one or more
TEXT_PATTERN,
SHELL_PATTERN
or
RAWSHELL_PATTERN
sub-patterns each separated by a pair of vertical bars (pipes), which indicates an
OR
operation should apply among all the sub-patterns.
When a sub-pattern contains an asterisk, a question mark or a square-bracketed
character class, the sub-pattern is considered either a
SHELL_PATTERN
or a
RAWSHELL_PATTERN,
the latter case occurring when the
RAWSHELL_PATTERN
constant is included in the
type
flag.
Otherwise, the sub-pattern is considered to be a simple
TEXT_PATTERN.
When a sub-pattern is determined to be either type of shell pattern,
then a backslash can be used to escape the meaning of the
asterisk, question mark or square brackets.
| | SINGLE_BYTE |
Indicates that character classes should be
restricted to one byte.
When pattern matching is against Unicode character sets whose second byte
is always zero it is more efficient to restrict character classes to a
single byte since that is all that is needed.
The standard encoding for the US, ISO8859_1, is an example of a case where
single byte character classes would be appropriate.
Character classes are discussed below.
|
The default value is
SINGLE_BYTE.
|
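The AUTO_PATTERN rules above can be illustrated with a rough sketch. It is written in Python rather than Yoix and is not the Yoix implementation: `auto_match` is a hypothetical helper, `fnmatch` stands in for shell matching (so a backslash has no special meaning, as with RAWSHELL_PATTERN), a sub-pattern is treated as a shell pattern whenever it contains `*`, `?` or `[`, and a text sub-pattern is assumed to have to equal the whole target.

```python
import fnmatch

# Characters that mark a sub-pattern as a shell pattern in this sketch.
SHELL_META = set("*?[")


def auto_match(pattern: str, text: str) -> bool:
    """Return True if any '||'-separated sub-pattern matches text.

    Assumptions (not taken from the Yoix source): a literal text
    sub-pattern must equal the whole target, and backslash escapes
    are not handled.
    """
    for sub in pattern.split("||"):
        if SHELL_META & set(sub):
            # Shell-style sub-pattern: fnmatchcase requires a full match.
            if fnmatch.fnmatchcase(text, sub):
                return True
        elif sub == text:
            # No metacharacters: compare as literal text.
            return True
    return False
```

For instance, `auto_match("README||*.txt", "notes.txt")` succeeds through the shell sub-pattern, while `auto_match("a.b", "axb")` fails because the dot is literal text.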
Regular expressions used in Yoix can take one of two forms: either standard
regular expression syntax or the simplified syntax commonly associated with
shell pattern matching.
For standard regular expressions, the following syntax applies:
-
Any character, except characters with special syntactic meaning in a
standard regular expression, matches one occurrence of that character.
A backslash
(\)
can be used to remove the special syntactic meaning of the character that
follows the backslash.
-
A dot or period
(.)
matches one occurrence of any character.
-
A non-empty string,
s,
enclosed in brackets, namely
[s],
is called a character class and matches one occurrence of
any of the characters represented by
s.
A substring in
s
of the form
x-y,
where
x
is lexicographically before
y,
represents all characters in the range of
x
through
y,
inclusive.
To represent a right bracket
(])
it should be the first character in
s,
while a dash
(-)
should be the first or last character in
s.
A caret
(^)
can be used as follows,
[^s],
to indicate a character class that matches one occurrence of any of
the characters not represented by
s.
-
An asterisk
(*)
matches zero or more occurrences of the regular expression component
immediately preceding it.
-
A plus sign
(+)
matches one or more occurrences of the regular expression component
immediately preceding it.
-
A question mark
(?)
matches zero or one occurrence of the regular expression component
immediately preceding it.
-
Parentheses can be used to group together several regular expression
components into a subexpression that is treated as a single regular
expression component.
-
A vertical bar
(|)
separating two regular expression components combines them into a single
regular expression component that matches whenever either of the original
two components matches.
-
A caret
(^)
at the start of a regular expression constrains the regular expression to
matches occurring at the beginning of the target text.
A dollar sign
($)
at the end of a regular expression constrains the regular expression to
matches occurring at the end of the target text.
Standard regular expressions find the longest match, starting from the left, in the target text.
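The operators listed above behave the same way in most regular expression engines, so they can be tried out in Python's `re` module. This is only an illustration, not Yoix code, and `first_match` is a hypothetical helper. One difference matters: Python takes the first alternative that succeeds rather than the longest match, so results may differ from the longest-match rule stated above.

```python
import re

# Each entry: (pattern, target text, expected matched substring or None).
cases = [
    (r"a.c", "xabcx", "abc"),        # dot matches any one character
    (r"ab*", "abbbc", "abbb"),       # * : zero or more of the preceding component
    (r"ab+", "ac ab", "ab"),         # + : one or more of the preceding component
    (r"colou?r", "color", "color"),  # ? : zero or one of the preceding component
    (r"(ab)+", "ababx", "abab"),     # parentheses group components
    (r"cat|dog", "hotdog", "dog"),   # | : either alternative may match
    (r"^the", "the end", "the"),     # ^ anchors at the start of the text
    (r"end$", "the end", "end"),     # $ anchors at the end of the text
    (r"1\+1", "1+1=2", "1+1"),       # backslash removes special meaning
    (r"^the", "in the end", None),   # anchored pattern fails mid-text
]


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Caveat: Python tries alternatives left to right, so this yields "a";
# a longest-match engine would be expected to yield "ab".
leftmost_first = first_match(r"a|ab", "ab")
```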
Simplified, or shell pattern, regular expressions have the following syntax:
-
Any character, except characters with special syntactic meaning in a
simplified regular expression, matches one occurrence of that character.
A backslash
(\)
can be used to remove the special syntactic meaning of the character that
follows the backslash.
-
A question mark
(?)
matches one occurrence of any character.
-
An asterisk
(*)
matches zero or more occurrences of any character.
-
A non-empty string,
s,
enclosed in brackets, namely
[s],
is called a character class and matches one occurrence of
any of the characters represented by
s.
A substring in
s
of the form
x-y,
where
x
is lexicographically before
y,
represents all characters in the range of
x
through
y,
inclusive.
To represent a right bracket
(])
it should be the first character in
s.
To represent a dash
(-)
it should be the last character in
s.
An exclamation point
(!)
can be used as follows,
[!s],
to indicate a character class that matches one occurrence of any of
the characters not represented by
s.
A simplified regular expression must match the entire target text to be
considered a successful match.
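One way to read the shell pattern rules above is as a translation into a standard regular expression that must match the whole target. The sketch below does this in Python as an illustration only, not as a description of how Yoix works. `shell_to_regex` and `shell_match` are hypothetical names, the `raw` parameter imitates the RAWSHELL_PATTERN behavior in which a backslash is an ordinary character, and a backslash inside brackets is assumed to be literal.

```python
import re


def shell_to_regex(pattern: str, raw: bool = False) -> str:
    """Translate a shell pattern into a Python regular expression."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        i += 1
        if c == "\\" and not raw and i < n:
            out.append(re.escape(pattern[i]))  # escaped character is literal
            i += 1
        elif c == "?":
            out.append(".")                    # any one character
        elif c == "*":
            out.append(".*")                   # zero or more characters
        elif c == "[":
            j = i
            if j < n and pattern[j] == "!":
                j += 1
            if j < n and pattern[j] == "]":    # a leading ] is a class member
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            if j >= n:
                out.append(re.escape(c))       # no closing bracket: literal [
            else:
                body = pattern[i:j]
                negate = body.startswith("!")
                if negate:
                    body = body[1:]
                # Escape characters that are special inside a Python class.
                body = body.replace("\\", "\\\\")
                body = body.replace("^", "\\^").replace("[", "\\[")
                out.append("[" + ("^" if negate else "") + body + "]")
                i = j + 1
        else:
            out.append(re.escape(c))           # ordinary character
    return "".join(out)


def shell_match(pattern: str, text: str, raw: bool = False) -> bool:
    """True only if the pattern matches the entire target text."""
    regex = shell_to_regex(pattern, raw)
    return re.fullmatch(regex, text, re.DOTALL) is not None
```

With this sketch, `shell_match("*.txt", "notes.txt")` succeeds, while `shell_match("note", "notes")` fails because the pattern does not cover the entire text.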
Character classes describe a set of characters enclosed in
square brackets
([])
and, normally, match
any single character from the set.
If the first character in the set is a caret
(^),
in the case of standard regular expressions, or an exclamation point
(!),
in the case of shell-pattern expressions, then any single character
not
in the set will be matched.
Separating two characters by a dash
(-)
indicates all the Unicode characters between and including those two
characters.
To include a right square bracket
(]),
it should be the first character in the set (after a possible caret).
To include a dash,
it should be the last character in the set.
When matching a pattern, the longest of all possible matches is used.
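The placement rules for the right bracket, the dash and the caret can be checked with Python's `re` module, whose character class syntax follows the same conventions for these cases. This is an illustration in Python, not Yoix, and `in_class` is a hypothetical helper.

```python
import re


def in_class(char_class: str, ch: str) -> bool:
    """True if the single character ch matches the bracketed class."""
    return re.fullmatch(char_class, ch) is not None


examples = {
    ("[a-f]", "c"): True,    # range includes everything from a through f
    ("[a-f]", "g"): False,
    ("[^a-f]", "g"): True,   # leading caret negates the set
    ("[]x]", "]"): True,     # right bracket placed first is a member
    ("[^]x]", "]"): False,   # same rule applies after the caret
    ("[a-]", "-"): True,     # dash placed last is a member, not a range
    ("[a-]", "b"): False,
    ("[α-ω]", "λ"): True,    # ranges work over Unicode characters
}
```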
| |
| Example: |
The program,
    import yoix.*.*;

    Regexp re;
    String line;
    Stream page =
        open("http://www.research.att.com/sw/tools/yoix/", "r");

    re.pattern = "^[k-n]|(the)";

    while (line = page.nextline) {
        if (regexec(re, line))
            puts(line);
    }
will print to stdout all lines in the HTML source of the Yoix home page
that either begin with the letters "k" through "n" or contain the
letters "the" together.
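To see which lines such a pattern selects without fetching a web page, the same expression can be applied to a few sample lines. The sketch below uses Python's `re` module instead of Yoix. The sample lines and the `matching_lines` helper are made up for illustration.

```python
import re

# Same pattern as in the Yoix example: the caret applies only to the
# first alternative, so "the" may appear anywhere in the line.
PATTERN = re.compile(r"^[k-n]|(the)")


def matching_lines(lines):
    """Return the lines that start with k-n or contain "the"."""
    return [line for line in lines if PATTERN.search(line)]


sample = [
    "language reference",  # starts with "l"
    "Yoix home page",      # no match
    "see the manual",      # contains "the"
    "net tools",           # starts with "n"
    "Other",               # contains "the" inside a word
    "Kernel",              # uppercase "K" is outside k-n
]
selected = matching_lines(sample)
```

Note that the match is case sensitive here, so a line starting with an uppercase letter is selected only if it contains "the".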
| | |
| See Also: |
gsubsti,
gvsubsti,
regexec,
regexp,
regsub,
Subexp,
substi,
vsubsti
|
|