|
regex implementation categorization
Glenn Fowler <gsf@research.att.com>
AT&T Research - Florham Park NJ
The
regex
tests in
categorize.dat
attempt to categorize
regex
implementations.
The tests do not address internationalization.
All implementations report the leftmost match; this is omitted from the table.
| LABEL | ASSOC | SUBEXPR | REP_LONGEST | BUGS |
|
A | right | precedence | first | - |
|
B | right | grouping | first | repeat-null repeat-short repeat-artifact-nomatch |
|
D | right | grouping | first | - |
|
G | right | grouping | first | alternation-order repeat-null repeat-artifact repeat-artifact-nomatch |
|
H | right | grouping | first | alternation-order nomatch-match repeat-null repeat-artifact repeat-artifact-nomatch |
|
I | right | grouping | first | repeat-any repeat-short repeat-artifact-nomatch |
|
J | right | precedence | last | nomatch-match repeat-artifact repeat-artifact-nomatch subexpression-first |
|
M | right | precedence | last | range-null repeat-artifact repeat-artifact-nomatch subexpression-first |
|
O | right | grouping | first | repeat-null repeat-short repeat-artifact-nomatch |
|
P | right | grouping | first | alternation-order first-match repeat-null repeat-artifact |
|
R | left | precedence | last | - |
|
S | right | grouping | first | repeat-null repeat-short repeat-artifact-nomatch |
|
T | left | precedence | last | - |
|
U | right | precedence | first | repeat-null |
|
darwin.ppc | right | grouping | first | repeat-null repeat-short |
|
freebsd.i386 | right | grouping | first | repeat-null repeat-short |
|
hp.pa | right | grouping | first | repeat-artifact |
|
ibm.risc | right | grouping | first | alternation-order nomatch-match repeat-artifact repeat-artifact-nomatch |
|
linux.i386 | right | grouping | first | alternation-order repeat-artifact repeat-null |
|
sgi.mips3 | right | grouping | first | repeat-short |
|
sol8.sun4 | right | grouping | first | alternation-order nomatch-match repeat-artifact |
|
unixware.i386 | right | precedence | first | repeat-null |
|
The categories are:
- LABEL
-
The implementation label from
testregex.
- ASSOC
-
Subpattern (or atom) associativity: either
left
or
right.
The subexpression match rule in the rationale requires
right
for expressions where each concatenated part is a subexpression.
There is no definition for
subpattern,
but it would be inconsistent for any definition to require different
associativity than that for subexpressions.
Some claim that the BRE and ERE grammars specify
left
associativity, but this interpretation disregards
the subexpression match rule in the rationale.
The grammar can also be interpreted to support
right
associativity, and this interpretation is in accord with the rationale.
- SUBEXPR
-
Subexpression semantics:
precedence
if subexpressions can override the default associativity;
grouping
if subexpressions are for repetition and
regmatch_t
grouping only.
The subexpression match rule in the rationale requires
precedence.
- REP_LONGEST
-
How repeated subexpressions that match more than once are handled:
first
if the longest possible matches occur first;
last
if the longest possible matches occur last;
unknown
otherwise.
The subexpression match rule in the rationale requires
first.
- BUGS
-
Miscellaneous bugs (see
categorize.dat
for specific examples):
- alternation-order
-
A change in the order of subexpression alternation operands,
not involved in a tie,
changes
regmatch_t
values.
Some implementations with this bug can be coaxed into missing the
overall longest match.
- first-match
-
The first of the leftmost matches, instead of the longest of the
leftmost matches, is returned.
- nomatch-match
-
A back-reference to a
regmatch_t
(-1,-1) value is treated as matching.
- range-null
-
A range-repeated subexpression that matches null does not report the match
at offset (0,0).
- repeat-artifact
-
A
regmatch_t
value is reported for a repeated match that is not the last match.
- repeat-artifact-nomatch
-
To prevent not matching,
a
regmatch_t
value is reported for a repeated match that is not the last match.
- repeat-null
-
A repeated subexpression matches the null string even though it is not
the only match and is not necessary to satisfy the exact or minimum
number of occurrences for an interval expression.
- repeat-short
-
Incorrect
regmatch_t
values for a repeated subexpression.
This may be a variant of
repeat-artifact.
- subexpression-first
-
A subexpression match takes precedence over a subpattern
to its left.
|
|
Glenn Fowler |
|
|
Information and Software Systems Research |
|
|
AT&T Labs Research |
|
|
Florham Park NJ |
|
|
August 23, 2007 |
|