AT&T Labs, Inc. - Research

The Yoix® Scripting Language

Regexp typedict
 
A Regexp object represents a regular expression and is used for pattern matching in text. The fields in a Regexp are:
pattern  A String that specifies the regular expression pattern. Pattern syntax is discussed below.
type  An int that contains zero or more of the following flags:
  • CASE_INSENSITIVE  Indicates that pattern matching should be case insensitive.
  • SHELL_PATTERN  Indicates that the pattern should be interpreted using shell pattern matching conventions. Pattern syntax is discussed below.
  • RAWSHELL_PATTERN  Like SHELL_PATTERN, this flag means the pattern should be interpreted using shell pattern matching conventions. The difference is that a backslash has no special meaning here, whereas in a SHELL_PATTERN a backslash can be used to escape the special meaning of the character that follows it.
  • TEXT_PATTERN  Indicates that the pattern should be interpreted literally, with no characters having the special meanings they have in the other pattern types.
  • AUTO_PATTERN  Indicates that the pattern should be interpreted as one or more TEXT_PATTERN, SHELL_PATTERN or RAWSHELL_PATTERN sub-patterns, each separated by a pair of vertical bars (||), which applies an OR operation among all the sub-patterns. A sub-pattern that contains an asterisk, a question mark or a square-bracketed character class is treated as a SHELL_PATTERN, or as a RAWSHELL_PATTERN when the RAWSHELL_PATTERN constant is included in the type flags. Any other sub-pattern is treated as a simple TEXT_PATTERN. Within a sub-pattern treated as either type of shell pattern, a backslash can be used to escape the meaning of the asterisk, question mark or square brackets.
  • SINGLE_BYTE  Indicates that character classes should be restricted to one byte. When matching against Unicode character sets whose second byte is always zero, it is more efficient to restrict character classes to a single byte, since that is all that is needed. ISO8859_1, the standard encoding for the US, is an example of a case where single-byte character classes would be appropriate. Character classes are discussed below.
The default value is SINGLE_BYTE.
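The AUTO_PATTERN rule can be illustrated with a short sketch. It is written in Python rather than Yoix, the function name classify_auto_pattern and its raw parameter are hypothetical, and it assumes that any asterisk, question mark or bracketed class (escaped or not) marks a sub-pattern as a shell pattern; the actual Yoix implementation may decide differently.

```python
import re

# Hypothetical helper, not part of Yoix: a simplified reading of AUTO_PATTERN.
# A bracketed class is detected as '[' followed by at least one character and ']'.
_BRACKET_CLASS = re.compile(r"\[.+?\]")

def classify_auto_pattern(pattern, raw=False):
    """Split on '||' and label each sub-pattern with the type it would get."""
    shell_kind = "RAWSHELL_PATTERN" if raw else "SHELL_PATTERN"
    labelled = []
    for sub in pattern.split("||"):
        looks_like_shell = (
            "*" in sub or "?" in sub or _BRACKET_CLASS.search(sub) is not None
        )
        labelled.append((sub, shell_kind if looks_like_shell else "TEXT_PATTERN"))
    return labelled
```

For instance, classify_auto_pattern("readme||*.txt||data[0-9]") labels the first sub-pattern as TEXT_PATTERN and the other two as SHELL_PATTERN.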
Regular expressions used in Yoix can take one of two forms: either standard regular expression syntax or the simplified syntax commonly associated with shell pattern matching. For standard regular expressions, the following syntax applies:
  • Any character, except characters with special syntactic meaning in a standard regular expression, matches one occurrence of that character. A backslash (\) can be used to remove the special syntactic meaning of the character that follows the backslash.
  • A dot or period (.) matches one occurrence of any character.
  • A non-empty string, s, enclosed in brackets, namely [s], is called a character class and matches one occurrence of any of the characters represented by s. A substring in s of the form x-y, where x is lexicographically before y, represents all characters in the range of x through y, inclusive. To represent a right bracket (]), it should be the first character in s, while a dash (-) should be the first or last character in s. A caret (^) can be used as follows, [^s], to indicate a character class that matches one occurrence of any of the characters not represented by s.
  • An asterisk (*) indicates that the regular expression component immediately preceding it matches zero or more occurrences of that component.
  • A plus sign (+) indicates that the regular expression component immediately preceding it matches one or more occurrences of that component.
  • A question mark (?) indicates that the regular expression component immediately preceding it matches zero or one occurrence of that component.
  • Parentheses can be used to group together several regular expression components into a subexpression that is treated as a single regular expression component.
  • A vertical bar (|) separating two regular expression components combines them into a single regular expression component that matches when either of the original two components matches.
  • A caret (^) at the start of a regular expression constrains the regular expression to matches occurring at the beginning of the target text. A dollar sign ($) at the end of a regular expression constrains the regular expression to matches occurring at the end of the target text.
A standard regular expression matches the longest possible string, starting from the left, in the target text.
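The constructs listed above can be tried out with a few small cases. The sketch below uses Python's re module purely as a stand-in, since it shares this basic syntax; it is not Yoix code, and the helper name matched_text is hypothetical. Python's alternation picks the first alternative that matches rather than the longest, so the cases are chosen so that this difference does not matter.

```python
import re

def matched_text(pattern, text):
    """Return the text matched by pattern in text, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Each pair is (pattern, target text); comments show the matched text.
matched_text(r"a.c", "xabcx")        # 'abc'     dot matches any one character
matched_text(r"[k-n]+", "jklmnop")   # 'klmn'    character class with a range
matched_text(r"[^0-9]+", "42abc7")   # 'abc'     negated character class
matched_text(r"ab*", "abbbc")        # 'abbb'    zero or more
matched_text(r"ab+", "ac")           # None      one or more is required
matched_text(r"colou?r", "color")    # 'color'   zero or one
matched_text(r"(ab)+", "ababab!")    # 'ababab'  grouping
matched_text(r"cat|dog", "hotdog")   # 'dog'     alternation
matched_text(r"^the", "in the end")  # None      anchored at the start
matched_text(r"end$", "the end")     # 'end'     anchored at the end
matched_text(r"1\.5", "1x5 1.5")     # '1.5'     backslash makes the dot literal
```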

Simplified, or shell pattern, regular expressions have the following syntax:

  • Any character, except characters with special syntactic meaning in a simplified regular expression, matches one occurrence of that character. A backslash (\) can be used to remove the special syntactic meaning of the character that follows the backslash.
  • A question mark (?) matches one occurrence of any character.
  • An asterisk (*) matches zero or more occurrences of any character.
  • A non-empty string, s, enclosed in brackets, namely [s], is called a character class and matches one occurrence of any of the characters represented by s. A substring in s of the form x-y, where x is lexicographically before y, represents all characters in the range of x through y, inclusive. To represent a right bracket (]) it should be the first character in s. To represent a dash (-) it should be the last character in s. An exclamation point (!) can be used as follows, [!s], to indicate a character class that matches one occurrence of any of the characters not represented by s.
A simplified regular expression must match the entire target text to be considered a successful match.
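One way to make the shell pattern rules concrete is to translate a shell pattern into an ordinary regular expression and require it to match the whole target text. The sketch below does this in Python, not Yoix; the names shell_to_regex and shell_match and the raw parameter (standing in for RAWSHELL_PATTERN) are hypothetical. It also assumes that a backslash inside brackets is an ordinary character, which the description above does not settle.

```python
import re

def shell_to_regex(pattern, raw=False):
    """Translate a shell pattern into a Python regex string.

    raw=True mimics RAWSHELL_PATTERN: a backslash is an ordinary character.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and not raw and i + 1 < n:
            # A backslash removes the special meaning of the next character.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if c == "?":
            out.append(".")              # one occurrence of any character
        elif c == "*":
            out.append(".*")             # zero or more of any character
        elif c == "[":
            j = i + 1
            if j < n and pattern[j] == "!":
                j += 1                   # skip the negation marker
            if j < n and pattern[j] == "]":
                j += 1                   # a leading ']' is a literal member
            while j < n and pattern[j] != "]":
                j += 1
            if j < n:
                body = pattern[i + 1:j]
                negate = body.startswith("!")
                if negate:
                    body = body[1:]
                # Escape characters Python treats specially inside a class.
                body = "".join("\\" + ch if ch in "\\^[]" else ch for ch in body)
                out.append("[" + ("^" if negate else "") + body + "]")
                i = j + 1
                continue
            out.append(re.escape(c))     # no closing bracket: literal '['
        else:
            out.append(re.escape(c))     # ordinary character
        i += 1
    return "".join(out)

def shell_match(pattern, text, raw=False):
    """True only when the pattern matches the entire target text."""
    return re.fullmatch(shell_to_regex(pattern, raw), text, re.DOTALL) is not None
```

With this sketch, shell_match("*.txt", "notes.txt") succeeds while shell_match("*.txt", "notes.txt.bak") fails, because the pattern must account for the entire target text.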

A character class is a set of characters enclosed in square brackets ([]) that normally matches any single character from the set. If the first character in the set is a caret (^), in the case of standard regular expressions, or an exclamation point (!), in the case of shell-pattern expressions, then any single character not in the set will be matched. Two characters separated by a dash (-) stand for all the Unicode characters between and including those two characters. To include a right square bracket (]), it should be the first character in the set (after a possible caret). To include a dash, it should be the last character in the set.

When matching a pattern, the longest of all possible matches is used.
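The longest-match rule matters mostly for alternation. As a rough illustration in Python (not Yoix), the hypothetical helper below compares the top-level alternatives of a pattern and keeps the match that starts earliest and, among those, is longest. It only handles alternatives that are themselves simple patterns, and it is a sketch of the rule rather than of how Yoix implements it. Python's own re module behaves differently: it stops at the first alternative that matches.

```python
import re

def leftmost_longest(alternatives, text):
    """Return the earliest, then longest, match among the given alternatives."""
    best_key, best_text = None, None
    for alt in alternatives:
        m = re.search(alt, text)
        if m is None:
            continue
        # Earlier start wins; for equal starts, the longer match wins.
        key = (m.start(), -(m.end() - m.start()))
        if best_key is None or key < best_key:
            best_key, best_text = key, m.group(0)
    return best_text

# Python's first-match alternation: re.search("a|ab", "xab").group(0) is 'a'.
# Under the longest-match rule the result is 'ab':
# leftmost_longest(["a", "ab"], "xab")
```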
 
 Example:   The program,
import yoix.*.*;

Regexp re;
String line;
Stream page =
    open("http://www.research.att.com/sw/tools/yoix/", "r");

re.pattern = "^[k-n]|(the)";

while (line = page.nextline) {
    if (regexec(re, line))
        puts(line);
}
will print to stdout all lines in the HTML source of the Yoix home page that either begin with one of the letters "k" through "n" or contain the string "the".
 
 See Also:   gsubsti, gvsubsti, regexec, regexp, regsub, Subexp, substi, vsubsti

 

Yoix is a registered trademark of AT&T Intellectual Property.