Next: System Utilities
Up: PLT MzScheme: Language Manual
Previous: Custodians
MzScheme provides built-in support for regular expression pattern
matching on strings, implemented by Henry Spencer's package. Regular
expressions are specified as strings, using the same pattern language
as the Unix utility egrep. String-based regular expressions can
be compiled into a regexp value for repeated matches.
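As a minimal sketch of compiling once and reusing the result, the following uses only the procedures documented in this chapter. The name digits and the sample strings are illustrative assumptions.

```scheme
;; Compile the pattern once and keep the regexp value.
(define digits (regexp "[0-9]+"))   ; `digits` is an illustrative name

(regexp? digits)                    ; => #t
(regexp? "[0-9]+")                  ; => #f (still a plain string)

;; Either form is accepted as the pattern argument.
(regexp-match digits "room 101")    ; => ("101")
(regexp-match "[0-9]+" "room 101")  ; => ("101")
```

The string form is convenient for a single match; the compiled form avoids compiling the same pattern again on every call.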
Figure 10.1: Grammar for regular expressions
The format of a regular expression is specified by the grammar in
Figure 10.1. A few subtle points about the regexp
language are worth noting:
- When an opening square bracket (``['') that starts a range is
immediately followed by a closing square bracket (``]''), then the
closing square bracket is part of the range, instead of ending an
empty range. For example, "[]a]" matches any string that
contains a lowercase ``a'' or a closing square bracket. A dash (``-'')
at the start or end of a range is treated specially in the same way.
- When a caret (``^'') or dollar sign (``$'') appears in
the middle of a regular expression (not in a range), the resulting
regexp is legal even though it is usually not matchable. For example,
"a$b" is unmatchable because no string can contain the letter
``b'' after the end of the string. On the other hand, "a$b*"
matches any string that ends with a lowercase ``a'', since zero ``b''s
will match the part of the regexp after ``$''.
- A backslash (``\'') in a regexp pattern specified
with a Scheme string literal must be protected with an additional
backslash. For example, the string literal
"\\." describes a pattern that
matches any string containing a period. In this case, the first
backslash protects the second to generate a Scheme string containing
two characters, a backslash and a period; that remaining backslash
then protects the period in the regexp pattern.
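The three points above can be checked with a few short matches. This is a sketch with illustrative input strings, not an exhaustive test of the grammar.

```scheme
;; A "]" right after the opening "[" is a member of the range,
;; and so is a "-" at the start of a range.
(regexp-match "[]a]" "x]y")      ; => ("]")
(regexp-match "[-a]" "x-y")      ; => ("-")

;; "a$b" is legal but can never match; "a$b*" matches a trailing "a".
(regexp-match "a$b" "ab")        ; => #f
(regexp-match "a$b*" "banana")   ; => ("a")

;; The literal "\\." is a two-character string: a backslash and a period.
(string-length "\\.")            ; => 2
(regexp-match "\\." "3.14")      ; => (".")
```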
The regular expression procedures are:
- (regexp str) takes a string
representation of a regular expression and compiles it into a regexp
value. Other regular expression procedures accept either a string or a
regexp value as the matching pattern. If a regular expression string is
used multiple times, it is faster to compile the string once to a regexp
value and use it for repeated matches instead of using the string each
time.
- (regexp? v) returns #t
if v is a regexp value created by regexp, #f otherwise.
- (regexp-match pattern str)
attempts to match pattern (a string or a regexp value) to a
portion of str. If the match fails, #f is returned. If
the match succeeds, a list of strings is returned. The first
string in this list is the portion of str that matched
pattern. If two portions of str can match pattern,
then the earliest and longest match is found.
Additional strings are returned in the list if
pattern contains parenthesized sub-expressions; matches for
the sub-expressions are provided in the order of the opening
parentheses in pattern. When sub-expressions occur in different
branches of an ``or'' (|), a #f is returned for each
sub-expression that was not used in the match. When a single
sub-expression is used multiple times in a match (e.g., a
sub-expression is followed by ``*''), then the rightmost match
associated with the sub-expression is returned in the list.
- (regexp-match-positions pattern str) is like
regexp-match, but instead of returning a list of strings, a
list of number pairs is returned. Each pair of numbers refers to the
range of characters in str that matched the corresponding
(sub-)expression.
- (regexp-replace pattern src insert) performs
a match using pattern on src and then returns a string in
which the matching portion of src is replaced with insert.
If pattern matches no part of src, then src is
returned unmodified.
If insert contains ``&'', then ``&'' is replaced with the matching
portion of src before it is substituted into src. If
insert contains ``\n'' (for some integer n),
then it is replaced with the nth matching sub-expression from
src. ``&'' and
``\0'' are synonymous. If the nth sub-expression was
not used in the match or if n is greater than the number of
sub-expressions in pattern, then ``\n'' is
replaced with the empty string.
A literal ``&'' or ``\'' is specified as
``\&'' or ``\\'', respectively. If
insert contains ``\$'', then ``\$'' is
replaced with the empty string. (This can be used to terminate a
number n following a backslash.) If a ``\'' is
followed by anything other than a digit, ``&'', ``\'', or
``$'', then it is treated as ``\0''.
- (regexp-replace* pattern src insert) is
the same as regexp-replace, except that every instance of
pattern in src is replaced with insert. Only
non-overlapping instances of pattern in the original src
are replaced, so instances of pattern within inserted strings
are not replaced recursively.
Examples:
(define r (regexp "(-[0-9]*)+"))
(regexp-match r "a-12-345b") ; => ("-12-345" "-345")
(regexp-match-positions r "a-12--345b") ; => ((1 . 9) (5 . 9))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "me" "me casa" "su") ; => "su casa"
(define r2 (regexp "([Mm])e ([a-zA-Z]*)"))
(define insert "\\1y \\2")
(regexp-replace r2 "Me Casa" insert) ; => "My Casa"
(regexp-replace r2 "me cerveza Me Me Me" insert) ; => "my cerveza Me Me Me"
(regexp-replace* r2 "me cerveza Me Me Me" insert) ; => "my cerveza My Me Me"
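The rules for unused and repeated sub-expressions, and the special sequences in an insert string, can be illustrated in the same style. This is a sketch under the semantics described above; the patterns and input strings are illustrative.

```scheme
;; Sub-expressions in different branches of "|": the unused one yields #f.
(regexp-match "(a)|(b)" "b")                 ; => ("b" #f "b")
(regexp-match-positions "(a)|(b)" "xb")      ; => ((1 . 2) #f (1 . 2))

;; A sub-expression followed by "*" reports its rightmost match.
(regexp-match "([a-z])*" "abc")              ; => ("abc" "c")

;; "&" and "\\0" both stand for the whole match.
(regexp-replace "ca" "me casa" "[&]")        ; => "me [ca]sa"
(regexp-replace "ca" "me casa" "<\\0>")      ; => "me <ca>sa"

;; "\\&" inserts a literal ampersand.
(regexp-replace "ca" "me casa" "\\&")        ; => "me &sa"

;; "\\$" ends the number, so this is sub-expression 1 followed by "0".
(regexp-replace "(a)" "cat" "\\1\\$0")       ; => "ca0t"

;; An unused sub-expression is replaced with the empty string.
(regexp-replace "(a)|(b)" "b" "[\\1]")       ; => "[]"

;; Inserted text is not scanned again for further matches.
(regexp-replace* "a" "aaa" "aa")             ; => "aaaaaa"
```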