8.9 Regular expressions

The procedures in this section provide access to POSIX regular expression matching. The regular expression syntax and semantics are far too complex to be described here.

Note: Because the C interface uses ASCII NUL bytes to mark the ends of strings, patterns & strings that contain NUL characters will not work correctly.

8.9.1 Direct POSIX regular expression interface

The first interface to regular expressions is a thin layer over the interface that POSIX provides. It is exported by the structures posix-regexps & posix.

— procedure: make-regexp string option ... –> regexp
— procedure: regexp? object –> boolean

Make-regexp creates a regular expression with the given string pattern. The arguments after string specify various options for the regular expression; see regexp-option below. The regular expression is not compiled until it is matched against a string, so any errors in the pattern string will not be reported until that point. Regexp? is the disjoint type predicate for regular expression objects.

— syntax: regexp-option name –> regexp-option

Evaluates to a regular expression option, suitable to be passed to make-regexp, with the given name. The possible option names are:

extended
     use the extended patterns
ignore-case
     ignore case differences when matching
submatches
     report submatches
newline
     treat newlines specially

— procedure: regexp-match regexp string start submatches? starts-line? ends-line? –> boolean or list of matches

Regexp-match matches regexp against the characters in string, starting at position start. If the string does not match the regular expression, regexp-match returns #f. If the string does match, regexp-match returns a list of match records if submatches? is true, or #t if submatches? is false. The first match record gives the location of the substring that matched regexp. If the pattern in regexp contained submatches, then the submatches are returned in order, with match records in the positions where submatches succeeded and #f in the positions where submatches failed.

Starts-line? should be true if string starts at the beginning of a line, and ends-line? should be true if it ends one.
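
As a rough sketch of the direct interface, the following assumes that the posix-regexps structure has been opened (for instance with ,open posix-regexps at the REPL). The name lower-word is made up for illustration, and the values in the comments are those expected from the description above, not a captured transcript.

     ; Extended syntax is requested so that + means `one or more'.
     (define lower-word
       (make-regexp "[a-z]+" (regexp-option extended)))

     (regexp? lower-word)                            ; expected: #t
     (regexp-match lower-word "hello" 0 #f #t #t)    ; expected: #t
     (regexp-match lower-word "12345" 0 #f #t #t)    ; expected: #f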

— procedure: match? object –> boolean
— procedure: match-start match –> integer
— procedure: match-end match –> integer
— procedure: match-submatches match –> alist

Match? is the disjoint type predicate for match records. Match records contain three values: the beginning & end of the substring that matched the pattern and an association list of submatch keys and corresponding match records for any named submatches that also matched. Match-start returns the index of the first character in the matching substring, and match-end gives the index of the first character after the matching substring. Match-submatches returns the alist of submatches.
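
To illustrate how match records might be consumed, here is a small sketch that extracts the matched text from a string. It assumes the posix-regexps structure is open; digits and matched-text are hypothetical names introduced only for this example.

     (define digits
       (make-regexp "[0-9]+"
                    (regexp-option extended)
                    (regexp-option submatches)))

     ; Return the text of the first match of REGEXP in STRING, or #f
     ; if there is no match.  The first record in the list returned by
     ; regexp-match describes the whole match.
     (define (matched-text regexp string)
       (let ((matches (regexp-match regexp string 0 #t #t #t)))
         (and matches
              (let ((m (car matches)))
                (substring string (match-start m) (match-end m))))))

     (matched-text digits "abc123def")    ; expected: "123"

Because match-end is the index just past the match, the two indices can be handed to substring unchanged.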

8.9.2 High-level regular expression construction

This section describes a functional interface for building regular expressions and matching them against strings, higher-level than the direct POSIX interface. The matching is done using the POSIX regular expression package. Regular expressions constructed by procedures listed here are compatible with those in the previous section; that is, they satisfy the predicate regexp? from the posix-regexps structure. These names are exported by the structure regexps.

8.9.2.1 Character sets

Character sets may be defined using a list of characters and strings, using a range or ranges of characters, or by using set operations on existing character sets.

— procedure: set char-or-string ... –> char-set-regexp
— procedure: range low-char high-char –> char-set-regexp
— procedure: ranges low-char high-char ... –> char-set-regexp
— procedure: ascii-range low-char high-char –> char-set-regexp
— procedure: ascii-ranges low-char high-char ... –> char-set-regexp

Set returns a character set that contains all of the character arguments and all of the characters in all of the string arguments. Range returns a character set that contains all characters between low-char and high-char, inclusive. Ranges returns a set that contains all of the characters in the given set of ranges. Range & ranges use the ordering imposed by char->integer. Ascii-range & ascii-ranges are like range & ranges, but they use the ASCII ordering. Ranges & ascii-ranges must be given an even number of arguments. It is an error for a high-char to be less than the preceding low-char in the appropriate ordering.
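
A brief sketch of the constructors, assuming the regexps structure is open (e.g. via ,open regexps); the names on the left are illustrative only.

     ; Characters and strings may be mixed freely in set.
     (define vowels (set "aeiou" #\y))

     ; A single inclusive range.
     (define octal-digit (range #\0 #\7))

     ; Several ranges at once; the arguments come in low/high pairs,
     ; so there must be an even number of them.
     (define hex-char (ranges #\0 #\9 #\a #\f #\A #\F))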

— procedure: negate char-set –> char-set-regexp
— procedure: union char-set-a char-set-b –> char-set-regexp
— procedure: intersection char-set-a char-set-b –> char-set-regexp
— procedure: subtract char-set-a char-set-b –> char-set-regexp

Set operations on character sets. Negate returns a character set of all characters that are not in char-set. Union returns a character set that contains all of the characters in char-set-a and all of the characters in char-set-b. Intersection returns a character set of all of the characters that are in both char-set-a and char-set-b. Subtract returns a character set of all the characters in char-set-a that are not also in char-set-b.

— character set: lower-case = (set "abcdefghijklmnopqrstuvwxyz")
— character set: upper-case = (set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
— character set: alphabetic = (union lower-case upper-case)
— character set: numeric = (set "0123456789")
— character set: alphanumeric = (union alphabetic numeric)
— character set: punctuation = (set "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")
— character set: graphic = (union alphanumeric punctuation)
— character set: printing = (union graphic (set #\space))
— character set: control = (negate printing)
— character set: blank = (set #\space (ascii->char 9)) ; ASCII 9 = TAB
— character set: whitespace = (union (set #\space) (ascii-range 9 13))
— character set: hexdigit = (set "0123456789ABCDEF")

Predefined character sets.
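
The set operations combine naturally with the predefined sets. The following sketch assumes the regexps structure is open; the defined names are hypothetical.

     ; Lower-case letters other than the vowels.
     (define consonants (subtract lower-case (set "aeiou")))

     ; Every character that is not a decimal digit.
     (define non-digit (negate numeric))

     ; Characters usable in a simple identifier: letters, digits,
     ; hyphen, and underscore.
     (define identifier-char (union alphanumeric (set "-_")))

     ; Punctuation characters that are also members of a given string.
     (define brackets (intersection punctuation (set "()[]{}")))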

8.9.2.2 Anchoring

— procedure: string-start –> regexp
— procedure: string-end –> regexp

String-start returns a regular expression that matches the beginning of the string being matched against; string-end returns one that matches the end.
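
A short sketch of anchoring, assuming the regexps structure is open. It relies on sequence, text, and any-match?, which are documented later in this section; the defined names are illustrative, and the commented results follow from the descriptions, not from a recorded session.

     ; Note that string-start and string-end are procedures of no
     ; arguments, so they must be called.
     (define starts-with-abc (sequence (string-start) (text "abc")))
     (define ends-with-abc   (sequence (text "abc") (string-end)))

     (any-match? starts-with-abc "abcdef")    ; expected: #t
     (any-match? starts-with-abc "xxabc")     ; expected: #f
     (any-match? ends-with-abc   "xxabc")     ; expected: #t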

8.9.2.3 Composite expressions

— procedure: sequence regexp ... –> regexp
— procedure: one-of regexp ... –> regexp

Sequence returns a regular expression that matches the concatenation of all of its arguments; one-of returns a regular expression that matches any one of its arguments.

— procedure: text string –> regexp

Returns a regular expression that matches exactly the characters in string, in order.
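
As a minimal sketch of how these three procedures compose (assuming the regexps structure is open, and using exact-match? from later in this section), consider the following; greeting is a made-up name.

     (define greeting
       (sequence (one-of (text "hello") (text "goodbye"))
                 (text ", world")))

     (exact-match? greeting "hello, world")      ; expected: #t
     (exact-match? greeting "goodbye, world")    ; expected: #t
     (exact-match? greeting "hi, world")         ; expected: #f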

— procedure: repeat regexp –> regexp
— procedure: repeat count regexp –> regexp
— procedure: repeat min max regexp –> regexp

Repeat returns a regular expression that matches repeated occurrences of its regexp argument. With only one argument, the result matches regexp zero or more times. With two arguments, i.e. one count argument, the returned regular expression matches regexp exactly that number of times. The final case matches from min to max repetitions, inclusive. Max may be #f, in which case there is no maximum number of matches. Count & min must be exact, non-negative integers; max must be either #f or an exact, non-negative integer.
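
The three calling conventions can be sketched as follows, assuming the regexps structure is open. The defined names are illustrative, and the commented results are what the description above implies.

     (define any-digits   (repeat numeric))         ; zero or more
     (define three-digits (repeat 3 numeric))       ; exactly three
     (define two-to-four  (repeat 2 4 numeric))     ; two, three, or four
     (define some-digits  (repeat 1 #f numeric))    ; one or more

     (exact-match? three-digits "123")      ; expected: #t
     (exact-match? three-digits "12")       ; expected: #f
     (exact-match? two-to-four  "12345")    ; expected: #f
     (exact-match? any-digits   "")         ; expected: #t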

8.9.2.4 Case sensitivity

Regular expressions are normally case-sensitive, but case sensitivity can be controlled with the following procedures.

— procedure: ignore-case regexp –> regexp
— procedure: use-case regexp –> regexp

The regular expression returned by ignore-case is identical to its argument except that the case will be ignored when matching. The value returned by use-case is protected from future applications of ignore-case. The expressions returned by use-case and ignore-case are unaffected by any enclosing uses of these procedures.

By way of example, the following matches "ab", but not "aB", "Ab", or "AB":

          (text "ab")

while

          (ignore-case (text "ab"))

matches all of those, and

          (ignore-case (sequence (text "a")
                                 (use-case (text "b"))))

matches "ab" or "Ab", but not "aB" or "AB".

8.9.2.5 Submatches and matching

A subexpression within a larger expression can be marked as a submatch. When an expression is matched against a string, the success or failure of each submatch within that expression is reported, as well as the location of the substring matched by each successful submatch.

— procedure: submatch key regexp –> regexp
— procedure: no-submatches regexp –> regexp

Submatch returns a regular expression that is equivalent to regexp in every way except that the regular expression returned by submatch will produce a submatch record in the output for the part of the string matched by regexp. No-submatches returns a regular expression that is equivalent to regexp in every respect except that all submatches generated by regexp will be ignored & removed from the output.
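
For instance, a pattern with two keyed submatches, and a variant of it with the submatches stripped, might be built like this. This is only a sketch that assumes the regexps structure is open; key-value and key-value-unmarked are hypothetical names.

     ; Matches text such as "width=80", marking both halves.
     (define key-value
       (sequence (submatch 'key (repeat 1 #f alphabetic))
                 (text "=")
                 (submatch 'value (repeat 1 #f numeric))))

     ; Matches the same strings, but reports no submatches.
     (define key-value-unmarked (no-submatches key-value))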

— procedure: any-match? regexp string –> boolean
— procedure: exact-match? regexp string –> boolean
— procedure: match regexp string –> match or #f

Any-match? returns #t if string matches regexp or contains a substring that does, or #f otherwise. Exact-match? returns #t if string matches regexp exactly, or #f if it does not.

Match returns #f if string does not match regexp, or a match record if it does, as described in the previous section. Matching occurs according to POSIX. The match returned is the one with the lowest starting index in string. If there is more than one such match, the longest is returned. Within that match, the longest possible submatches are returned.

All three matching procedures cache a compiled version of regexp. Subsequent calls with the same input regular expression will be more efficient.

Here are some examples of the high-level regular expression interface:

     (define pattern (text "abc"))
     
     (any-match? pattern "abc")            => #t
     (any-match? pattern "abx")            => #f
     (any-match? pattern "xxabcxx")        => #t
     
     (exact-match? pattern "abc")          => #t
     (exact-match? pattern "abx")          => #f
     (exact-match? pattern "xxabcxx")      => #f
     
     (let ((m (match (sequence (text "ab")
                               (submatch 'foo (text "cd"))
                               (text "ef"))
                     "xxxabcdefxxx")))
       (list m (match-submatches m)))
         => (#{Match 3 9} ((foo . #{Match 5 7})))
     
     (match-submatches
      (match (sequence (set "a")
                       (one-of (submatch 'foo (text "bc"))
                               (submatch 'bar (text "BC"))))
             "xxxaBCd"))
         => ((bar . #{Match 4 6}))
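
Building on the examples above, the following sketch shows one way to turn keyed submatches back into substrings. It assumes that the regexps structure is open and that match-start, match-end & match-submatches are visible (opening posix-regexps as well should make them so). The names address and submatch-text are hypothetical, and the commented result is the one implied by the leftmost-longest rule described above.

     (define address
       (sequence (submatch 'user (repeat 1 #f alphanumeric))
                 (text "@")
                 (submatch 'host (repeat 1 #f alphanumeric))))

     ; Return the text matched by the submatch named KEY in match
     ; record M, or #f if that submatch did not take part in the match.
     (define (submatch-text m key string)
       (let ((entry (assq key (match-submatches m))))
         (and entry
              (substring string
                         (match-start (cdr entry))
                         (match-end (cdr entry))))))

     (let* ((line "mail bob@example today")
            (m (match address line)))
       (and m
            (list (submatch-text m 'user line)
                  (submatch-text m 'host line))))
     ; expected: ("bob" "example")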