Alice Project

The Regex structure

________ Synopsis ____________________________________________________

    signature REGEX
    structure Regex : REGEX

This structure provides an interface to a (subset of) POSIX-compatible regular expressions.
Note, however, that functions obtained by partially applying match to a regular expression cannot be pickled.

________ Import ______________________________________________________

    import structure Regex from "x-alice:/lib/regex/Regex"
    import signature REGEX from "x-alice:/lib/regex/REGEX-sig"

________ Interface ___________________________________________________

    signature REGEX =
    sig
	type match

	infix 2 =~

	exception Malformed
	exception NoSuchGroup

	val match      : string -> string -> match option
	val =~         : string * string -> bool

	val groups     : match -> string vector
	val group      : match * int -> string
	val groupStart : match * int -> int
	val groupEnd   : match * int -> int
	val groupSpan  : match * int -> (int * int)
    end


________ Description _________________________________________________

type match

The abstract type of a matching.

exception Malformed

indicates that a regular expression is not well-formed.

exception NoSuchGroup

indicates that an access to a group of a match has failed because no such group exists.

match r s

returns SOME m if r matches s and NONE otherwise. It raises Malformed if r is not a well-formed regular expression.
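
A minimal sketch of calling match, assuming the import shown in the Import section. The helper names describe and isNumber and the sample patterns are illustrative only and are not part of the library.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* describe is a hypothetical helper: it reports whether r matches s
   and catches Malformed for patterns that are not well-formed. *)
fun describe (r, s) =
    (case Regex.match r s of
         SOME _ => "match"
       | NONE   => "no match")
    handle Regex.Malformed => "malformed pattern"

val a = describe ("^[0-9]+$", "2007")
val b = describe ("^[0-9]+$", "March")
val c = describe ("[0-9", "2007")      (* unterminated bracket expression *)

(* match is curried, so a pattern can be fixed once and the resulting
   function of type string -> match option reused on several subjects. *)
val isNumber = Option.isSome o Regex.match "^[0-9]+$"
val flags = List.map isNumber ["30", "17:10"]
```

The handler wraps the whole application, so it catches Malformed whether it is raised when the pattern is supplied or when the subject string is supplied.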

r =~ s

The following equivalence holds:

r =~ s = Option.isSome (match r s)

groups m

returns the groups of the match m as a string vector.

group (m, i)
groupStart (m, i)
groupEnd (m, i)

take a match m and a group index i, and return the text of group i, its start position, and its end position, respectively. Each raises NoSuchGroup if i >= Vector.length (groups m) or i < 0.
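
The group accessors might be used roughly as follows. This is a sketch under assumptions: the pattern and subject are made up for illustration, and group indices are taken to follow the usual POSIX numbering of parenthesised groups, which this page does not state explicitly.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* Illustrative pattern with three parenthesised groups. *)
val date = Regex.match "([0-9]+)/([[:alpha:]]+)/([0-9]+)" "2007/Mar/30"

val () =
    case date of
        NONE   => print "no match\n"
      | SOME m =>
        (* Print every string in the vector returned by groups. *)
        (Vector.app (fn g => print (g ^ "\n")) (Regex.groups m);
         (* Start and end position of group 2 in the subject. *)
         print (Int.toString (Regex.groupStart (m, 2)) ^ ".." ^
                Int.toString (Regex.groupEnd (m, 2)) ^ "\n");
         (* An index outside the groups vector raises NoSuchGroup. *)
         print (Regex.group (m, 99) ^ "\n")
         handle Regex.NoSuchGroup => print "no group 99\n")
```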

________ Example _____________________________________________________

This structure provides pattern matching with POSIX 1003.2 regular expressions.

The form and meaning of Extended and Basic regular expressions are described below. Here R and S denote regular expressions; m and n denote natural numbers; L denotes a character list; and d denotes a decimal digit:

    Extended    Basic       Meaning
    ---------------------------------------------------------------
    c           c           Match the character c
    .           .           Match any character
    R*          R*          Match R zero or more times
    R+          R\+         Match R one or more times
    R|S         R\|S        Match R or S
    R?          R\?         Match R or the empty string
    R{m}        R\{m\}      Match R exactly m times
    R{m,}       R\{m,\}     Match R at least m times
    R{m,n}      R\{m,n\}    Match R at least m and at most n times
    [L]         [L]         Match any character in L
    [^L]        [^L]        Match any character not in L
    ^           ^           Match at string's beginning
    $           $           Match at string's end
    (R)         \(R\)       Match R as a group; save the match
    \d          \d          Match the same as previous group d
    \\          \\          Match \ --- similarly for *.[]^$
    \+          +           Match + --- similarly for |?{}()
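
A few of these operators can be tried with =~. This sketch assumes that the structure interprets patterns in the Extended syntax, as the examples on this page suggest. The value names and subject strings are illustrative.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* Bring =~ into scope as an infix operator, with the same fixity
   as in the REGEX signature. *)
infix 2 =~
val op =~ = Regex.=~

(* The pattern is the left operand, the subject string the right. *)
val alternation = "^(cat|dog)$" =~ "dog"       (* R|S    *)
val optional    = "^colou?r$" =~ "color"       (* R?     *)
val repeated    = "^(ab)+$" =~ "ababab"        (* R+     *)
val bounded     = "^[0-9]{2,4}$" =~ "2007"     (* R{m,n} *)
```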

Some example character lists L:

    [aeiou]         Match vowel: a or e or i or o or u
    [0-9]           Match digit: 0 or 1 or 2 or ... or 9
    [^0-9]          Match non-digit
    [-+*/^]         Match - or + or * or / or ^
    [-a-z]          Match lowercase letter or hyphen (-)
    [0-9a-fA-F]     Match hexadecimal digit
    [[:alnum:]]     Match letter or digit
    [[:alpha:]]     Match letter
    [[:cntrl:]]     Match ASCII control character
    [[:digit:]]     Match decimal digit; same as [0-9]
    [[:graph:]]     Same as [:print:] but not [:space:]
    [[:lower:]]     Match lowercase letter
    [[:print:]]     Match printable character
    [[:punct:]]     Match punctuation character
    [[:space:]]     Match SML #" ", #"\r", #"\n", #"\t", #"\v", #"\f"
    [[:upper:]]     Match uppercase letter
    [[:xdigit:]]    Match hexadecimal digit; same as [0-9a-fA-F]
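
Named classes such as [:xdigit:] are only meaningful inside a bracket expression, hence the doubled brackets. A small sketch follows. The predicates isHexLiteral and hasSpace are hypothetical, and hasSpace assumes that an unanchored pattern may match anywhere in the subject, as the use of ^ and $ in the examples on this page suggests.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* Hypothetical predicate: "0x" followed by one or more hexadecimal
   digits, written with the [:xdigit:] class. *)
fun isHexLiteral s = Regex.=~ ("^0x[[:xdigit:]]+$", s)

(* Hypothetical predicate: s contains a character from the
   [:space:] class. *)
fun hasSpace s = Regex.=~ ("[[:space:]]", s)

val h = isHexLiteral "0x1F"
val w = hasSpace "last modified"
```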

Remember that backslash (\) must be escaped as "\\" in SML strings.
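
To illustrate the escaping rule: the SML lexer consumes one level of backslashes before the regular expression library sees the pattern. In the sketch below, isAmlFile is a hypothetical helper and the file suffix is only an example.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* The regular expression for a literal dot is a backslash followed
   by a dot. As an SML string literal the backslash is doubled. *)
val literalDot = "\\."

(* After SML unescaping the string holds two characters:
   one backslash and one dot. *)
val two = size literalDot

(* Hypothetical use: does the name end in ".aml"? *)
fun isAmlFile name = Regex.=~ ("\\.aml$", name)
```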

Example: Match SML integer constant:

    match "^~?[0-9]+$"

Example: Match SML alphanumeric identifier:

    match "^[a-zA-Z0-9][a-zA-Z0-9'_]*$"

Example: Match SML floating-point constant:

    match "^[+~]?[0-9]+(\\.[0-9]+|(\\.[0-9]+)?[eE][+~]?[0-9]+)$"

Example: Match any HTML start tag; make the tag's name into a group:

    match "<([[:alnum:]]+)[^>]*>"
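
The patterns above could be wrapped as small helpers, sketched here under the REGEX signature given in the Interface section. The names isIntConst, isAlphaId and tagName are hypothetical. tagName assumes that index 1 refers to the first parenthesised group.

```sml
import structure Regex from "x-alice:/lib/regex/Regex"

(* Predicates built from the example patterns via =~. *)
fun isIntConst s = Regex.=~ ("^~?[0-9]+$", s)
fun isAlphaId s  = Regex.=~ ("^[a-zA-Z0-9][a-zA-Z0-9'_]*$", s)

(* Hypothetical helper: SOME name if s contains an HTML start tag,
   NONE otherwise. *)
fun tagName s =
    Option.map (fn m => Regex.group (m, 1))
               (Regex.match "<([[:alnum:]]+)[^>]*>" s)

val i = isIntConst "~42"
val t = tagName "<img src=\"a.png\">"
```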

last modified 2007/Mar/30 17:10