Pattern Matching in Strings

DESCRIPTION:

Search for a pattern matching a regular expression in character strings.

USAGE:

regexpr(pattern, text, extended = TRUE, fixed = FALSE, ignore.case = FALSE, perl = FALSE, subpattern = 0)

REQUIRED ARGUMENTS:

pattern
character string specifying the pattern to search for. The interpretation of the pattern is controlled by the logical-valued arguments extended, fixed, ignore.case, and perl. By default it is an extended regular expression, as defined by POSIX 1003.2. Only one of extended, fixed, and perl may be TRUE.
text
a vector of character strings in which to search.

OPTIONAL ARGUMENTS:

extended
If TRUE the pattern is treated as an extended regular expression. Otherwise it is a basic (or 'obsolete') regular expression. In S-PLUS 7 and earlier, regexpr supported only one form of regular expression, the basic regular expression. Old code may have to be changed to work with the new default of extended regular expressions. The main difference is that the characters '+', '?', '|', '(', ')', '{', and '}' are ordinary characters in basic regular expressions but operators in extended regular expressions. In an extended regular expression they must therefore be preceded by a doubled backslash to be treated as ordinary characters.
fixed
If TRUE the pattern is not treated as a regular expression at all, but as a literal sequence of characters.
ignore.case
If TRUE upper- and lowercase analogs are considered equivalent when matching.
perl
This argument is not implemented yet. If TRUE the pattern is interpreted as a Perl-language regular expression. Otherwise the value of extended determines how the pattern is interpreted.
subpattern
Regular expressions may contain parenthesized subpatterns. If subpattern is a positive integer, the return value refers to the match for that parenthesized subpattern. Subpatterns are numbered from the left by counting left (opening) parentheses, as in POSIX. If subpattern is -1, the main part of the return value is the normal one, but a list of the start and length information for each subpattern is attached to the result as the attribute "subpatterns". A subpattern of 0 refers to the entire pattern.

VALUE:

numeric vector giving, for each element of text, the position of the first substring matching the regular expression. A value of -1 means no match was found. An attribute, "match.length", gives the length of the longest possible matching substring starting at that position.

DETAILS:

The pattern argument specifies a regular expression. Certain punctuation characters are interpreted specially, as described below. Other characters in the pattern match the same character in the text. Case is significant unless ignore.case is TRUE.

First we describe extended regular expressions. The definition is recursive.

An 'atom' may be a single character other than one of '^.[$()|*+?{\\', in which case it matches itself. An atom may be a period, '.', which matches any character. An atom may be '^' or '$', which match the start or end of a string, respectively. An atom may be a backslash, '\\', followed by any character, which matches the character following the backslash. (In S-PLUS, backslashes in strings are doubled.) An atom may also be a bracket expression (see below) or a (possibly empty) regular expression enclosed in parentheses, in which case it matches what the bracket expression or regular expression matches.

A 'bracket expression' is a list of characters or character ranges (2 characters separated by a hyphen, '-') enclosed in square brackets. It matches any character in the list. If the list starts with a circumflex, '^', then it matches any character not in the remainder of the list. To include a '-' in a bracket list, make it the last entry.

A bracket expression may also contain a 'character class' of the form '[:name:]' where name is one of 'alpha', 'lower', 'upper', 'digit', 'alnum', 'blank', 'cntrl', 'punct', 'space', 'xdigit', 'print'. These match any alphabetic character, any lowercase alphabetic character, etc.

The start and end of a word are matched by the special patterns '[[:<:]]' and '[[:>:]]', respectively, where a 'word' is a sequence of 1 or more alphanumerics and underscores. These are extensions to the POSIX standard. Many other programs use '\\<' and '\\>' to match the start and end of a word. S-PLUS did so in version 7.0 and earlier, but no longer does, because that usage would violate the POSIX standard.

A 'piece' of a regular expression is an atom, possibly followed by a repeat quantifier: an asterisk ('*', 0 or more repeats), a plus sign ('+', 1 or more repeats), a question mark ('?', 0 or 1 repeats), or a bound, '{min,max}' or '{count}' or '{min,}' or '{,max}'. The bound {min,max} means between min and max repeats. If min is missing it is taken to be 0 and if max is missing it is taken to be infinity. If there is no comma then it matches exactly the given count of repeats. E.g., '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.

Finally, a 'branch' is a sequence of pieces, concatenated, and a 'regular expression' is a sequence of branches separated by vertical bars, '|'. The regular expression matches if any branch in it matches.

The above description was of extended regular expressions (the default type). 'Basic' regular expressions treat '+', '?', and '|' as ordinary characters. The delimiters for bounds are '\\{' and '\\}' instead of '{' and '}' so the latter are also treated as ordinary characters. The parentheses for nested subexpressions are '\\(' and '\\)' instead of '(' and ')' (so parentheses are ordinary characters). '$' is only special at the end of a regular expression and '^' is only special at the beginning of a regular expression. '*' is not special at the beginning of a regular expression.

NOTE:

When extended=TRUE, parentheses are considered to be part of the pattern language and must be preceded by a (doubled) backslash to be taken literally. Unmatched (and unescaped) parentheses result in an error. In some other extended regular expression parsers (e.g., R's), unmatched parentheses are taken literally and only matched parentheses are considered to be grouping symbols.

SEE ALSO:

. . . .

EXAMPLES:

x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w
# Gives:
# [1] 1 5 5 1
# attr(, "match.length"):
# [1] 2 1 1 1

# Extract the numbers:
as.numeric(substring(x, w, w+attr(w, "match.length")-1))
# Gives:
# [1] 10  9  2  4
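
# A further sketch (not one of the original examples), assuming
# bounds behave as described in DETAILS: find runs of two or more
# digits.  Only "10 Sept" contains such a run, so the other
# elements should report no match (-1).
w2 <- regexpr("[0-9]{2,}", x)
w2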

# Extract the capitalized words
w1 <- regexpr("[[:<:]][A-Z][a-z]*", x)
substring(x, w1, w1+attr(w1, "match.length")-1)
# Gives:
# [1] "Sept" "Oct"  "Jan"  "July"
# Do the same with substituteString.  Note that \\n in
# the replacement string refers to the n'th parenthesized
# subexpression in the pattern.
substituteString("(.*)([[:<:]][A-Z][a-z]*)(.*)", "\\2", x)
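
# A hedged sketch of ignore.case (not one of the original examples):
# look for "sept" without regard to case.  Only the first element
# of x contains it, as "Sept".
wc <- regexpr("sept", x, ignore.case = TRUE)
substring(x, wc, wc + attr(wc, "match.length") - 1)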

# Rewrite the dates in "day Month" format, in 3 steps
# First move number to front and remove 'th's.
tmp <- substituteString("([^0-9]*)([0-9]+)[a-z]*([^0-9]*)",
          "\\2 \\1\\3", x)
# remove lowercase words
tmp <- substituteString("[[:<:]][[:lower:]]+", "", tmp)
# remove repeated spaces
tmp <- substituteString(" {2,}", " ", tmp)
# remove leading and trailing spaces
tmp <- substituteString("^ +| +$", "", tmp)
tmp
# Gives:
# [1] "10 Sept" "9 Oct"   "2 Jan"   "4 July" 
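
# A sketch of matching operator characters literally (the strings
# in p are made up for illustration).  With the default
# extended = TRUE, '(' and '+' are operators and must be escaped
# with a doubled backslash; an unescaped, unmatched "(" would be
# an error.
p <- c("f(x)", "3.14", "a+b")
regexpr("\\(", p)     # literal left parenthesis
regexpr("\\+", p)     # literal plus sign
# With fixed = TRUE the pattern is a literal string, so no escaping
# is needed.  extended is set to FALSE here on the assumption that
# only one of extended, fixed, and perl may be TRUE.
regexpr(".", p, extended = FALSE, fixed = TRUE)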

# get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s, subpattern = 0)
substring(s, r, r + attr(r, "match.length") - 1)
# returns: "-14"   ""      "1,700" "+1999" "$34"
r <- regexpr("^ *[-+$]?([0-9,]+)", s, subpattern = 1)
substring(s, r, r + attr(r, "match.length") - 1)
# returns: "14"    ""      "1,700" "1999"  "34"
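
# A sketch of subpattern = -1, reusing s from above.  The main
# value should be the same as for subpattern = 0, with the start
# and length information for each parenthesized subpattern attached
# as the attribute "subpatterns".  The exact layout of that
# attribute is not shown here.
r <- regexpr("^ *[-+$]?([0-9,]+)", s, subpattern = -1)
attr(r, "subpatterns")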

ACKNOWLEDGEMENT:

This function uses the regex regular expression matching code written by Henry Spencer at the University of Toronto. It is the alpha3.8 release, dated Tue Aug 10 15:51:48 EDT 1999.