regexpr(pattern, text, extended = TRUE, fixed = FALSE, ignore.case = FALSE, perl = FALSE, subpattern = 0)
How the pattern is interpreted is controlled by the arguments
extended, fixed, ignore.case, and perl. By default it is an extended
regular expression, as defined by POSIX 1003.2.
Only one of extended, fixed, and perl may be TRUE.
If extended is TRUE, the pattern is treated
as an extended regular expression. Otherwise it is a basic
(or 'obsolete') regular expression. In S-PLUS 7 and earlier,
regexpr supported only one form of regular expression, the
basic regular expression. Old code may have to be changed
to work with the new default of extended regular expressions.
The main difference is that the characters '+', '?', '|', '(',
')', '{', and '}' are ordinary characters in basic regular expressions
but operators in extended regular expressions. In an extended
regular expression they must therefore be preceded by a doubled
backslash to be treated as ordinary characters.
If fixed is TRUE, the pattern is not treated
as a regular expression at all, but as a literal sequence of
characters.
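As an illustration, the sketch below searches for the three characters a.c first as a regular expression and then as a literal string. The vector txt is made up for this example. Because only one of extended, fixed, and perl may be TRUE, extended = FALSE is passed explicitly; whether that is strictly required when fixed = TRUE is an assumption that has not been verified here.

```s
# txt is a hypothetical vector used only for illustration
txt <- c("a.c", "abc", "xa.cx")

# As a regular expression, '.' matches any character,
# so all three strings contain a match
regexpr("a.c", txt)

# As a literal string, only an actual period matches,
# so the second string ("abc") has no match
regexpr("a.c", txt, extended = FALSE, fixed = TRUE)
```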
If ignore.case is TRUE, upper- and lowercase
analogs are considered equivalent when matching.
If perl is TRUE, the pattern is interpreted
as a Perl regular expression. Otherwise the value
of the extended argument determines how
the pattern is interpreted.
If subpattern is a positive integer, then the return value refers
to the match corresponding to that parenthesized subpattern.
The subpatterns are numbered from the left by counting right parentheses.
If subpattern is -1 then the main
part of the return value is the normal one, but a list of the start
and length information for each subpattern is attached to the
result as the attribute "subpatterns".
A subpattern of 0 refers to the entire pattern.
An attribute, "match.length", gives the length of the longest possible matching substring
starting at that position.
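The following sketch shows how the subpattern argument and the "match.length" attribute might be used together. The names dates and pat are invented for the illustration, and the pattern has no nested parentheses so the numbering of the groups is unambiguous. The layout of the "subpatterns" attribute is not described here, so the last call is shown without expected output.

```s
# dates and pat are hypothetical names used only for illustration
dates <- c("2 Jan", "10 Sept")
pat <- "([0-9]+) ([A-Z][a-z]*)"

# subpattern = 0 (the default): position and length of the whole match
r0 <- regexpr(pat, dates)
substring(dates, r0, r0 + attr(r0, "match.length") - 1)
# Gives:
# [1] "2 Jan"   "10 Sept"

# subpattern = 2: position and length of the second parenthesized group
r2 <- regexpr(pat, dates, subpattern = 2)
substring(dates, r2, r2 + attr(r2, "match.length") - 1)
# Gives:
# [1] "Jan"  "Sept"

# subpattern = -1: the usual result, with start and length information
# for every group attached as the "subpatterns" attribute
rAll <- regexpr(pat, dates, subpattern = -1)
attr(rAll, "subpatterns")
```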
The pattern argument specifies a regular expression.
Certain punctuation characters are interpreted specially, as described below.
Other characters in the pattern match the same character in the text.
Case is significant unless ignore.case is TRUE.
First we describe extended regular expressions. The definition is recursive.
An 'atom' may be a single character other than one of '^.[$()|*+?{\\', in which case it matches itself. An atom may be a period, '.', which matches any character. An atom may be '^' or '$', which match the start or end of a string, respectively. An atom may be a backslash, '\\', followed by any character, which matches the character following the backslash. (In Splus backslashes in strings are doubled.) An atom may also be a bracket expression (see below) or a (possibly empty) regular expression enclosed in parentheses, in which case it matches what the bracket expression or regular expression matches.
A 'bracket expression' is a list of characters or character ranges (2 characters separated by a hyphen, '-') enclosed in square brackets. It matches any character in the list. If the list starts with a circumflex, '^', then it matches any character not in the remainder of the list. To include a '-' in a bracket list, make it the last entry.
A bracket expression may also contain a 'character class' of the form '[:name:]' where name is one of 'alpha', 'lower', 'upper', 'digit', 'alnum', 'blank', 'cntrl', 'punct', 'space', 'xdigit', 'print'. These match any alphabetic character, any lowercase alphabetic character, etc.
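A short sketch of the three forms of bracket expression described above: a range, a negated list, and a character class combined with a trailing '-'. The vector codes is made up for this example.

```s
# codes is a hypothetical vector used only for illustration
codes <- c("room 12B", "no digits", "x-ray_7")

# A range: position of the first digit.
# Matches at position 6 in the first string and 7 in the third;
# the second string has no match.
regexpr("[0-9]", codes)

# A negated list: first character that is neither a lowercase
# letter nor a space.
# Matches '1' at position 6 in the first string and '-' at position 2
# in the third; the second string has no match.
regexpr("[^a-z ]", codes)

# A character class plus a literal '-' placed last in the list:
# first digit or hyphen.
# Matches at position 6 in the first string and 2 in the third.
regexpr("[[:digit:]-]", codes)
```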
The start and end of a word are matched by the special patterns '[[:<:]]' and '[[:>:]]', respectively, where a 'word' is a sequence of 1 or more alphanumerics and underscores. These are extensions to the POSIX standard. Many other programs use '\\<' and '\\>' to match the start and end of a word. S-PLUS did so in version 7.0 and earlier but no longer does, because that usage would violate the POSIX standard.
A 'piece' of a regular expression is an atom, possibly followed by a repeat quantifier: an asterisk ('*', 0 or more repeats), a plus sign ('+', 1 or more repeats), a question mark ('?', 0 or 1 repeats), or a bound, '{min,max}' or '{count}' or '{min,}' or '{,max}'. The bound {min,max} means between min and max repeats. If min is missing it is taken to be 0 and if max is missing it is taken to be infinity. If there is no comma then it matches exactly the given count of repeats. E.g., '+' is equivalent to '{1,}', '*' is '{0,}', and '?' is '{0,1}'.
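The effect of a bound can be seen in the sketch below, which also checks the stated equivalence of '+' and '{1,}'. The vector runs and the result names are invented for the illustration.

```s
# runs is a hypothetical vector used only for illustration
runs <- c("ab", "aab", "aaaab")

# Between 2 and 3 repeats of 'a'.
# "ab" has no match; "aab" matches 2 characters starting at position 1;
# "aaaab" matches 3 characters starting at position 1.
r <- regexpr("a{2,3}", runs)
r
attr(r, "match.length")

# 'a+' and 'a{1,}' should give the same positions and lengths
rPlus <- regexpr("a+", runs)
rBound <- regexpr("a{1,}", runs)
all(rPlus == rBound)
all(attr(rPlus, "match.length") == attr(rBound, "match.length"))
```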
Finally, a 'branch' is a sequence of pieces, concatenated, and a 'regular expression' is a sequence of branches separated by vertical bars, '|'. The regular expression matches if any branch in it matches.
The above description was of extended regular expressions (the default type). 'Basic' regular expressions treat '+', '?', and '|' as ordinary characters. The delimiters for bounds are '\\{' and '\\}' instead of '{' and '}' so the latter are also treated as ordinary characters. The parentheses for nested subexpressions are '\\(' and '\\)' instead of '(' and ')' (so parentheses are ordinary characters). '$' is only special at the end of a regular expression and '^' is only special at the beginning of a regular expression. '*' is not special at the beginning of a regular expression.
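To make the contrast concrete, the sketch below applies the same pattern text under both syntaxes. The vector txt is made up for this example.

```s
# txt is a hypothetical vector used only for illustration
txt <- c("a+b", "aab")

# Extended (the default): '+' is a repeat operator, so the pattern
# means "one or more 'a' followed by 'b'".
# Only the second string ("aab") matches.
regexpr("a+b", txt)

# Basic: '+' is an ordinary character, so the pattern means the
# literal text a+b.
# Only the first string ("a+b") matches.
regexpr("a+b", txt, extended = FALSE)

# Extended with the '+' escaped by a doubled backslash behaves like
# the basic form: only the first string matches.
regexpr("a\\+b", txt)
```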
When extended=TRUE, parentheses are
considered to be part of the pattern language and
must be preceded by a (doubled) backslash to be taken
literally. Unmatched (and unescaped) parentheses
result in an error. In some other extended regular
expression parsers (e.g., R's), unmatched parentheses
are taken literally and only matched parentheses are
considered to be grouping symbols.
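A minimal sketch of the two roles a parenthesis can play in an extended regular expression: escaped, it is a literal character; unescaped and matched, it groups a subpattern. The vector calls is made up for this example.

```s
# calls is a hypothetical vector used only for illustration
calls <- c("mean(x)", "no call here")

# An escaped parenthesis is matched literally.
# It is found at position 5 in the first string; the second has no match.
regexpr("\\(", calls)

# Unescaped parentheses group the name; the escaped one that follows
# is the literal '(' in the text.
r <- regexpr("([a-z]+)\\(", calls, subpattern = 1)
substring(calls, r, r + attr(r, "match.length") - 1)
# The first element is "mean"
```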
x <- c("10 Sept", "Oct 9th", "Jan 2", "4th of July")
# Find the numbers in the above strings:
w <- regexpr("[0-9]+", x)
w
# Gives:
# [1] 1 5 5 1
# attr(, "match.length"):
# [1] 2 1 1 1
# Extract the numbers:
as.numeric(substring(x, w, w+attr(w, "match.length")-1))
# Gives:
# [1] 10 9 2 4
# Extract the capitalized words
w1 <- regexpr("[[:<:]][A-Z][a-z]*", x)
substring(x, w1, w1+attr(w1, "match.length")-1)
# Gives:
# [1] "Sept" "Oct" "Jan" "July"
# Do the same with substituteString. Note that \\n in
# the replacement string refers to the n'th parenthesized
# subexpression in the pattern.
substituteString("(.*)([[:<:]][A-Z][a-z]*)(.*)", "\\2", x)
# Rewrite the dates in "day Month" format, in 3 steps
# First move number to front and remove 'th's. \\n
# in the replacement pattern refers to the n'th parenthesized
# subexpression in the pattern.
tmp <- substituteString("([^0-9]*)([0-9]+)[a-z]*([^0-9]*)", "\\2 \\1\\3", x)
# remove lowercase words
tmp <- substituteString("[[:<:]][[:lower:]]+", "", tmp)
# remove repeated spaces
tmp <- substituteString(" {2,}", " ", tmp)
# remove leading and trailing spaces
tmp <- substituteString("^ +| +$", "", tmp)
tmp
# Gives:
# [1] "10 Sept" "9 Oct" "2 Jan" "4 July"
# get the integer part of numbers
s <- c("-14.0e-05", ".002", "1,700", "+1999.999", "$34.50")
r <- regexpr("^ *[-+$]?([0-9,]+)", s, subpattern = 0)
substring(s, r, r + attr(r, "match.length") - 1)
# returns: "-14" "" "1,700" "+1999" "$34"
r <- regexpr("^ *[-+$]?([0-9,]+)", s, subpattern = 1)
substring(s, r, r + attr(r, "match.length") - 1)
# returns: "14" "" "1,700" "1999" "34"
This function uses the regex regular expression matching code written by Henry Spencer at the University of Toronto. It is the alpha3.8 release, dated Tue Aug 10 15:51:48 EDT 1999.