The expression language can be used for specifying expressions in the following functions: bd.create.columns, bd.filter.rows, and bd.split. These functions can alternatively evaluate their expressions within S-PLUS.
The expression language was designed to be easily understood by S-PLUS users, as well as users familiar with Excel formulas and other programming languages. In many cases, an expression in the expression language could be executed directly within S-PLUS without change. The reverse is not true: S-PLUS is a complete programming language with many features (user-defined functions, variables, etc.) not supported by the expression language.
An expression is a combination of constants, operators, function calls, and references to input column values that are evaluated to return a single value. For example, the following would be a legal expression if the input columns included a continuous column named PRICE:
max(PRICE+17.5,100)
The expression is evaluated once for every input row, and the value is either output as a new column value (in bd.create.columns) or used as a logical value (in bd.filter.rows and bd.split).
One feature of the expression language that may be unfamiliar to S-PLUS users is that it enforces strong type checking within an expression. Each subexpression within an expression has a single type that can be determined at the time of parsing, and operators and functions check the types of their arguments at parse time. Because type checking is enforced, many of the errors that a user might make when constructing expressions can be detected during parsing, before trying to execute a transformation on a large dataset.
Strong typing doesn't rule out functions and operators that accept more than one argument type. For example, the expression language currently supports both addition, such as <double> + <double>, and concatenation, such as <string> + <string>. However, the permissible types are still restricted: <double> + <logical> will give a parse error.
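The typing rules for + can be pictured as a lookup made before any data is read. The following Python sketch only models the rules stated in this document; it is not the parser used by the expression language, and the function name plus_result_type and the type labels are hypothetical.

```python
def plus_result_type(left, right):
    """Return the result type of `left + right`, or raise for a parse error.

    Hypothetical model of the documented rules; not the library's parser.
    """
    known = {"double", "string", "date", "logical"}
    if left not in known or right not in known:
        raise ValueError("unknown type")
    if left == "string" or right == "string":
        return "string"              # <string> + <any>, <any> + <string>
    if left == "double" and right == "double":
        return "double"              # arithmetic plus
    if {left, right} == {"date", "double"}:
        return "date"                # add number of days to a date
    raise TypeError("parse error: <%s> + <%s>" % (left, right))
```

Because the check needs only the types, an expression such as <double> + <logical> is rejected without touching a single row.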
The expression language only supports four types of values:
Doubles (floating point numbers)
Strings
Dates
Logical values (true, false, or NA).
String and categorical values are both manipulated as strings within the expression language. All four types support an NA (missing) value.
Logical values can be created and manipulated within an expression, but cannot be directly read from or written to datasets. When reading from or writing to datasets, logical values are treated as doubles, with zero representing false and any other non-NA value representing true.
All of the operations and expressions are designed to detect NA (missing) argument values, and work appropriately. In most cases, if any of the arguments to a function are NA, the result is NA. There are some exceptions, such as ifelse, | (or), and & (and). For example, given the expression A|B, if A is true, the result is true, even if B is NA.
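The behavior of | with an NA operand can be modelled as three-valued logic. The Python sketch below is an illustration only: None stands in for NA, the names or3 and and3 are hypothetical, and the & case (false & NA giving false) is an assumption made by analogy with the | rule stated above.

```python
NA = None  # stand-in for a missing logical value


def or3(a, b):
    """Model of A|B: true wins over NA."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def and3(a, b):
    """Model of A&B: false wins over NA (assumed, by analogy with |)."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True
```

In this model the result is NA only when the known operand cannot decide the answer by itself.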
One counterintuitive result is that string manipulations return NA if any of their arguments are NA. Consider the expression:
'The value is: '+PRICE
If the value of the PRICE column is the number 1.3, this constructs the string 'The value is: 1.3'. If the value of the PRICE column is NA, then the result is a string NA value. The is.na function can be used to explicitly detect NA values.
Most errors can be detected at parsing time. These include simple parsing errors (like unbalanced parentheses) and type errors. Currently, the expression language operations and functions do not generate any run-time errors. For example, taking the square root of a negative number will return an NA value, rather than causing an error.
A column reference is a name that looks like a variable in an expression. For example, take the name FOO in the expression FOO+1. It can be distinguished from a function name because all function names must be directly followed by a parenthesis '('. A column reference name is a sequence of alphabetic and numeric characters, along with the underscore (_) and period (.) characters. A name cannot start with a digit 0-9. Column reference names are case-sensitive. The following are examples of valid column reference names:
myCol, abc_3, xyz.4
An expression is evaluated once for every row in the input dataset. A column reference is evaluated by retrieving the value of the named column in the input dataset, for the current row.
Column names with disallowed characters (like spaces) can be accessed by calling the function get with a string constant specifying the column name, as follows:
get('strange chars!')
Note that column references always refer to the values in the input dataset. Using bd.create.columns, it is possible to specify multiple expressions to create new columns, and change the values of input columns. However, even in this case, column references always refer to the input columns, not the newly-computed output columns.
Sometimes, it would be useful to compute one new column, and then compute another new column based on the newly-computed value of the first. This could be done by copying the entire expression for one column into the other expression, or by calling bd.create.columns multiple times. However, it is easier to use the function getNew, which takes as its argument a column reference or constant string, and references the newly-computed value of that column. For example, the following code creates a new column aa, and then computes a new column bb based on the new value for aa:
bd.create.columns(fuel.frame, c('Weight+10', 'getNew(aa)*2'), c('aa', 'bb'))
Normal double and string constants are supported. Doubles may have a decimal point, and exponential notation. Strings are delimited by double-quote characters, and may include backslashes to include double quote and backslash characters within the string. The normal backslash codes will also work (\r, \n). Unicode characters can be specified with \u0234. Like S-PLUS, strings may also be delimited with single quotes, in which case double quote characters can appear unquoted. Some examples of double and string constants follow:
0.123, 12.34e34, -12: numeric constants.
"foo", "x\ny\"z": string constants.
'foo', 'x\ny"z': string constants.
The operators are the normal arithmetic and logical operators, such as
+ - * / %% | & > < >= <= ! != ==
Parentheses can be used to alter the evaluation order. Most of the operators are only defined for numbers, and will give an error when applied to non-double values.
<double> + <double>: arithmetic plus.
<double> - <double>: arithmetic minus.
<double> * <double>: arithmetic multiply.
<double> / <double>: arithmetic divide.
<double> %% <double>: arithmetic remainder.
<double> ^ <double>: arithmetic exponentiation.
<string> + <any>: string concatenation.
<any> + <string>: string concatenation.
<date> + <double>: add number of days to date.
<double> + <date>: add number of days to date.
<date> - <date>: returns number of days between two dates (with fraction of day).
<date> - <double>: subtracts number of days from date.
<any> == <any>: compare two doubles, strings, dates, logicals.
<any> != <any>: compare two doubles, strings, dates, logicals.
<any> < <any>: compare two doubles, strings, dates.
<any> > <any>: compare two doubles, strings, dates.
<any> <= <any>: compare two doubles, strings, dates.
<any> >= <any>: compare two doubles, strings, dates.
<logical> & <logical>: returns true if both operands are true.
<logical> | <logical>: returns true if either operand is true.
- <double>: unary minus.
+ <double>: unary plus.
! <double>: logical not.
The expression language provides a fixed set of functions. A function is called by giving the function name, followed by open-parentheses, followed by zero or more expressions separated by commas, followed by close-parentheses. There can be spaces between the function name and the open parentheses. S-PLUS-style named arguments are not allowed.
asString(<any>): convert expression value to string.
asDouble(<string>): convert string to double.
formatDouble(<double>, <decimal symbols string>, <num digits double>): convert double to string, using the given double formatting string, where <decimal symbols string> = <decimal point character><thousands sep character>. For example: formatDouble(2002.05123, ".'", 2)
parseDouble(<string>, <decimal symbols string>): convert string to double, using the given double parsing string.
formatDate(<date>, <formatstring>): convert date to string, using the given date formatting string.
parseDate(<string>, <parsestring>): convert string to date, using the given date parsing string.
max(<double>, <double>): maximum of two double values.
min(<double>, <double>): minimum of two double values.
abs(<double>): absolute value of double.
ceiling(<double>): smallest integer greater than or equal to the value.
floor(<double>): largest integer less than or equal to the value.
round(<double>): integer nearest to the value.
int(<double>): integer part of value (closest integer between the value and zero).
sqrt(<double>): square root.
exp(<double>): e raised to the given value.
log(<double>): natural log of the value.
log10(<double>): log to base 10 of the value.
sin(<double>): sine of the value.
cos(<double>): cosine of the value.
tan(<double>): tangent of the value.
asin(<double>): arcsine of the value.
acos(<double>): arccosine of the value.
atan(<double>): arctangent of the value.
random(): uniformly-distributed in the range
randomGaussian(): value selected from Gaussian distribution with mean=0.0, stdev=1.0.
Inf(): positive infinity. Negative infinity can be generated with -Inf().
bitAND(<double>, <double>): bitwise AND.
bitOR(<double>, <double>): bitwise OR.
bitXOR(<double>, <double>): bitwise XOR.
bitNOT(<double>): bitwise complement.
For the bitwise functions, the arguments are coerced to 32-bit integers before performing the operation. These can be used to unpack bits from encoded numbers.
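To illustrate unpacking bits from an encoded number, the following Python sketch models the coercion to 32-bit integers. It is a model under stated assumptions, not the library's implementation: the helper names are hypothetical, and truncation toward zero with two's-complement wrap-around is assumed, since the exact coercion rule is not given here.

```python
def to_int32(x):
    """Coerce a double to a signed 32-bit integer.

    Assumes truncation toward zero and two's-complement wrap-around.
    """
    n = int(x) & 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def bit_and(a, b):
    """Model of bitAND(<double>, <double>); the result is again a double."""
    return float(to_int32(to_int32(a) & to_int32(b)))


def bit_not(a):
    """Model of bitNOT(<double>)."""
    return float(to_int32(~to_int32(a)))


# Unpack the high four bits of a hypothetical encoded byte, 0xA7 (167):
# mask with 0xF0 (240), then divide by 16 to shift the field down.
high_nibble = bit_and(167, 240) / 16
```

In the expression language, the same unpacking would be written along the lines of bitAND(CODE, 240)/16, where CODE is a hypothetical column name.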
nchar(<string>): number of characters in string.
trim(<string>): trim white space from start and end of string.
upperCase(<string>): convert string to upper case.
lowerCase(<string>): convert string to lower case.
substring(<string>, <pos1>, <pos2>): substring from character positions pos1 to pos2.
substring(<string>, <pos1>): substring from character position pos1 to end of string.
indexOf(<string1>, <string2>, <pos>): first position of string2 within string1, starting with character position pos. Returns -1 if not found.
indexOf(<string1>, <string2>): first position of string2 within string1. Returns -1 if not found.
lastIndexOf(<string1>, <string2>, <pos>): last position of string2 within string1, starting with character position pos. Returns -1 if not found.
lastIndexOf(<string1>, <string2>): last position of string2 within string1. Returns -1 if not found.
startsWith(<string1>, <string2>): returns logical true if string1 starts with the string string2, otherwise returns false.
endsWith(<string1>, <string2>): returns logical true if string1 ends with the string string2, otherwise returns false.
contains(<string1>, <string2>): returns logical true if string1 contains the string string2, otherwise returns false.
charToInt(<string>): takes the first character of its string argument, and returns the Unicode character number for the character. If the string is an NA, or has fewer than one character, it returns NA.
intToChar(<double>): converts its double argument to an integer, and returns a string containing a single character with that integer's Unicode character number.
translate(<string1>, <fromchars>, <tochars>): translates the characters in the first argument. For each character in string1, if it appears in the string fromchars, it is replaced by the corresponding character in the string tochars, otherwise it is not changed. For example, translate(NUMSTRING, '.,', ',.') will switch the period and comma characters in a number string. If the length of tochars is less than the length of fromchars, characters from fromchars with no corresponding character in tochars will be deleted. For example, translate(STRING, '$', '') will delete any dollar characters in the string. If any of the three arguments is NA, this function returns NA.
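The replace-or-delete rule of translate can be modelled in a few lines. This Python sketch follows the description above and is not the library's code; None stands in for NA, and using the first occurrence when a character is repeated in fromchars is an assumption.

```python
def translate(s, fromchars, tochars):
    """Model of translate(<string1>, <fromchars>, <tochars>)."""
    if s is None or fromchars is None or tochars is None:
        return None                      # any NA argument gives NA
    out = []
    for ch in s:
        i = fromchars.find(ch)           # first occurrence (assumed)
        if i < 0:
            out.append(ch)               # not listed: left unchanged
        elif i < len(tochars):
            out.append(tochars[i])       # replaced by the corresponding character
        # otherwise there is no corresponding character, so ch is deleted
    return "".join(out)
```

With this model, translate("1.234,56", ".,", ",.") swaps the two separators, and translate("$1,000", "$", "") removes the dollar sign.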
asDate(<string>): convert string to date, using the default date parsing string.
asString(<date>): convert date to string, using the default date formatting string.
asDate(<string>, <parsestring>): convert string to date, using the given date parsing string.
asString(<date>, <formatstring>): convert date to string, using the given date formatting string.
asJulian(<date>): convert date to double: julian days plus fraction of day.
asJulianDay(<date>): convert date to julian day == floor(asJulian(<date>)).
asJulianMsec(<date>): extract number of milliseconds from the beginning of the Julian day for the specified date.
asDateFromJulian(<double>): convert julian day+fraction to date.
asDateFromJulian(<double>, <msec>): convert julian day and milliseconds from the beginning of the Julian day to a date.
asDate(<double>): convert julian day+fraction to date.
asDate(<year>, <month>, <day>): construct date from year, month, day doubles.
asDate(<year>, <month>, <day>, <hour>, <minute>, <second>): construct date from six double values.
asDate(<year>, <month>, <day>, <hour>, <minute>, <second>, <msec>): construct date from seven double values.
now(): return date representing the current date and time.
year(<date>): extract year from date.
month(<date>): extract month from date (1-12).
day(<date>): extract day in month from date (1-31).
hour(<date>): extract hour from date (0-23).
minute(<date>): extract minute from date (0-59).
second(<date>): extract second from date (0-59).
msec(<date>): extract millisecond from date (0-999).
yearday(<date>): extract day of year from date (1-366).
quarter(<date>): extract quarter of year from date (1-4).
weekday(<date>): extract day of week from date (Sun=0, Mon=1, ..., Sat=6).
workday(<date>): returns logical true if weekday is Monday-Friday.
dataRow(): return current row number within whole dataset.
columnMin(<id>): min value for column.
columnMax(<id>): max value for column.
columnMean(<id>): mean value for column.
columnStdev(<id>): standard deviation for column.
columnSum(<id>): sum for column.
countMissing(<id>): number of missing values in named column.
totalRows(): total number of rows in whole dataset.
The functions above that take an <id> argument can take either a plain column reference, or a string constant naming a column. For example, the following two expressions are the same: columnMean(PRICE) and columnMean('PRICE').
ifelse(<logical>, <val1>, <val2>)
ifelse(<logical1>, <val1>, <logical2>, <val2>, <val3>)
In the three-argument case, if the logical argument is true, this returns val1, otherwise it returns val2. In the five-argument case, if logical1 is true, this returns val1, else if logical2 is true, this returns val2, else this returns val3. The ifelse function can also take seven, nine, etc. arguments, to handle additional logical tests. Because of the type checking, val1, val2, etc. must have the same type.
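The pairing of tests and values in the longer forms of ifelse can be modelled as follows. This Python sketch is illustrative only: how an NA test is treated is not stated here, so the sketch assumes that anything other than true simply fails to match, and unlike the expression language it does not check types.

```python
def ifelse(*args):
    """Model of ifelse with 3, 5, 7, ... arguments.

    Arguments are (test, value) pairs followed by one final default value.
    A test that is not exactly True is treated as not matching (assumption).
    """
    if len(args) < 3 or len(args) % 2 == 0:
        raise TypeError("ifelse takes 3, 5, 7, ... arguments")
    *pairs, default = args
    for test, value in zip(pairs[0::2], pairs[1::2]):
        if test is True:
            return value                 # first test that is true wins
    return default


# Hypothetical price banding, in the style of
# ifelse(PRICE<50, 'low', PRICE<150, 'mid', 'high')
price = 120
band = ifelse(price < 50, "low", price < 150, "mid", "high")
```

The tests are tried from left to right, so an earlier true test hides any later one.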
ifequal(<input>, <test1>, <val1>, <val2>)
ifequal(<input>, <test1>, <val1>, <test2>, <val2>, <val3>)
In the four-argument case, if input is equal to test1, this returns val1, otherwise it returns val2. In the six-argument case, if input is equal to test1, this returns val1, else if input is equal to test2, this returns val2, else this returns val3. The ifequal function can also take eight, ten, etc. arguments, to handle additional equality tests. Because of the type checking, input must have the same type as test1, test2, etc., and val1, val2, etc. must have the same type.
oneof(<input>, <test1>, <test2>): returns true if input is equal to any of the other arguments. This can have any number of arguments, but they all must have the same type.
is.na(<any>): returns true if the expression is an NA.
NA(): returns NA value.
get(<column>): access the value of an input column. The column argument must be a column name or a constant string specifying the input column. This is normally used with a string constant as an argument, to access columns whose names don't parse as column references because they contain unusual characters, such as: get('strange chars!').
getNew(<column>): access the newly-computed column value from another column expression in bd.create.columns. The column argument must be a column name or a constant string specifying the column expression. This function will return the newly-computed value for the named column. This allows one column expression to reference the value of another expression. The order in which the expressions are specified does not matter, i.e. it is possible to reference expressions defined after the current expression. However, it is not possible for an expression to refer to its own new value via getNew, directly or through a series of getNew calls in multiple expressions. For example, the following causes an error:
bd.create.columns(fuel.frame, c('getNew(bb)', 'getNew(aa)'), c('aa', 'bb'))
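The two rules for getNew (order does not matter, circular references are errors) are the rules of a dependency ordering. The Python sketch below shows one way such an ordering could be computed; how bd.create.columns actually resolves getNew calls is not documented here, and the function name evaluation_order and the dictionary layout are hypothetical.

```python
def evaluation_order(deps):
    """Order new columns so each is computed after the columns it reads.

    `deps` maps each new column name to the names it reads through getNew.
    Raises ValueError on a circular chain of getNew references.
    """
    order = []
    state = {}                            # 1 = being visited, 2 = finished

    def visit(col, path):
        if state.get(col) == 2:
            return
        if state.get(col) == 1:
            chain = " -> ".join(path + [col])
            raise ValueError("circular getNew reference: " + chain)
        state[col] = 1
        for ref in deps.get(col, []):
            visit(ref, path + [col])
        state[col] = 2
        order.append(col)

    for col in deps:
        visit(col, [])
    return order
```

For the earlier example, where bb reads getNew(aa), the model yields aa before bb whichever way the expressions are listed; for the erroneous call, where aa and bb read each other, it raises an error.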
prev(<column>)
prev(<column>, <lag>)
prev(<column>, <lag>, <fill>)
: access column values from previous and following rows. In the one-argument case, this returns the value of the specified column for the previous row. The column argument must be a column name or a constant string, as in the get function. In the two and three-argument cases, the lag argument specifies which row is accessed. Specifying a lag value of 1 gives the previous row, 2 gives the row before that, and -1 (negative 1) gives the next row. In the three-argument case, the fill argument gives the value to be returned if the specified row is beyond the end of the data set. In the one and two-argument cases, this fill value defaults to an NA value.
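The lag and fill arguments of prev can be modelled on a plain list. In this Python sketch, which is an illustration rather than the library's code, the column is a list, row is a zero-based row index (a convention of the sketch only), and None stands in for NA.

```python
def prev(column, row, lag=1, fill=None):
    """Model of prev(<column>, <lag>, <fill>) evaluated at one row.

    `row` is a zero-based index into `column` (sketch convention).
    A positive lag looks back, a negative lag looks ahead.
    """
    i = row - lag
    if 0 <= i < len(column):
        return column[i]
    return fill                           # row lies outside the data set


x = [10, 20, 30]
lagged = [prev(x, r) for r in range(len(x))]            # previous row
leading = [prev(x, r, -1, 0) for r in range(len(x))]    # next row, fill 0
```

Here lagged starts with a missing value because the first row has no predecessor, and leading ends with the fill value because the last row has no successor.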
diff(<column>)
diff(<column>, <lag>)
diff(<column>, <lag>, <difference>)
: compute differences for a numeric column. This is similar to the S-PLUS diff function, computing the difference between the current value for a column and the value from a previous row. The column argument must be a column name or a constant string, as in the get function, specifying a numeric column. The lag argument, which defaults to one, gives the number of rows back to look. The difference argument, which also defaults to one, specifies the number of iterated differences to compute. If the second or third argument is specified, these must be constant values that are one or greater.
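Lagged and iterated differences can be modelled on a whole column at once. The Python sketch below is an illustration under assumptions: the name diff_column is hypothetical, None stands in for NA, and rows that have no earlier row to subtract are assumed to give NA so that the result has one value per input row.

```python
def diff_column(x, lag=1, difference=1):
    """Model of diff(<column>, <lag>, <difference>) over a whole column."""
    if lag < 1 or difference < 1:
        raise ValueError("lag and difference must be one or greater")
    out = list(x)
    for _ in range(difference):           # apply the lagged difference repeatedly
        out = [
            None
            if i < lag or out[i] is None or out[i - lag] is None
            else out[i] - out[i - lag]
            for i in range(len(out))
        ]
    return out
```

For the column 1, 4, 9, 16, the default call gives the first differences 3, 5, 7 after one leading NA, and a difference argument of 2 gives the constant second differences 2, 2 after two leading NAs.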
tempvar(<varname>, <initval>, <nextVal>): defines a persistent temporary variable. The first argument must be a constant string, giving the name of a temporary variable. This variable is initialized to the value of the second argument, an expression which cannot contain any references to columns or temporary variables. The third argument is evaluated to give the value for the entire tempvar call, and to determine the next value for the temporary variable. A temporary variable with the specified name can occur anywhere in the expression. Its value will be the previously-calculated value defined for that variable. This function is useful for constructing running totals. For example, the expression
tempvar("cumsum", 0, cumsum+x)
will return the cumulative sum of the input column x. Note that the third argument references the previous value of the cumsum temporary variable, and uses this to calculate the new value.
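The way tempvar carries a value from one row to the next can be modelled as a loop with one piece of state. This Python sketch is only a model of the description above: the name running is hypothetical, and the third argument of tempvar is represented by a function of the previous value and the current row.

```python
def running(rows, init, next_val):
    """Model of tempvar(<varname>, <initval>, <nextVal>) over all rows.

    `next_val(previous, row)` plays the role of the third argument: its
    result is both the output for the row and the variable's next value.
    """
    state = init
    out = []
    for row in rows:
        state = next_val(state, row)
        out.append(state)
    return out


# Counterpart of tempvar("cumsum", 0, cumsum+x) for a small column x.
x = [1, 2, 3, 4]
cumulative = running(x, 0, lambda cumsum, value: cumsum + value)
```

The same pattern covers other running statistics; for example, passing max as next_val with an initial value of negative infinity gives a running maximum.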