Expression Language

DESCRIPTION:

The expression language is used to create new column values or to select rows within a data set.

To use the expression language, you must first load the Big Data library.

DETAILS:

The expression language can be used for specifying expressions in the following functions: bd.create.columns, bd.filter.rows, and bd.split. These functions can alternatively evaluate their expressions within S-PLUS.

Expression Language Design:

The expression language was designed to be easily understood by S-PLUS users, as well as users familiar with Excel formulas and other programming languages. In many cases, an expression in the expression language could be executed directly within S-PLUS without change. The reverse is not true: S-PLUS is a complete programming language with many features (user-defined functions, variables, etc.) not supported by the expression language.

An expression is a combination of constants, operators, function calls, and references to input column values that are evaluated to return a single value. For example, the following would be a legal expression if the input columns included a continuous column named PRICE:

max(PRICE+17.5,100)

The expression is evaluated once for every input row, and the value is either output as a new column value (in bd.create.columns) or used as a logical value (in bd.filter.rows and bd.split).

One feature of the expression language that may be unfamiliar to S-PLUS users is that it enforces strong type checking within an expression. Each subexpression within an expression has a single type that can be determined at the time of parsing, and operators and functions check the types of their arguments at parse time. By enforcing type checking, many of the errors that a user would make constructing expressions can be detected during parsing, before trying to execute a transformation on a large dataset.

Strong typing doesn't prevent functions and operators from accepting more than one argument type. For example, the expression language currently supports both addition, such as

<double> + <double>

and concatenation, such as

<string> + <string>.

However, the permissible types are still restricted:

<double> + <logical>

will give a parse error.
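
The type rules for + can be pictured as a lookup that runs before any row is read. The following Python sketch is an illustration only and is not the library's parser. The function name plus_result_type and the string type labels are hypothetical; the rules themselves are the + entries in the operator list of this document.

```python
def plus_result_type(left: str, right: str) -> str:
    """Return the result type of `left + right`, or raise TypeError.

    Hypothetical model of parse-time type checking; types are plain labels.
    """
    if left == "string" or right == "string":
        return "string"   # <string> + <any> and <any> + <string>: concatenation
    if left == "double" and right == "double":
        return "double"   # arithmetic plus
    if {left, right} == {"date", "double"}:
        return "date"     # add a number of days to a date
    raise TypeError(f"parse error: <{left}> + <{right}> is not defined")
```

Because the check depends only on the types, a mistake such as <double> + <logical> is reported once, at parse time, instead of once per row.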

Value Types:

The expression language only supports four types of values:

Doubles (floating point numbers)
Strings
Dates
Logical values (true, false, or NA).

String and categorical values are both manipulated as strings within the expression language. All four types support an NA (missing) value.

Logical values can be created and manipulated within an expression, but cannot be directly read from or written to datasets. When reading from or writing to datasets, logical values are treated as doubles, with zero representing false and any other non-NA value representing true.

NA Handling:

All of the operations and expressions are designed to detect NA (missing) argument values, and work appropriately. In most cases, if any of the arguments to a function are NA, the result is NA. There are some exceptions, such as ifelse, | (or), and & (and). For example, given the expression A|B, if A is true, the result is true, even if B is NA.
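
The behavior of | and & with NA arguments can be modelled as three-valued logic. The following Python sketch is an illustration only, with None standing in for NA and the hypothetical names na_or and na_and. The rule for | with a true argument comes from the paragraph above; the matching rule for & (a false argument gives false even if the other is NA) is an assumption made by analogy.

```python
NA = None  # stand-in for the expression language's NA value

def na_or(a, b):
    """Model of A|B: a true argument decides the result even if the other is NA."""
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False

def na_and(a, b):
    """Model of A&B (assumed): a false argument decides the result."""
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True
```

In this model na_or(True, NA) is true, while na_or(False, NA) remains NA because the missing value could still change the answer.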

One counterintuitive result is that string manipulations return NA if any of their arguments are NA. Consider the expression:

'The value is: '+PRICE

If the value of the PRICE column is the number 1.3, this constructs a string

'The value is: 1.3'

If the value of the PRICE column is NA, then the result is a string NA value. The is.na function can be used to explicitly detect NA values.

Error Handling:

Most errors can be detected at parsing time. These include simple parsing errors (like unbalanced parentheses) and type errors. Currently, the expression language operations and functions do not generate any run-time errors. For example, taking the square root of a negative number will return an NA value, rather than causing an error.

Column References:

A column reference is a name that looks like a variable in an expression. For example, take the name FOO in the expression FOO+1. It can be distinguished from a function name because every function name must be followed by an opening parenthesis '('.

A column reference name is a sequence of alphabetic and numeric characters, along with the underscore (_) and period (.) characters. A name cannot start with a digit 0-9. Column reference names are case-sensitive. The following are examples of valid column reference names:

myCol, abc_3, xyz.4

An expression is evaluated once for every row in the input dataset. A column reference is evaluated by retrieving the value of the named column in the input dataset, for the current row.

Column names with disallowed characters (like spaces) can be accessed by calling the function get with a string constant specifying the column name, as follows:

get('strange chars!')
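
When expressions are generated by a program, the naming rule above can be used to decide whether a column needs get. The following Python sketch is one possible helper, not part of the Big Data library. The name column_reference is hypothetical, and "alphabetic" is assumed to mean ASCII letters only. It writes the get argument as a double-quoted string constant, escaping backslash and double-quote characters as described under Double and String Constants.

```python
import re

# Assumption: "alphabetic" means ASCII letters; a name may not start with a digit.
_NAME = re.compile(r"[A-Za-z_.][A-Za-z0-9_.]*")

def column_reference(name: str) -> str:
    """Return expression text that refers to the input column `name`."""
    if _NAME.fullmatch(name):
        return name  # usable directly as a column reference
    # Otherwise wrap the name in get("...") with backslash escapes.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return 'get("' + escaped + '")'
```

For example, column_reference("abc_3") returns the name unchanged, while a name containing a space is wrapped in a get call.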

Note that column references always refer to the values in the input dataset. Using bd.create.columns, it is possible to specify multiple expressions that create new columns and change the values of input columns. However, even in this case, column references always refer to the input columns, not the newly-computed output columns. Sometimes it is useful to compute one new column, and then compute another new column based on that newly-computed value. This could be done by copying the entire expression for one column into the other expression, or by calling bd.create.columns multiple times. However, it is easier to use the function getNew, which takes as its argument a column reference or constant string, and references the newly-computed value of that column. For example, the following code creates a new column aa, and then computes a new column bb based on the new value for aa:

bd.create.columns(fuel.frame, c('Weight+10', 'getNew(aa)*2'), c('aa', 'bb'))

Double and String Constants:

Normal double and string constants are supported. Doubles may have a decimal point and exponential notation. Strings are delimited by double-quote characters, and may use backslashes to include double-quote and backslash characters within the string. The normal backslash codes also work (\r, \n). Unicode characters can be specified with escapes such as \u0234. As in S-PLUS, strings may also be delimited with single quotes, in which case double-quote characters can appear unquoted. Some examples of double and string constants follow:

0.123, 12.34e34, -12 : numeric constants.
"foo", "x\ny\"z" : string constants delimited by double quotes.
'foo', 'x\ny"y' : string constants delimited by single quotes.

Expression Language Operators:

The operators are the normal arithmetic and logical operators, such as

+ - * / %% ^ | & > < >= <= ! != ==

Parentheses can be used to alter the evaluation order. Most of the operators are only defined for numbers, and will give an error when applied to non-double values.

<double> + <double>: arithmetic plus.

<double> - <double>: arithmetic minus.

<double> * <double>: arithmetic multiply.

<double> / <double> : arithmetic divide.

<double> %% <double>: arithmetic remainder.

<double> ^ <double>: arithmetic exponentiation.

<string> + <any> : string concatenation.

<any> + <string> : string concatenation.

<date> + <double> : add number of days to date.

<double> + <date> : add number of days to date.

<date> - <date> : returns number of days between two dates (with fraction of day).

<date> - <double> : subtracts number of days from date.

<any> == <any> : compare two doubles, strings, dates, logicals.

<any> != <any> : compare two doubles, strings, dates, logicals.

<any> < <any> : compare two doubles, strings, dates.

<any> > <any>: compare two doubles, strings, dates.

<any> <= <any> : compare two doubles, strings, dates.

<any> >= <any> : compare two doubles, strings, dates.

<logical> & <logical> : returns true if both operands are true.

<logical> | <logical> : returns true if either operand is true.

- <double>: unary minus.

+ <double>: unary plus.

! <double>: logical not.

Functions:

The expression language provides a fixed set of functions. A function is called by giving the function name, followed by an opening parenthesis, zero or more expressions separated by commas, and a closing parenthesis. Spaces are allowed between the function name and the opening parenthesis. S-PLUS-style named arguments are not allowed.

Conversion Functions:

asString(<any>): convert expression value to string.

asDouble(<string>): convert string to double.

formatDouble(<double>, <decimal symbols string>, <num digits double>) : convert double to string, using the given number of digits and the given decimal symbols string, which consists of the decimal point character followed by the thousands separator character. For example:

formatDouble(2002.05123, ".'", 2)

parseDouble(<string>, <decimal symbols string>) : convert string to double, using the given decimal symbols string.

formatDate(<date>, <formatstring>) : convert date to string, using the given date formatting string.

parseDate(<string> , <parsestring>) : convert string to date, using the given date parsing string.

Numeric Functions:

max(<double> , <double>) : maximum of two double values.

min(<double> , <double>) : minimum of two double values.

abs(<double>): absolute value of double.

ceiling(<double>): smallest integer greater than or equal to the value.

floor(<double>): largest integer less than or equal to the value.

round(<double>): integer nearest to the value.

int(<double>): integer part of value (closest integer between the value and zero).

sqrt(<double>): square root.

exp(<double>): e raised to the given value.

log(<double>): natural log of the value.

log10(<double>): log to base 10 of the value.

sin(<double>): sine of the value.

cos(<double>): cosine of the value.

tan(<double>): tangent of the value.

asin(<double>): arcsine of the value.

acos(<double>): arccos of the value.

atan(<double>): arctangent of the value.

random() : random value uniformly distributed in the range 0.0 to 1.0.

randomGaussian() : value selected from Gaussian distribution with mean=0.0, stdev=1.0.

Inf() : positive infinity. Negative infinity can be generated with -Inf().

bitAND(<double> , <double>) : bitwise AND.

bitOR(<double> , <double>) : bitwise OR.

bitXOR(<double> , <double>) : bitwise XOR.

bitNOT(<double>): bitwise complement.

For the bitwise functions, the arguments are coerced to 32-bit integers before performing the operation. These can be used to unpack bits from encoded numbers.
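
As an illustration of unpacking bits, a field can be extracted with an expression such as bitAND(floor(PACKED/16), 15), where PACKED is a hypothetical input column. The expression language lists no shift operator, so the shift is done by dividing by a power of two. The following Python sketch models that calculation. It is not the library's code: the names to_int32, bit_and, and unpack_field are hypothetical, and the coercion is assumed to truncate toward zero and wrap to a signed 32-bit integer. NA handling is not modelled.

```python
import math

def to_int32(x: float) -> int:
    """Coerce a double to a signed 32-bit integer (assumed: truncate, then wrap)."""
    n = int(x) & 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

def bit_and(a: float, b: float) -> float:
    """Model of bitAND(a, b): coerce both arguments, then AND them."""
    return float(to_int32(a) & to_int32(b))

def unpack_field(packed: float, shift: int, width: int) -> float:
    """Model of bitAND(floor(packed / 2^shift), 2^width - 1)."""
    shifted = math.floor(packed / 2 ** shift)
    return bit_and(shifted, 2 ** width - 1)

# 181 is binary 1011 0101: the low four bits are 0101, the next four are 1011.
low_nibble = unpack_field(181, 0, 4)
high_nibble = unpack_field(181, 4, 4)
```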

String Functions:

nchar(<string>): number of characters in string.

trim(<string>): trim white space from start and end of string.

upperCase(<string>): convert string to upper case.

lowerCase(<string>): convert string to lower case.

substring(<string> , <pos1> , <pos2>): substring from character positions pos1 to pos2.

substring(<string> , <pos1>): substring from character position pos1 to end of string.

indexOf(<string1> , <string2> , <pos>): first position of string2 within string1, starting with character position pos. Returns -1 if not found.

indexOf(<string1> , <string2>): first position of string2 within string1. Returns -1 if not found.

lastIndexOf(<string1> , <string2> , <pos>): last position of string2 within string1, starting with character position pos. Returns -1 if not found.

lastIndexOf(<string1> , <string2>): last position of string2 within string1. Returns -1 if not found.

startsWith(<string1> , <string2>): returns logical true if string1 starts with the string string2, otherwise returns false.

endsWith(<string1> , <string2>): returns logical true if string1 ends with the string string2, otherwise returns false.

contains(<string1> , <string2>): returns logical true if string1 contains the string string2, otherwise returns false.

charToInt(<string>): takes the first character of its string argument, and returns the Unicode character number for that character. If the string is NA or empty, it returns NA.

intToChar(<double>): converts its double argument to an integer, and returns a string containing a single character with that integer's Unicode character number.

translate(<string1> , <fromchars> , <tochars>): translates the characters in the first argument. For each character in string1, if it appears in the string fromchars, it is replaced by the corresponding character in the string tochars; otherwise it is not changed. For example, translate(NUMSTRING, '.,', ',.') will switch the period and comma characters in a number string. If tochars is shorter than fromchars, characters in fromchars with no corresponding character in tochars are deleted. For example, translate(STRING, '$', '') will delete any dollar characters in the string. If any of the three arguments is NA, this function returns NA.
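
The translate rules can be written out as a short Python model. This is an illustrative sketch, not the library's implementation, with None standing in for NA. When a character occurs more than once in fromchars, the sketch uses its first occurrence, which is an assumption.

```python
def translate(string1, fromchars, tochars):
    """Model of translate(): map, keep, or delete each character of string1."""
    if string1 is None or fromchars is None or tochars is None:
        return None  # any NA argument gives NA
    out = []
    for ch in string1:
        i = fromchars.find(ch)
        if i < 0:
            out.append(ch)          # not in fromchars: unchanged
        elif i < len(tochars):
            out.append(tochars[i])  # replaced by the corresponding character
        # otherwise: no corresponding character in tochars, so ch is deleted
    return "".join(out)

swapped = translate("1.234,56", ".,", ",.")
stripped = translate("$1,000", "$", "")
```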

Date Manipulation Functions:

asDate(<string>): convert string to date, using the default date parsing string.

asString(<date>): convert date to string, using the default date formatting string.

asDate(<string> , <parsestring>) : convert string to date, using the given date parsing string.

asString(<date> , <formatstring>) : convert date to string, using the given date formatting string.

asJulian(<date>): convert date to double: Julian days plus fraction of day.

asJulianDay(<date>): convert date to Julian day == floor(asJulian(<date>)).

asJulianMsec(<date>): extract number of milliseconds from the beginning of the Julian day for the specified date.

asDateFromJulian(<double>): convert Julian day plus fraction to date.

asDateFromJulian(<double>, <msec>): convert Julian day and milliseconds from the beginning of the Julian day to a date.

asDate(<double>): convert Julian day plus fraction to date.

asDate(<year> , <month> , <day>): construct date from year,month,day doubles.

asDate(<year> , <month> , <day> , <hour> , <minute> , <second>): construct date from six double values.

asDate(<year> , <month> , <day> , <hour> , <minute> , <second>, <msec>): construct date from seven double values.

now() : return date representing the current date and time.

year(<date>): extract year from date.

month(<date>): extract month from date (1-12).

day(<date>): extract day in month from date (1-31).

hour(<date>): extract hour from date (0-23).

minute(<date>): extract minute from date (0-59).

second(<date>): extract second from date (0-59).

msec(<date>): extract millisecond from date (0-999).

yearday(<date>): extract day of year from date (1-366).

quarter(<date>): extract quarter of year from date (1-4).

weekday(<date>): extract day of week from date (Sun=0, Mon=1, ..., Sat=6).

workday(<date>): returns logical true if weekday is Monday-Friday.
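
The relationship between the values returned by asJulian, asJulianDay, and asJulianMsec can be sketched numerically. The following Python model is an illustration only and works on plain numbers, not on date values. The names split_julian and join_julian are hypothetical, and rounding the fraction to the nearest millisecond is an assumption.

```python
import math

MSEC_PER_DAY = 24 * 60 * 60 * 1000

def split_julian(julian: float):
    """Split a Julian day-plus-fraction into (whole day, milliseconds into the day)."""
    day = math.floor(julian)                         # as in asJulianDay
    msec = round((julian - day) * MSEC_PER_DAY)      # as in asJulianMsec (assumed rounding)
    return day, msec

def join_julian(day: int, msec: int) -> float:
    """Recombine a whole day and a millisecond offset into day plus fraction."""
    return day + msec / MSEC_PER_DAY

day, msec = split_julian(2451545.75)  # three quarters of the way through a day
```

The pair (day, msec) corresponds to the arguments of the two-argument form of asDateFromJulian, and the recombined value to the argument of the one-argument form.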

Dataset Functions:

dataRow() : return current row number within whole dataset.

columnMin(<id>): min value for column.

columnMax(<id>): max value for column.

columnMean(<id>): mean value for column.

columnStdev(<id>): standard deviation for column.

columnSum(<id>): sum for column.

countMissing(<id>): number of missing values in named column.

totalRows() : total number of rows in whole dataset.

The functions above that take an argument can take either a plain column reference, or a string constant naming a column. For example, the following two expressions are the same:

columnMean(PRICE)

and

columnMean('PRICE')

Miscellaneous Functions:

ifelse(<logical> , <val1> , <val2>)
ifelse(<logical1> , <val1> , <logical2>, <val2>, <val3>)
In the three-argument case, if logical arg is true, this returns val1, otherwise it returns val2. In the five-argument case, if logical1 arg is true, this returns val1, else if logical2 arg is true, this returns val2, else this returns val3. The ifelse function can also take seven, nine, etc. arguments, to handle additional logical tests. Because of the type checking, val1, val2, etc. must have the same type.
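
The multi-argument form behaves like a chain of tests evaluated in order, as in the expression ifelse(PRICE<10, 'low', PRICE<100, 'mid', 'high') for a hypothetical column PRICE. The following Python sketch models the selection rule. It is not the library's implementation; in particular, treating an NA condition (None here) as not true is an assumption, and the type checking of the values is not modelled.

```python
def ifelse(*args):
    """Model of ifelse with 3, 5, 7, ... arguments: test/value pairs plus a default."""
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("ifelse takes 3, 5, 7, ... arguments")
    *pairs, default = args
    # Walk the (condition, value) pairs in order and return the first match.
    for cond, value in zip(pairs[0::2], pairs[1::2]):
        if cond is True:
            return value
    return default

def price_band(price: float) -> str:
    """Python equivalent of the example expression above."""
    return ifelse(price < 10, "low", price < 100, "mid", "high")
```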

ifequal(<input> , <test1> , <val1>, <val2>)
ifequal(<input> , <test1> , <val1>, <test2>, <val2>, <val3>)
In the four-argument case, if input arg is equal to test1, this returns val1, otherwise it returns val2. In the six-argument case, if input arg is equal to test1, this returns val1, else if input arg is equal to test2, this returns val2, else this returns val3. The ifequal function can also take eight, ten, etc. arguments, to handle additional equality tests. Because of the type checking, input must have the same type as test1, test2, etc., and val1, val2, etc. must have the same type.

oneof(<input> , <test1> , <test2>): returns true if input is equal to any of the other arguments. This function can take any number of arguments, but they must all have the same type.

is.na(<any>): returns true if the expression is an NA.

NA() : returns NA value.

get(<column>)
: access the value of an input column. The column argument must be a column name or a constant string specifying the input column. This is normally used with a string constant as an argument, to access columns whose names don't parse as column references because they contain unusual characters, such as: get('strange chars!').

getNew(<column>)
: access the newly-computed column value from another column expression in bd.create.columns. The column argument must be a column name or a constant string specifying the column expression. This function will return the newly-computed value for the named column. This allows one column expression to reference the value of another expression. The order in which the expressions are specified does not matter, i.e. it is possible to reference expressions defined after the current expression. However, it is not possible for an expression to refer to its own new value via getNew, directly or through a series of getNew calls in multiple expressions. For example, the following causes an error: bd.create.columns(fuel.frame, c('getNew(bb)', 'getNew(aa)'), c('aa', 'bb')).
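
One way to picture these rules is as a dependency graph between the new columns: expressions are evaluated in an order that satisfies their getNew references, and a cycle is an error. The following Python sketch (Python 3.9 or later, for the standard graphlib module) illustrates this idea only. How the Big Data library actually resolves getNew is not described here, and evaluate_row and the structure of exprs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def evaluate_row(row, exprs):
    """Evaluate new columns for one input row.

    exprs maps each new column name to a pair:
    (names it reads via getNew, function of (input row, new values so far)).
    """
    graph = {name: set(deps) for name, (deps, _) in exprs.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError("circular getNew reference") from err
    new = {}
    for name in order:
        _, fn = exprs[name]
        new[name] = fn(row, new)
    return new

# Model of c('Weight+10', 'getNew(aa)*2') with names c('aa', 'bb'),
# deliberately listed with bb first to show that order does not matter.
exprs = {
    "bb": (["aa"], lambda row, new: new["aa"] * 2),
    "aa": ([], lambda row, new: row["Weight"] + 10),
}
result = evaluate_row({"Weight": 2560}, exprs)
```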

prev(<column>)
prev(<column> , <lag>)
prev(<column> , <lag> , <fill>)
: access column values from previous and following rows. In the one-argument case, this returns the value of the specified column for the previous row. The column argument must be a column name or a constant string, as in the get function. In the two and three-argument cases, the lag argument specifies which row is accessed. Specifying a lag value of 1 gives the previous row, 2 gives the row before that, and -1 (negative 1) gives the next row. In the three-argument case, the fill argument gives the value to be returned if the specified row lies outside the data set (before the first row or after the last row). In the one and two-argument cases, this fill value defaults to an NA value.
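
The lag and fill rules can be modelled on a Python list. This is an illustrative sketch, not the library's implementation: the column is a plain list, row is a zero-based index of the current row (both hypothetical conventions), and None stands in for NA.

```python
def prev(column, row, lag=1, fill=None):
    """Model of prev(): value of `column` at `lag` rows before row index `row`."""
    i = row - lag                 # a negative lag looks at following rows
    if 0 <= i < len(column):
        return column[i]
    return fill                   # requested row lies outside the data set

prices = [10.0, 20.0, 30.0]
previous_values = [prev(prices, r) for r in range(len(prices))]
next_values = [prev(prices, r, -1, 0.0) for r in range(len(prices))]
```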

diff(<column>)
diff(<column> , <lag>)
diff(<column> , <lag> , <difference>)
: compute differences for a numeric column. This is similar to the S-PLUS diff function, computing the difference between the current value for a column and the value from a previous row. The column argument must be a column name or a constant string, as in the get function, specifying a numeric column. The lag argument, which defaults to one, gives the number of rows back to look. The difference argument, which also defaults to one, specifies the number of iterated differences to compute. If the second or third argument is specified, these must be constant values that are one or greater.
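
The effect of the lag and difference arguments can be modelled on a Python list. This sketch is an illustration only, not the library's implementation. It assumes that rows without enough earlier rows yield NA (None here), consistent with the default fill value of prev.

```python
def lagged_difference(values, lag):
    """One pass: current value minus the value `lag` rows back (None if unavailable)."""
    out = []
    for i, v in enumerate(values):
        p = values[i - lag] if i >= lag else None
        out.append(None if v is None or p is None else v - p)
    return out

def diff(values, lag=1, difference=1):
    """Model of diff(): apply the lagged difference `difference` times."""
    if lag < 1 or difference < 1:
        raise ValueError("lag and difference must be one or greater")
    for _ in range(difference):
        values = lagged_difference(values, lag)
    return values

squares = [1, 4, 9, 16]
first = diff(squares)            # differences between adjacent rows
second = diff(squares, 1, 2)     # differences of those differences
lag_two = diff(squares, 2)       # each row minus the row two back
```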

tempvar(<varname>, <initval>, <nextVal>)
: defines a persistent temporary variable. The first argument must be a constant string, giving the name of a temporary variable. This variable is initialized to the value of the second argument, an expression which cannot contain any references to columns or temporary variables. The third argument is evaluated to give the value for the entire tempvar call, and to determine the next value for the temporary variable. A temporary variable with the specified name can occur anywhere in the expression. Its value will be the previously-calculated value defined for that variable. This function is useful for constructing running totals. For example, the expression tempvar("cumsum", 0, cumsum+x) will return the cumulative sum of the input column x. Note that the third argument references the previous value of the cumsum temporary variable, and uses this to calculate the new value.
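
The row-by-row evaluation of tempvar can be modelled as a loop that carries one value from row to row. The following Python sketch is an illustration only, not the library's implementation. The name run_tempvar is hypothetical, the third argument is represented as a function of the previous variable value and the current row value, and NA handling is not modelled.

```python
def run_tempvar(rows, init, next_val):
    """Model of tempvar(name, init, nextVal) evaluated over a column of values."""
    state = init                      # second argument: the initial value
    out = []
    for x in rows:
        state = next_val(state, x)    # third argument, using the previous value
        out.append(state)             # the value of the tempvar call for this row
    return out

# Model of tempvar("cumsum", 0, cumsum+x)
running_total = run_tempvar([1, 2, 3, 4], 0, lambda cumsum, x: cumsum + x)

# Model of tempvar("runmax", -Inf(), max(runmax, x)), a running maximum
running_max = run_tempvar([3, 1, 4, 1, 5], float("-inf"), lambda m, x: max(m, x))
```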