Apply Function to Data Blocks

DESCRIPTION:

Apply an arbitrary S-PLUS function to multiple data blocks within the input dataset.

This function requires the bigdata library section to be loaded.

USAGE:

bd.by.group(data, by.columns, FUN, args = NULL,
             output=T, sort=T)

REQUIRED ARGUMENTS:

data
input data set, a bdFrame or data.frame.
by.columns
names or numbers of columns defining how the input data is divided into blocks.
FUN
a function of a single argument (a data frame). The data for each input data block is converted to a data frame and passed to this function.

FUN is an S-PLUS function that is called to process each data frame. It must not perform any big data operations itself; doing so generates an error.

OPTIONAL ARGUMENTS:

args
list of additional arguments passed to the function. If this is NULL, then the FUN function should have only one argument, the input data block. If this is a list, then the elements are passed as additional arguments to the FUN function. If the list elements have names, these must match argument names for the FUN function.
output
determines whether this function collects the values computed by the FUN function. Set this to FALSE to execute a function only for its side effects.
sort
determines whether the input data is sorted by by.columns before it is divided into blocks. If FALSE, the data is not sorted first.

VALUE:

If the output argument is TRUE, this function returns a bdFrame or data.frame, of the same type as data, formed by appending the data frames returned by the FUN function. If the output argument is FALSE, this function returns NULL.

DETAILS:

This function applies the S-PLUS function (FUN) to multiple data blocks within the input data, as defined by by.columns. Each data block is converted to a data.frame and passed to the FUN function. If a data block is too large to fit in memory, an error occurs. Because it supports any S-PLUS function rather than a fixed set of aggregation functions, this function is very flexible, but it has the limitation that every data block must fit into memory.

Each unique combination of values in the columns by.columns that appears in the data defines one data block. Normally, these columns contain strings or factors with a limited number of unique values, but this function will work with any column type. For example, if one of the columns in by.columns contains numeric data with different values for each row, then the input data will be divided into blocks with one row each.

If sort is TRUE, the input data is first sorted by the columns in by.columns, so each block is guaranteed to have a distinct combination of values for these columns. If sort is FALSE, the input data is not sorted, and the blocks are determined by scanning through the rows in order: whenever any of the by.columns values changes, a new block begins. Rows with the same by.columns values that are not adjacent therefore end up in separate blocks. Setting sort to FALSE is normally done when the data is already sorted, to avoid an unnecessary sort.

EXAMPLES:

## Divide fuel.frame into blocks defined by the Type column,
## and for each block compute minWeight (the minimum value
## of the Weight column) and blockSize (the number of rows
## in the block).
bd.by.group(fuel.frame, "Type",
             function(df)
                 data.frame(Type=df[1,"Type"],
                            minWeight=min(df$Weight),
                            blockSize=nrow(df)))
## Divide fuel.frame into blocks defined by the Type column,
## and print each of these block data frames.
## Returns NULL.
bd.by.group(fuel.frame, "Type",
             function(df) print(df),
             output=F)
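## A sketch of grouping by more than one column, using a small
## made-up data frame (x2 is hypothetical, not a library data
## set).  As described in DETAILS, each combination of g1 and g2
## values that appears in the data should define one block, and
## n is the number of rows in that block.
x2 <- data.frame(g1=c("a","a","b","b"),
                 g2=c("p","q","p","p"),
                 v=1:4)
bd.by.group(x2, c("g1","g2"),
             function(df)
                 data.frame(g1=df[1,"g1"],
                            g2=df[1,"g2"],
                            n=nrow(df)))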
## Divide fuel.frame into blocks defined by the Type column,
## and print each of these block data frames, ignoring some
## types specified via the args argument.
## Returns NULL.
bd.by.group(fuel.frame, "Type",
  function(df, ignore.types)
    if (!is.element(df$Type[1], ignore.types)) print(df),
  output=F, args=list(ignore.types=c("Compact","Large")))
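## A sketch of the effect of the sort argument, using a small
## made-up data frame (x is hypothetical) whose rows are
## deliberately not sorted by g.  As described in DETAILS, with
## sort=F the blocks are found by scanning the rows in order, so
## the two runs of "a" rows should be treated as separate blocks.
x <- data.frame(g=c("a","a","b","a"), v=1:4)
bd.by.group(x, "g",
             function(df)
                 data.frame(g=df[1,"g"], n=nrow(df)),
             sort=F)
## With the default sort=T, the data is sorted by g first, so all
## of the "a" rows should fall into a single block.
bd.by.group(x, "g",
             function(df)
                 data.frame(g=df[1,"g"], n=nrow(df)))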