Column Aggregate Values Within Data Blocks

DESCRIPTION:

Divide a data object into blocks according to the values of one or more columns, and then apply aggregation functions to columns within each block.

This function requires the bigdata library section to be loaded.

USAGE:

bd.aggregate(data, columns=NULL, by.columns, methods="mean",
                  names.=NULL, sort=T)

REQUIRED ARGUMENTS:

data
input data set. A bdFrame or data.frame.
by.columns
names or numbers of columns defining how the input data is divided into blocks.

OPTIONAL ARGUMENTS:

columns
names or numbers of columns to be summarized.
methods
vector of summary methods to be calculated for the columns in columns. If this is shorter than columns, then the values are repeated to produce an equal length vector.
names.
names of output columns for summary values. If not given, this defaults to names built from the values in columns and methods.
sort
a logical value; if FALSE, do not sort the input data by by.columns first.

VALUE:

a bdFrame or data.frame of the same type as data. This contains one row for each block in the input defined by by.columns. The result contains all of the columns in by.columns, as well as all of the columns defined by names..

DETAILS:

Use this function to apply any of a fixed set of aggregation functions to one or more columns. The aggregation functions are applied to multiple data blocks within the input data, as defined by by.columns.

Each unique combination of values in the columns by.columns that appears in the data defines one data block. Normally, these columns contain strings or factors with a limited number of unique values, but this function works with any column type. For example, if one of the columns in by.columns contains numeric data with different values for each row, then the input data is divided into blocks with one row each.
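As a rough illustration of how unique combinations define blocks, the following base S sketch builds one key per row from two grouping columns and counts the distinct keys. The vectors type and cyl are made-up data, and pasting the values together is only a convenient way to show the idea, not how bd.aggregate works internally.

```r
# Hypothetical grouping columns (made-up values, not from fuel.frame).
type <- c("Small", "Small", "Van", "Van", "Small")
cyl  <- c(4, 4, 6, 4, 4)

# One key per row; each distinct key corresponds to one data block.
key <- paste(type, cyl, sep="/")

n.blocks    <- length(unique(key))   # 3 distinct combinations
block.sizes <- table(key)            # number of rows in each block
```

Only the combinations that actually occur are counted: "Van/6" and "Van/4" each form a block, but no block exists for a "Small/6" combination because no row has it.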

If sort is TRUE, then the input data is first sorted by the columns in by.columns, so each block is guaranteed to have a unique combination of values for these columns. If sort is FALSE, then the input data is not sorted, and the blocks are determined by scanning through the rows in order: whenever any of the by.columns values changes, a new block begins. In that case, the same combination of values can yield more than one block if its rows are not adjacent. If the data is already sorted, specify sort as FALSE to avoid an unnecessary sort.
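The row scan used when sort is FALSE can be pictured with a few lines of base S code. This is a sketch of the rule described above, not the bigdata implementation; the helper block.ids and the vector type.col are hypothetical.

```r
# Hypothetical helper: number the blocks by scanning the rows in order
# and starting a new block whenever the key value changes.
block.ids <- function(key) {
    key <- as.character(key)
    n <- length(key)
    if (n == 0) return(numeric(0))
    changed <- c(TRUE, key[-1] != key[-n])
    cumsum(changed)
}

type.col <- c("Small", "Small", "Van", "Small", "Van", "Van")

# Unsorted scan: "Small" and "Van" each start two separate blocks.
unsorted.ids <- block.ids(type.col)        # 1 1 2 3 4 4

# Sorting first brings equal values together: one block per value.
sorted.ids <- block.ids(sort(type.col))    # 1 1 1 2 2 2
```

The unsorted scan yields four blocks where the sorted data yields two, which is why sort=FALSE is best reserved for data that is already grouped by by.columns.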

Within each data block defined by by.columns and sort, apply aggregation functions to particular data columns, as specified by columns, methods, and names..

The argument columns specifies a set of input columns to be processed. A given column can appear more than once in this argument, to calculate multiple aggregation functions on it.

The argument methods specifies, for each element of columns, the aggregation function to calculate for that column. If methods is shorter than columns, its values are repeated to match. There is a fixed set of possible aggregation functions, described below.
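The pairing of columns with a shorter methods vector can be sketched in base S as follows. The column names are illustrative, and the use of rep here is an assumption about how the repetition behaves, based on the description of the methods argument.

```r
# Each column is listed twice so that two functions apply to it.
columns <- rep(c("Weight", "Mileage"), each=2)
methods <- c("min", "max")

# Repeat methods until it is as long as columns.
methods.full <- rep(methods, length.out=length(columns))

# Resulting (column, method) pairs, one per output column.
pairs <- paste(columns, methods.full, sep=":")
# "Weight:min" "Weight:max" "Mileage:min" "Mileage:max"
```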

The argument names. specifies the output column names used to output each of the computed aggregate function values. If you do not specify names., then default names are created by concatenating the input column name with the aggregation function. For example, the column name x.mean results from an input column name x and an aggregation function "mean".
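A minimal sketch of the default naming, assuming the names are formed exactly as in the x.mean example (input column name, a period, then the method):

```r
columns <- c("Weight", "Weight", "Mileage")
methods <- c("min", "max", "mean")

# Assumed default: column name and method joined by a period.
default.names <- paste(columns, methods, sep=".")
# "Weight.min" "Weight.max" "Mileage.mean"
```

Supplying names. explicitly is worthwhile when the defaults would be unclear to readers of the output, for example when "count" is computed on an arbitrary column.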

The possible aggregation functions that can appear in the value of the methods argument are as follows:

"sum"
Compute the sum of the column values.
"count"
Count the number of elements in the data block. This computes the same value, no matter which column is the input column.
"mean"
Compute the mean value of the column values.
"min"
Compute the minimum value of the column values.
"max"
Compute the maximum value of the column values.
"stdDev"
Compute the standard deviation of the column values.
"var"
Compute the variance of the column values.
"range"
Compute the range of the column values, defined as the maximum value minus the minimum value.
"first"
Return the first value in the column within each block.
"last"
Return the last value in the column within each block.
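For a single block, several of these definitions can be written directly in base S. The sketch below uses made-up values and covers only the functions whose definitions are stated explicitly above; it is not the bigdata implementation.

```r
# Values of one column within one block (made-up numbers).
block <- c(2560, 2345, 1845, 2260)

agg <- c(count = length(block),            # number of elements in the block
         range = max(block) - min(block),  # maximum minus minimum
         first = block[1],                 # first value in the block
         last  = block[length(block)])     # last value in the block
```

Note that "range" here is a single number, unlike the base S range function, which returns the minimum and maximum as a pair.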

Some of the aggregation functions (such as "sum" and "mean") are well-defined only if the input column is numeric. If the column is non-numeric, then the computed value is undefined. The numeric functions also handle missing values specially: for example, "mean" computes the mean of the non-missing values only, and it returns NA only if all of the column values in a block are NA.
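The missing-value rule for "mean" can be sketched in base S as follows. The helper block.mean is hypothetical and simply mirrors the description above.

```r
# Hypothetical helper: mean of the non-missing values in a block,
# or NA when every value in the block is missing.
block.mean <- function(x) {
    ok <- !is.na(x)
    if (!any(ok)) return(NA)
    sum(x[ok]) / sum(ok)
}

m1 <- block.mean(c(1, NA, 3))   # 2: the NA is ignored
m2 <- block.mean(c(NA, NA))     # NA: no non-missing values
```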

EXAMPLES:

## Divide fuel.frame into blocks defined by the Type column,
## and for each block compute minWeight (the minimum value
## of the Weight column) and blockSize (the number of rows
## in the block).
bd.aggregate(fuel.frame, 
                  columns=c("Weight", "Weight"),
                  by.columns="Type",
                  methods=c("min", "count"),
                  names.=c("minWeight", "blockSize"))
## Compute the min, max, mean of each of the first four
## columns of fuel.frame, within the blocks defined by
## the Type column.  The output column names default
## to "Weight.min", "Weight.max", etc.
bd.aggregate(fuel.frame, columns=rep(1:4,each=3),
     by.columns="Type", methods=c("min", "max", "mean"))