This function requires the bigdata library section to be loaded.
bd.by.group(data, by.columns, FUN, args = NULL, output=T, sort=T)
The data argument is a bdFrame or data.frame.
The FUN argument is an S-PLUS function that is called to process a data frame. This function itself cannot perform any big data operations; if it does, an error is generated.
If args is NULL, then the FUN function should have only one argument, the input data block. If args is a list, then its elements are passed as additional arguments to the FUN function. If the list elements have names, these must match argument names for the FUN function.
If output is TRUE, the result is built from the data frames returned by the FUN function. This could be set to FALSE to execute a function with side effects.
If sort is FALSE, the input data is not sorted by by.columns first.
If the output argument is TRUE, this function returns a bdFrame or data.frame, of the same type as data, formed by appending the data frames output by the FUN function. If the output argument is FALSE, this function returns NULL.
This function applies the S-PLUS function (FUN) to multiple data blocks within the input data as defined by by.columns. Each data block is converted to a data.frame, and passed to the FUN function. If one of the data blocks is too large to fit in memory, an error will occur.
Because it supports any S-PLUS function, rather than a fixed set of aggregation functions, this function is more flexible than aggregation-only alternatives, but it has the limitation that all of the data blocks must fit into memory.
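The split, apply, and append behavior described above can be approximated for an ordinary in-memory data frame. The following is only a sketch written in portable S syntax, not the bigdata implementation: by.group.sketch is a hypothetical helper, the toy data values are made up for illustration, and blocks are ordered by a pasted string key, which may differ from the ordering that bd.by.group uses.

```r
## Hypothetical helper (not part of the bigdata library): split a data
## frame into blocks by the by.columns values, call FUN on each block,
## and append the data frames that FUN returns.
by.group.sketch <- function(data, by.columns, FUN)
{
    ## Build one string key per row from the by.columns values.
    keys <- data[, by.columns, drop = FALSE]
    key <- do.call("paste", c(as.list(keys), list(sep = "|")))
    result <- NULL
    for (k in sort(unique(key))) {
        ## Each block is a plain data frame holding the matching rows.
        block <- data[key == k, , drop = FALSE]
        result <- rbind(result, FUN(block))
    }
    result
}

## Made-up illustrative data; not taken from fuel.frame.
toy <- data.frame(Type = c("Small", "Large", "Small", "Large", "Van"),
                  Weight = c(2560, 3800, 2345, 3500, 3700))

out <- by.group.sketch(toy, "Type",
    function(df)
        data.frame(Type = df[1, "Type"],
            minWeight = min(df$Weight),
            blockSize = nrow(df)))
print(out)
```

Because every block is materialized as a data frame before FUN is called, the sketch makes the memory limitation visible: the largest block must fit in memory on its own.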
Each unique combination of values in the columns by.columns that appears in the data defines one data block. Normally, these columns contain strings or factors with a limited number of unique values, but this function will work with any column type. For example, if one of the columns in by.columns contains numeric data with different values for each row, then the input data will be divided into blocks with one row each.
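To make the block-counting rule concrete, the sketch below counts the distinct combinations of by.columns values in a small in-memory data frame. It is written in portable S syntax and does not call bd.by.group; count.blocks is a hypothetical helper and the data values are made up for illustration.

```r
## Hypothetical helper: number of blocks implied by a set of by.columns,
## that is, the number of distinct value combinations in those columns.
count.blocks <- function(data, by.columns)
{
    keys <- data[, by.columns, drop = FALSE]
    key <- do.call("paste", c(as.list(keys), list(sep = "|")))
    length(unique(key))
}

## Made-up illustrative data.
toy <- data.frame(Type = c("Small", "Small", "Large", "Large"),
                  Doors = c(2, 4, 4, 4),
                  Weight = c(2345.5, 2560.1, 3500.2, 3800.7))

n.type <- count.blocks(toy, "Type")                  ## two distinct types
n.pair <- count.blocks(toy, c("Type", "Doors"))      ## three distinct pairs
n.weight <- count.blocks(toy, "Weight")              ## every row differs
```

Here n.weight equals nrow(toy), which is the one-row-per-block case described above for a numeric column whose values all differ.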
If sort is TRUE, then the input data is first sorted by the columns in by.columns, so each of the blocks is guaranteed to have a unique combination of values for these columns.
If sort is FALSE, then the input data is not sorted, and the blocks are determined by scanning through the rows in order. When any of the by.columns values changes, this signals the beginning of another block.
Specifying sort as FALSE is normally used when the data is already sorted, to avoid an unnecessary sort.
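The difference between the two settings of sort can be illustrated by labeling consecutive runs of equal values, which is how an in-order scan would delimit blocks. This is a sketch in portable S syntax under the assumption of a single by column; run.id is a hypothetical helper and the values are made up.

```r
## Hypothetical helper: give each row the number of the block it would
## fall into when blocks start wherever the value changes.
run.id <- function(x)
{
    x <- as.character(x)
    n <- length(x)
    if (n < 2) return(rep(1, n))
    ## A new block starts at row 1 and at every row that differs
    ## from the row before it.
    cumsum(c(1, x[-1] != x[-n]))
}

## Made-up unsorted values: "Small" appears in two separate runs.
type <- c("Small", "Small", "Large", "Small", "Van", "Van")

ids <- run.id(type)
blocks.unsorted <- length(unique(ids))     ## blocks found by scanning
blocks.sorted <- length(unique(type))      ## blocks after sorting first
```

With unsorted input the scan produces more blocks than there are distinct values, because the same value can start a new block each time it reappears. When the data is already sorted the two counts agree, which is why skipping the sort is safe in that case.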
## Divide fuel.frame into blocks defined by the Type column,
## and for each block compute minWeight (the minimum value
## of the Weight column) and blockSize (the number of rows
## in the block).
bd.by.group(fuel.frame, "Type",
    function(df)
        data.frame(Type=df[1,"Type"],
            minWeight=min(df$Weight),
            blockSize=nrow(df)))

## Divide fuel.frame into blocks defined by the Type column,
## and print each of these block data frames.
## Returns NULL.
bd.by.group(fuel.frame, "Type",
    function(df) print(df),
    output=F)

## Divide fuel.frame into blocks defined by the Type column,
## and print each of these block data frames, ignoring some
## types specified via the args argument.
## Returns NULL.
bd.by.group(fuel.frame, "Type",
    function(df, ignore.types)
        if (!is.element(df$Type[1], ignore.types)) print(df),
    output=F,
    args=list(ignore.types=c("Compact","Large")))