Divide Data into Blocks

DESCRIPTION:

Divide a dataset into multiple data blocks, and return a list of these data blocks.

This function requires the bigdata library section to be loaded.

USAGE:

bd.split.by.group(data, by.columns, sort=T,
                   bigdata=is(data,"bdFrame"))

REQUIRED ARGUMENTS:

data
input data set: a bdFrame or data.frame.
by.columns
names or numbers of columns defining how the input data is divided into blocks.

OPTIONAL ARGUMENTS:

sort
if TRUE (the default), the input data is sorted by by.columns before it is divided into blocks. If FALSE, the data is not sorted first.
bigdata
if TRUE, returns a list of bdFrame objects. If FALSE, returns a list of data.frame objects. The default uses the type of data to determine which type of objects to return.

VALUE:

A list with one element for every data block in the data, as defined by by.columns. If the argument bigdata is TRUE, the list elements are bdFrame objects; otherwise, they are data.frame objects. The element names of the returned list are constructed from the by.columns values of each block.

DETAILS:

This function divides the input data into blocks defined by the columns by.columns, and returns a list of all of these blocks.

If bigdata is FALSE, the output list elements will be data.frame objects. In this case, an error occurs if the data as a whole is too large to fit in memory.

Each unique combination of values in the columns by.columns that appears in the data defines one data block. Normally, these columns contain strings or factors with a limited number of unique values, but this function works with any column type. For example, if one of the columns in by.columns contains numeric data with a different value in each row, then the input data is divided into blocks of one row each.

If sort is TRUE, the input data is first sorted by the columns in by.columns, so each unique combination of by.columns values appears in exactly one block. If sort is FALSE, the input data is not sorted, and the blocks are determined by scanning through the rows in order: whenever any of the by.columns values changes, a new block begins. In this case, rows with the same by.columns values that are not adjacent are placed in separate blocks. Specify sort as FALSE when the data is already sorted, to avoid an unnecessary sort.

EXAMPLES:

## Divide the data into a list of blocks,
## divided by the values of "Type"
bd.split.by.group(fuel.frame, "Type")
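
## The following lines are illustrative sketches, not part of
## the original examples. They assume fuel.frame is available
## as above; the names blocks and sorted.fuel are hypothetical.

## Keep the result and inspect it: one list element per block,
## named from the "Type" values
blocks <- bd.split.by.group(fuel.frame, "Type")
length(blocks)
names(blocks)

## Number of rows in each block
sapply(blocks, nrow)

## If the data is already sorted by the grouping column,
## skip the internal sort with sort=F
sorted.fuel <- fuel.frame[order(as.character(fuel.frame$Type)), ]
bd.split.by.group(sorted.fuel, "Type", sort=F)

## Request bdFrame list elements even though the input
## is a data.frame
bd.split.by.group(fuel.frame, "Type", bigdata=T)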