Big Data Processing Options

DESCRIPTION:

Sets or returns the options that control the processing of big data objects.

This function requires the bigdata library section to be loaded.

USAGE:

bd.options(...)

OPTIONAL ARGUMENTS:

...
A list or a vector of character strings may be given as the only argument, or any number of arguments may be given in name=value form. The function may also be called with no arguments at all.

VALUE:

If no arguments are given, this function returns a list of the current values of all options. If a character vector is given as the only argument, it returns a list of the current values of the options named in the character vector. If an object of mode "list" is given as the only argument, its components become the values of the options with the corresponding names, and S-PLUS returns a list of the option values before they were modified. Generally, the list given as an argument is the return value of a previous call to this function. If arguments are given in name=value form, S-PLUS changes the values of the specified options and returns a list of the option values before they were modified.

SIDE EFFECTS:

When options are set, this function changes a list named bd.options.list in the session frame (frame 0). The components of bd.options.list are all of the currently defined options. If this function is called with either a list as the single argument or with one or more arguments in name=value form, the specified options are changed or, if they do not yet exist, created.

DETAILS:

Several options are currently defined:

"print.bdFrame.rows": The maximum number of bdFrame rows to display when printing. The default value is 5.

"print.bdFrame.columns": The maximum number of bdFrame columns to display when printing. The default value is 20.

"print.bdVector.elements": The maximum number of bdVector elements to display when printing. The default value is 30.

"block.size": The maximum number of rows to process at a time when executing big data operations. The default value is 1e9. The actual number of rows processed is this value, adjusted downwards if necessary to fit within the limit specified by the option "max.block.mb".

"max.block.mb": The maximum number of megabytes used for block-processing buffers. If the specified block size would require more space than this, the number of rows is reduced so that the entire buffer is smaller than this size. This prevents unexpected out-of-memory errors when processing wide data with many columns. The default value is 10. The maximum supported value for max.block.mb is 2147, specifying a 2GB buffer size. If bd.options is called to set a larger value, it generates a warning and sets the value to 2147.

"max.convert.bytes": The maximum size (in bytes) of the big data cache that can be converted to a data.frame. The default value is 1e7. This can be set to prevent conversions that would exceed the S-PLUS memory.

"string.column.width": The default maximum number of characters in string columns, if this cannot be determined automatically. The default value is 32.

"max.levels": The maximum number of levels that can appear in factor columns. The default value is 500. This cannot be set larger than 65535.

"error.on.level.overflow": If T, factor level overflow while writing a factor column generates an error as soon as it occurs. Whether or not this option is set, factor level overflow generates a warning. Set this option to avoid processing a large data set only to discover at the end that some of the data has been altered. The default value is F.

"error.on.string.truncation": If T, string truncation while writing strings to a character column generates an error as soon as it occurs. Whether or not this option is set, string truncation generates a warning. Set this option to avoid processing a large data set only to discover at the end that some of the data has been altered. The default value is F.

"trace": Enables extra tracing of big data operations. This was useful when developing the big data library, but may also be helpful when examining the performance of user functions built on the big data library. If this option is the string "tally", the incremental bd.tally value is printed for every low-level big data node operation. When examining the performance of a complex expression, this can be used to pinpoint which big data operation is using the most time. If this option is the string "browser" or "browser.before", the function browser() is called before executing each low-level big data node operation, so the user can examine the program state at that point. The strings "browser.after" and "browser.both" specify that browser() should be called after each node operation, or both before and after it, respectively.

EXAMPLES:

## set block size.
bd.options(block.size=5000)
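
## the examples below are sketches based only on the behavior
## described in the VALUE and DETAILS sections above; the
## specific values chosen are illustrative, not recommendations.

## return the current values of two options by name.
bd.options(c("block.size", "max.block.mb"))

## change several options at once, saving the previous values
## (setting options returns the values before they were modified).
old <- bd.options(block.size=5000, max.block.mb=50)

## ... run big data operations here ...

## restore the previous values by passing the saved list back.
bd.options(old)

## display more rows when printing a bdFrame.
bd.options(print.bdFrame.rows=20)

## generate an error as soon as a string is truncated or a
## factor column overflows, rather than only a warning.
bd.options(error.on.string.truncation=T, error.on.level.overflow=T)

## print the incremental bd.tally value for each low-level node
## operation while examining performance, then restore the
## previous trace setting.
old.trace <- bd.options(trace="tally")

## ... run the expression being examined here ...

bd.options(old.trace)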