This function requires the bigdata library section to be loaded.
bd.block.apply(data, FUN, args=NULL, num.outputs=1, test=F, one.block=F, sample=F, sample.size=10000, seed=NULL)
data
A bdFrame or data.frame, or a list of them to specify multiple inputs.
FUN
An S-PLUS function that is called to process a data frame. This function itself cannot perform any big data operations, or an error is generated.
args
An S-PLUS object that is passed unchanged to the script as IM$args every time the script is executed.
num.outputs
The number of outputs of the script.
test
If TRUE, a test execution is performed to determine properties of the code.
one.block
If TRUE, read all input data as one block. This argument is only used if test is FALSE.
sample
If TRUE, perform simple random sampling on each input data set to produce a single data block with no more than sample.size rows. This argument is only used if test is FALSE and one.block is TRUE.
sample.size
The maximum number of rows in the sample, if sample is TRUE.
seed
If NULL, a new random seed is used for sampling every time. If an integer, it is used as the seed.
Returns a list of output objects, or a single object if num.outputs is 1. This list will contain data.frame objects if all of the inputs in the data argument are data.frame objects; otherwise it will return a list of bdFrame objects.
This function performs transformations on a data set by specifying a script of S-PLUS commands. These scripts can be used to specify very complex processing over large datasets, processing blocks in different orders, or processing datasets in multiple passes. This has been designed so that simple transformations are easy to specify, while allowing access to more complex features when needed.
This function is commonly used to transform an input data set. For
example, suppose the input data contains a column named
ABC
and you want to perform the following value
replacement: if a value in
ABC
is less than 10.0, replace
it with the fixed value 10.0. This could be done using the following
simple script:
    IM$in1$ABC[IM$in1$ABC<10] <- 10.0
    IM$in1
This script is executed to process each data block in the input data.
Each time it is executed, the variable
IM$in1
contains a
data frame with the values of the input block. The first line finds
the rows where
ABC
is less than 10, and replaces them
with 10.0. The second line returns the updated value of IM$in1 as the
output from the script. Many simple transformations can be done in
this manner.
Here is another example script, where the input columns are copied to
the output, along with several new columns. The column
ABC
is used to create the two new columns
TIMES.TWO
and
PLUS.ONE
:
    x <- IM$in1$ABC
    data.frame(IM$in1, data.frame(TIMES.TWO=x*2,PLUS.ONE=x+1))
The body of your S-PLUS script implicitly defines an S-PLUS function
with a single argument
IM
. The
IM
argument
is a list with several named elements that can be accessed within the
script; these elements map to the data inputs, among other functions.
For example,
IM$in1
contains the data from the first
input to the script,
IM$in2
contains the data from the
second input etc. The final value in your script is the return value
of the function. You can use the S-PLUS function
return
to return a value from the middle of the script.
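Because a script is just a function of IM, it can be tried on a hand-built list before it is passed to bd.block.apply. The following is a minimal sketch in plain S, not taken from the library: the names clip.abc and IM.test are hypothetical, and the stand-in list holds only the in1 element that this script reads.

```r
# Script body wrapped as a function of IM, the form FUN takes.
clip.abc <- function(IM)
{
    # Use return() to leave the script early when the block is empty.
    if (nrow(IM$in1) == 0) return(IM$in1)
    # Replace values of ABC below 10 with 10.
    IM$in1$ABC[IM$in1$ABC < 10] <- 10.0
    # The final value is the return value of the script.
    IM$in1
}

# Hypothetical stand-in for the IM list of a single block.
IM.test <- list(in1 = data.frame(ABC = c(3, 12, 9.5, 40)))
block.out <- clip.abc(IM.test)
print(block.out$ABC)
```

Once the function behaves as intended on small stand-in blocks, it can be supplied as the FUN argument.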
The
bdFrame
functions are designed to handle
a very large number of rows by using block updating algorithms. Some
S-PLUS functions can easily be applied to blocks of data, such as
row-wise transforms. Other functions need to have all of the data at
once, such as most of the built-in modeling functions. For functions
requiring all of the data at once, the number of rows that can be
handled will be limited by the amount of memory on the machine.
Often, it is acceptable to work with a large subset of the data rather
than all of the rows.
If the argument
one.block
is
TRUE
and
sample
is
FALSE
, all of the data for
each input is passed to the script in one block. This may cause an
error if the data is larger than the available memory.
If the argument
one.block
is
TRUE
and
sample
is
TRUE
, a sample of the data is
taken and passed to the script as one block. If the input contains
more rows than
sample.size
, simple random sampling will
be used to reduce the number of rows.
If the argument
one.block
is
FALSE
, this
function processes large data sets by dividing the data set into
multiple data blocks and executing the script to handle each data
block. The inputs and outputs of the script could be considered as
data streams. At any one point, only a small section of the input
data is available. Every time a block of input data is available, it
is converted into an S-PLUS data frame, and the script is executed to
process it. The size of the blocks is controlled by the
block.size
option of
bd.options
.
This leads to a different style of programming than most S-PLUS programmers are used to. Rather than gathering all of the data in one data structure and then processing it, the script must process the data in pieces. Some operations may require scanning through the input data multiple times, using facilities described below. While it may be necessary to reorganize existing S-PLUS code, the advantage is that it is possible to process very large data sets.
The following is a list of the named list elements passed into the
script function in the list
IM
:
in1
A data frame or
bdFrame
with the
data from the first script input.
in1.pos
The number of the first row in
in1
, counting from the beginning of the input stream.
For example, if the data was being processed 1000 rows at a time, this
would be 1, 1001, 2001, etc., as the script was called multiple times.
This can be used to create a new
row.number
column, with
a script like:
    row.nums <- seq(1,len=nrow(IM$in1))+IM$in1.pos-1
    data.frame(ROW.NUM=row.nums, IM$in1)
This value can also be used to trigger a computation to occur at the beginning of the data scan. For example, the following simple script prints out the column names of the input when processing the first block:
    if (IM$in1.pos==1) print(names(IM$in1))
    IM$in1
While it is possible to specify the starting positions using
in1.release
or
in1.pos
(and similarly
for
in2
), it is not possible to specify the number
of rows to bring in for each input.
in1.last
This is
TRUE
if the current
input data block is the last one from the input stream. Note that the
last input block may have zero rows. This value can be used to
trigger a computation to be performed at the end of the data scan,
such as:
    if (IM$in1.pos==1) cat("first block\n")
    if (IM$in1.last) cat("last block\n")
    IM$in1
If the function argument
one.block
is
TRUE
,
the value of
IM$in1.pos
will always be equal to 1, and
IM$in1.last
will always equal
TRUE
.
in1.total.rows
This is the total number of rows in the
input data stream, if it is known. If it is not known, it is
-1
. This is generally only known after the data has been
scanned once, but it is possible to request that it be available on
the first pass by specifying the
in1.requirements
output
value described below.
in2
,
in2.pos
, ...
If the script has more
than one input, elements
in2
,
in2.pos
,
etc. contain the values for the second input, elements
in3
,
in3.pos
, etc. contain the values for
the third input, and so on. If
in1
and
in2
have the same number of rows, matching rows are always passed to
FUN
.
If the inputs have different numbers of rows, then matching rows are passed
for both
in1
and
in2
until the shorter input
runs out of rows, then the additional rows for the longer input are passed,
while the shorter input shows zero rows.
num.inputs
,
num.outputs
These values give
the number of inputs and outputs of the script. The number of inputs
is specified by the number of elements in the
data
argument to the
bd.block.apply
function. The number of
rows per block does not vary by input; rather, it is the number
specified by the
block.size
option or the
max.block.mb
option of
bd.options
. By default, the first block starts
with row 1 for all inputs. The starting row can be set to a different
value using
in1.pos
,
in2.pos
, and so on.
The number of outputs is specified by the
num.outputs
argument.
max.rows
This gives the maximum number of rows
possible for any of the input data frames (unless the argument
one.block
is
TRUE
).
args
The S-PLUS object passed to
bd.block.apply
as its
args
argument. The
same value is passed every time the script function is executed.
temp
This can be used to maintain state between
different executions of the script. The first time the script is
executed, this has a value of
NULL
. If the
temp
output element is set to an S-PLUS object (as
described below), this object is available as the value of the
temp
element the next time the script is executed. For
example, here is a script that computes and outputs the running sums
of the
column
ABC:
    if (is.null(IM$temp)) { IM$temp <- 0 }
    current.sum <- sum(IM$in1$ABC)+IM$temp
    out1 <- data.frame(ABC.SUM=current.sum)
    list(out1=out1, temp=current.sum)
For each input block, it outputs one row containing the cumulative
total for all of the
ABC
values so far. The
temp
value is used to track the cumulative total so far.
Note how the test "
if (is.null(IM$temp))
" is used to
initialize the
temp
value.
test
This is
TRUE
if the script is being
executed on dummy data before being applied to the real input data.
This is only done if the argument
test
is
TRUE
. This test execution allows the script to specify
information such as the
in1.requirements
output value
that determines how the script should be executed on the real data.
For example, the following script uses the test execution to specify
that meta-data about
IM$in1
should be passed in as
elements of
IM
(by including
"meta.data"
in
in1.requirements
), and that the script should be able to
randomly access the input data (by including
"random.access"
):
    if (IM$test) return(list(in1.requirements=c("meta.data","random.access")))
    IM$in1[IM$in1$ABC>=10, , drop=F]
in1.column.string.widths
The value of this list
element is a named vector of integers. The length of this vector is
the same as the number of columns in the element
in1
, and
the element names are the column names. Each integer is the string
width of the input column (for string columns) or
NA
(for
other columns).
in1.column.min
The value of this list element is a
named vector of doubles giving the minimum value for each column in
the whole input data set. The vector values are
NA
for
non-continuous columns. (This element is only present if the
in1.requirements
output element contains
"meta.data"
, described below).
in1.column.max
The value of this list element is a
named vector of doubles giving the maximum value for each column in
the whole input data set. The vector values are
NA
for
non-continuous columns. (This element is only present if the
in1.requirements
output element contains
"meta.data"
, described below).
in1.column.mean
The value of this list element is a
named vector of doubles giving the mean value for each column in the
whole input data set. The vector values are
NA
for
non-continuous columns. (This element is only present if the
in1.requirements
output element contains
"meta.data"
, described below).
in1.column.stdev
The value of this list element is a
named vector of doubles giving the standard deviation for each column
in the whole input data set. The vector values are
NA
for non-continuous columns. (This element is only present if the
in1.requirements
output element contains
"meta.data"
, described below).
in1.column.count.missing
The value of this list
element is a named vector of doubles giving the number of missing
values for each column in the whole input data set. (This element is
only present if the
in1.requirements
output element
contains
"meta.data"
, described below).
in1.column.level.counts
The value of this list element
is a list giving the number of times each categorical level appears in
each categorical column in the whole input data set. The length of
this list is the same as the number of columns in the input
IM$in1
, and the list element names are the column names.
For each categorical column, the corresponding element in this list is
a named vector of level counts. The names are the level names, and
the values are the counts for each of the categorical levels. For
non-categorical columns, the corresponding element of this list is
NULL
. (This element is only present if the
in1.requirements
output element contains
"level.counts"
, described below).
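As an illustration of the meta-data elements above, the following sketch subtracts the whole-data mean of a column from each block. It is written in plain S under stated assumptions: center.abc and IM.test are hypothetical names, the column is assumed to be called ABC, and the stand-in list supplies in1.column.mean by hand, where a real run would have it filled in after the script requests "meta.data" during the test pass.

```r
center.abc <- function(IM)
{
    # Test pass: request whole-data statistics for the first input.
    if (IM$test)
        return(list(out1 = IM$in1, in1.requirements = "meta.data"))
    # in1.column.mean is a named vector; pick the entry for column ABC.
    abc.mean <- as.numeric(IM$in1.column.mean["ABC"])
    IM$in1$ABC <- IM$in1$ABC - abc.mean
    IM$in1
}

# Hypothetical stand-in for one block. The mean describes the whole
# input data set, so it need not equal the mean of this block.
IM.test <- list(test = F,
                in1 = data.frame(ABC = c(8, 10, 12)),
                in1.column.mean = c(ABC = 20))
centered <- center.abc(IM.test)
print(centered$ABC)
```

Since the requirement is only read during the test execution, a call using this script would presumably need test=T.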
An S-PLUS script can output one of two things: a data frame or a list.
In most of the example scripts provided so far, the script returns a
data frame, which is output as the first output of the script.
Returning the data frame
df
is exactly the same as
returning
list(out1=df)
. If the script returns a list,
it may contain any of the list element names described below.
The following list elements can be used to send output data to multiple outputs and control the continued processing of the script:
out1
This value should be a data frame specifying the
data to be output from the first output. If the value is specified as
NULL
, this means that no rows are output at this time,
the same as if a data frame with zero rows was specified.
out2, out3, ...
If the script has more than one output,
element
out2
contains the value for the second output,
element
out3
contains the value for the third output, and
so on.
in1.release, in1.release.all, in1.pos
These values
determine which data is read from the first input the next time the
script is called. If none of these are specified, then the default is
to read in the next data block following the current one. Only one of
in1.release
,
in1.release.all
, and
in1.pos
should be specified at once. If the function
argument
one.block
is
TRUE
, all of these
values are ignored: the entire input data set is always available as
IM$in1
.
in1.release
This value is used to release fewer than
the full number of input rows from the current data block. This can
be used to process a sliding window on the input data. For example,
assuming a block size of 1000 rows, the following script produces the
sum of column
ABC
for rows 1:1000, then 101:1100,
201:1200, etc. The call to the function
min
handles the
case where the input data frame doesn't have as many rows as expected,
which may occur at the end of the data.
    list(out1=data.frame(POS=IM$in1.pos, LEN=nrow(IM$in1),
                         WINDOW.SUM=sum(IM$in1$ABC)),
         in1.release=min(100,nrow(IM$in1)))
in1.release.all
Setting
in1.release.all=T
specifies that the script is done with the input data. If the script
continues processing data from other inputs,
in1
always
has zero rows. This is much more efficient than just reading and
ignoring the rest of the data. For example, the following script
outputs the first 10 rows from a data stream:
list(out1=IM$in1[1:10, , drop=F], in1.release.all=T)
in1.pos
This value is used to reposition the input
data stream for the next read. Specifying
in1.pos=1
repositions it to the beginning. Specifying another value allows
random access within the input data stream. It can be moved ahead to
skip values, or backwards. This is very powerful, but it can be
rather tricky to use. It is helpful to use the
temp
value to keep track of what you are doing. For example, the following
is a simple script that outputs two copies of its input data. During
the first pass, the temp value is set to the string
"first.pass"
. When processing the last block during this
pass,
in1.pos
is set to 1, and the
temp
value is set to
"second.pass"
. The value
in1.requirements
(described below) guarantees that
setting
in1.pos
to 1 works.
    if (IM$test)
        return(list(out1=IM$in1, in1.requirements="multi.pass"))
    if (is.null(IM$temp)) IM$temp <- "first.pass"
    if (IM$in1.last && IM$temp=="first.pass")
        return(list(out1=IM$in1, in1.pos=1, temp="second.pass"))
    list(out1=IM$in1, temp=IM$temp)
in2.release, in2.release.all, in2.pos, ... inN.release, inN.release.all, inN.pos
The corresponding list elements for controlling the second through
Nth inputs.
temp
If this is specified, its value is an S-PLUS
object that is passed as the input
temp
value the next
time the script is executed. If it is not specified, it is the same
as specifying
temp=NULL
.
done
This is used to specify whether the script is
done executing. If this is not given, the script is finished if all
of its inputs have been totally consumed, and none of the
in1.pos
,
in2.pos
, etc. output values are
specified. This should be used with caution; for example, the
following simple script never completes processing:
    # warning: this script will never finish!!
    list(done=F)
error
If this is specified, it should be a vector of
strings. Each of these strings is printed as an error message. If
any errors are specified, the script stops executing. For example,
the following script prints an error and stops if more than 2000 rows
are processed:
    if (IM$in1.pos>2000) return(list(error="too many rows"))
    return(list(out1=NULL))
warning
If this is specified, it should be a vector of
strings. Each of these strings is printed as a warning message.
Printing warnings does not stop the script from executing.
in1.requirements
This list element (and
in2.requirements
, etc.) is only read when
IM$test=T
. If this is set, it should be a vector of
strings specifying input requirements for the corresponding input. Each
of the script's inputs can specify different requirements.
By specifying input requirements during the IM$test=T test, a script
can tell the execution engine to guarantee that the input data stream
has certain features. If these requirements are not specified, these
features may or may not be available in certain situations, but it is
safer to specify them. For example, if one is going to set
in1.pos=1
to reset the input data stream to the
beginning, it is good practice to set the
"multi.pass"
input requirement.
The possible strings that may appear in the
in1.requirements
string vector include the following:
"multi.pass"
If specified, the input block can have its position reset to the beginning with in1.pos=1. If it is not specified, resetting it may cause an error.
"random.access"
If specified, the input block can have its position reset to any position with in1.pos=NEWPOS. If it is not specified, resetting the block position may cause an error. Note that it is always possible to set in1.pos forward to skip ahead rows.
"total.rows"
If specified, the IM$in1.total.rows input variable contains the correct total number of rows the first time that the script is executed. Otherwise, it may default to -1 until the last row is processed.
"factor.levels"
"meta.data"
If specified, the IM list contains certain meta-data (min, max, mean, etc.) about the input data set, in the input list elements in1.column.min, in1.column.max, etc.
"level.counts"
If specified, the IM list contains information about the categorical level counts, in the input list element in1.column.level.counts.
out1.column.string.widths
This element is only read
during the test pass through the dummy data (i.e., when
IM$test=T
), or when outputting the first
non-
NULL
data frame. If this is given, it should be a
vector of integers whose length is the same size as the number of
columns for the output list element
out1
. Each vector
element should be the desired output string width for the
corresponding output string column.
out.object
This element is used to return an arbitrary
S-PLUS object from
bd.block.apply
. If this element is
given, the element object is saved, and returned as the
"out.object"
attribute of the
bd.block.apply
result. Only one
"out.object"
object is saved: if this
element is given in the output list for two blocks, the latter object
is used instead of the former one. Thus, this element is normally
only specified when processing the last block.
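The output list elements described above can be combined in one return value. The following is a sketch in plain S, assuming a column named ABC: it sends rows with ABC below 10 to the first output and all other rows to the second, and adds a warning element when missing values are seen. The names split.abc and IM.test are hypothetical, and the stand-in list replaces the IM list that a real run would supply.

```r
split.abc <- function(IM)
{
    x <- IM$in1$ABC
    missing.abc <- is.na(x)
    # Rows with a known value below 10; missing values count as "not low".
    low <- !missing.abc & x < 10
    out <- list(out1 = IM$in1[low, , drop = F],
                out2 = IM$in1[!low, , drop = F])
    # Only add the warning element when there is something to report.
    if (any(missing.abc))
        out$warning <- "ABC contains missing values; those rows go to out2"
    out
}

# Hypothetical stand-in for one block of input.
IM.test <- list(in1 = data.frame(ABC = c(3, NA, 25, 7)))
res <- split.abc(IM.test)
print(res$out1)
print(res$out2)
print(res$warning)
```

A script with two outputs like this one would be run with num.outputs=2.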
Debugging S-PLUS scripts can be difficult. It is strongly suggested that new scripts be tested and thoroughly debugged on small test sets, before turning them loose on large data sets. In particular, be very careful with scripts that scan through the data multiple times, since it is easy to write code such that it never stops executing.
The script can contain calls to the S-PLUS cat and print functions.
For example, the following script copies its input to its output while
printing the position and number of rows of each block. This is very
useful, particularly when debugging scripts that set
in1.pos
to skip around the input data stream.
    row.nums <- seq(1,len=nrow(IM$in1))+IM$in1.pos-1
    cat("pos=",IM$in1.pos,"nrow=",nrow(IM$in1),"\n")
    IM$in1
It is also possible to copy intermediate values to permanent S-PLUS
variables, using the
assign
function, as in the following
script:
    assign("in1.sav", IM$in1, where=1, imm=T)
    IM$in1
If this script is not executing properly, the saved variable
in1.sav
can be accessed from S-PLUS.
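One way to debug on a small test set is to drive the script by hand. The sketch below is a hypothetical helper, run.blocks, which is not part of the bigdata library. It feeds an ordinary data frame to a script function in consecutive blocks and carries the temp value from one call to the next. It is a simplification written in plain S: it builds only the in1, in1.pos, in1.last, temp, args, and test elements, and it ignores in1.release, in1.pos repositioning, done, error, and the test pass.

```r
# Hypothetical test harness; only emulates a single sequential scan.
run.blocks <- function(FUN, df, block.size = 3, args = NULL)
{
    starts <- seq(1, nrow(df), by = block.size)
    temp <- NULL
    out <- NULL
    for (pos in starts) {
        end <- min(pos + block.size - 1, nrow(df))
        IM <- list(in1 = df[pos:end, , drop = F], in1.pos = pos,
                   in1.last = (end == nrow(df)), temp = temp,
                   args = args, test = F)
        res <- FUN(IM)
        # Returning a bare data frame is the same as list(out1 = df).
        if (is.data.frame(res)) res <- list(out1 = res)
        # An unspecified temp element is the same as temp = NULL.
        temp <- res$temp
        if (!is.null(res$out1))
            out <- if (is.null(out)) res$out1 else rbind(out, res$out1)
    }
    out
}

# The running-sum script from the description of the temp element.
running.sum <- function(IM)
{
    if (is.null(IM$temp)) IM$temp <- 0
    current.sum <- sum(IM$in1$ABC) + IM$temp
    list(out1 = data.frame(ABC.SUM = current.sum), temp = current.sum)
}

test.df <- data.frame(ABC = 1:7)
result <- run.blocks(running.sum, test.df, block.size = 3)
print(result)   # one row per block; ABC.SUM is 6, 21, 28
```

The harness only approximates the calling sequence, so a script that passes here should still be checked with bd.block.apply on a small data set.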
#Example 1:
# Script: Producing a QQ-Plot for a Column Within Each Data Chunk
# Inputs: 1
# Outputs: 0
# Create a normal qq-plot of the first column for each chunk.
# Avoid creating plots during the test pass through the data
# by using the condition if(!IM$test).
# Include a reference line.
splus.code <- function(IM)
{
    if(!IM$test) {
        qqnorm(IM$in1[,1])
        qqline(IM$in1[,1])
    }
}
data <- fuel.frame
bd.block.apply(data, FUN = splus.code, num.outputs = 0)
#Example 2:
# Script: Fit and Use a Generalized Additive Model
# Inputs: 1
# Outputs: 1
# Fit, print, and plot a gam model.
# Include output columns for residuals and fitted values.
splus.code <- function(IM)
{
    if(IM$test) {
        out <- data.frame(IM$in1,
                          GAM.fit = rep(0, nrow(IM$in1)),
                          GAM.resid = rep(0, nrow(IM$in1)))
    } else {
        assign("temp.df", IM$in1, where = 1, immediate = T)
        form <- as.formula(paste(names(temp.df)[1], "~ ."))
        assign("temp.form", form, where = 1, immediate = T)
        fit <- gam(temp.form, data = temp.df)
        out <- data.frame(temp.df, GAM.fit = fitted(fit), GAM.resid = resid(fit))
        cat("\n\t**** GAM Model for Rows ", IM$in1.pos, " to ",
            IM$in1.pos + nrow(temp.df) - 1, " ****\n")
        print(summary(fit))
        cat("\n")
        java.graph()
        plot(fit)
        remove("temp.df", where=1)
        remove("temp.form", where=1)
    }
    list(out1 = out)
}
data <- kyphosis
bd.block.apply(data, FUN = splus.code)
#Example 3:
# Script: Replace Missing Values in First Column with Average of Other Columns
# Inputs: 1
# Outputs: 1
# Replace missing values with the average of two other columns
splus.code <- function(IM)
{
    inds <- is.na(IM$in1[,1])
    if(any(inds) > 0)
        IM$in1[inds, 1] <- (IM$in1[inds,2] + IM$in1[inds,3])/2
    list(out1 = IM$in1)
}
data <- kyphosis
bd.block.apply(data, FUN = splus.code)
#Example 4:
# Script: Use the "missing" library from a full version of S-PLUS.
# Inputs: 1
# Output: 0
splus.code <- function(IM)
{
    if(!IM$test) {
        if(IM$in1.pos == 1) {
            library(missing)
            java.graph()
        }
        plot(miss(IM$in1))
    }
}
data <- seriesData(djia)
bd.block.apply(data, FUN = splus.code, num.outputs = 0)
#Example 5:
# Script: Select columns specified via 'args' argument
# Inputs: 1
# Output: 1
splus.code <- function(IM)
{
    cols <- IM$args
    IM$in1[,cols,drop=F]
}
bd.block.apply(fuel.frame, splus.code, args=1:3)
#Example 6, using 'out.object' element:
# Script: Collect total of Weight column, return as out.object.
# Inputs: 1
# Output: 0
splus.code <- function(IM)
{
    total <- if (is.null(IM$temp)) 0 else IM$temp
    total <- total + sum(IM$in1$Weight)
    if (IM$in1.last) {
        # on last block. output total
        list(out.object=total)
    } else {
        # on other blocks, keep running total in temp
        list(temp=total)
    }
}
attr(bd.block.apply(fuel.frame, splus.code, num.outputs=0), "out.object")