Execute S-PLUS Script on Blocks

DESCRIPTION:

Execute an S-PLUS script on blocks of data, with options for reading multiple input datasets, generating multiple output datasets, and processing blocks in different orders.

This function requires the bigdata library section to be loaded.

USAGE:

bd.block.apply(data, FUN, args=NULL, num.outputs=1,
                 test=F, one.block=F, sample=F,
                 sample.size=10000, seed=NULL)

REQUIRED ARGUMENTS:

data
a single bdFrame or data.frame, or a list of them to specify multiple inputs.

OPTIONAL ARGUMENTS:

FUN
string giving S-PLUS code to execute, or a function object with one argument.

The FUN argument is an S-PLUS function that is called to process a data frame. This function cannot itself perform any big data operations; if it does, an error is generated.
num.outputs
number of output datasets to be produced.
test
if TRUE, a test execution is performed to determine properties of the code.
one.block
if TRUE, read all input data as one block. This argument is only used if test is FALSE.
args
an S-PLUS object used to pass parameters to the script. It is available within the script as IM$args.
sample
if TRUE, perform simple random sampling on each input dataset to produce a single data block with no more than sample.size rows. This argument is only used if test is FALSE and one.block is TRUE.
sample.size
number of rows in sample, if sample is TRUE.
seed
if NULL (the default), a new random seed is used for sampling every time. If an integer, it is used as the seed.

VALUE:

returns a list of the script output datasets, or a single dataset if num.outputs is 1. The outputs are data.frame objects if all of the inputs in the data argument are data.frame objects; otherwise, they are bdFrame objects.

DETAILS:

This function performs transformations on a data set by specifying a script of S-PLUS commands. These scripts can be used to specify very complex processing over large datasets, processing blocks in different orders, or processing datasets in multiple passes. This has been designed so that simple transformations are easy to specify, while allowing access to more complex features when needed.

This function is commonly used to transform an input data set. For example, suppose the input data contains a column named ABC and you want to perform the following value replacement: if a value in ABC is less than 10.0, replace it with the fixed value 10.0. This could be done using the following simple script:

   IM$in1$ABC[IM$in1$ABC<10] <- 10.0
   IM$in1

This script is executed to process each data block in the input data. Each time it is executed, the variable IM$in1 contains a data frame with the values of the input block. The first line finds the rows where ABC is less than 10, and replaces them with 10.0. The second line returns the updated value of IM$in1 as the output from the script. Many simple transformations can be done in this manner.
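
As a minimal sketch of how such a script might be run, the two lines above can be wrapped in a function of IM and passed as the FUN argument. The data frame my.data and the function name clamp.abc are made up for illustration:

   # hypothetical input data with a column named ABC
   my.data <- data.frame(ABC=c(5, 12, 8, 20))
   clamp.abc <- function(IM) {
     # replace values below 10 with 10.0, then return the block
     IM$in1$ABC[IM$in1$ABC<10] <- 10.0
     IM$in1
   }
   bd.block.apply(my.data, FUN=clamp.abc)

Because the input here is a data.frame, the result should also be a data.frame, as described in the VALUE section.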

Here is another example script, where the input columns are copied to the output, along with several new columns. The column ABC is used to create the two new columns TIMES.TWO and PLUS.ONE:

   x <- IM$in1$ABC
   data.frame(IM$in1, data.frame(TIMES.TWO=x*2,PLUS.ONE=x+1))

The body of your S-PLUS script implicitly defines an S-PLUS function with a single argument IM. The IM argument is a list with several named elements that can be accessed within the script; these elements provide the data inputs, among other things. For example, IM$in1 contains the data from the first input to the script, IM$in2 contains the data from the second input, and so on. The final value in your script is the return value of the function. You can use the S-PLUS function return to return a value from the middle of the script.

Processing Multiple Data Blocks:

The bdFrame functions are designed to handle a very large number of rows by using block updating algorithms. Some S-PLUS functions can easily be applied to blocks of data, such as row-wise transforms. Other functions need to have all of the data at once, such as most of the built-in modeling functions. For functions requiring all of the data at once, the number of rows that can be handled will be limited by the amount of memory on the machine. Often, it is acceptable to work with a large subset of the data rather than all of the rows.

If the argument one.block is TRUE and sample is FALSE, all of the data for each input is passed to the script in one block. This may cause an error if the data is larger than the available memory.

If the argument one.block is TRUE and sample is TRUE, a sample of the data is taken and passed to the script as one block. If the input contains more rows than sample.size, simple random sampling will be used to reduce the number of rows.

If the argument one.block is FALSE, this function processes large data sets by dividing the data set into multiple data blocks and executing the script to handle each data block. The inputs and outputs of the script can be considered data streams. At any one point, only a small section of the input data is available. Every time a block of input data is available, it is converted into an S-PLUS data frame, and the script is executed to process it. The size of the blocks is controlled by the block.size option of bd.options.
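
The following sketch illustrates the one.block and sample arguments described above, using a computation that needs all of its rows at once (centering a column on its mean). The function name center.code and the particular sample.size and seed values are arbitrary choices for illustration; fuel.frame is the example data set used in the EXAMPLES section:

   center.code <- function(IM) {
     x <- IM$in1$Weight
     # mean(x) is the mean of whatever rows are in the current block
     data.frame(IM$in1, WEIGHT.CENTERED=x-mean(x))
   }
   # pass the whole data set to the script as a single block
   bd.block.apply(fuel.frame, FUN=center.code, one.block=T)
   # pass a random sample of at most 30 rows; a fixed seed makes
   # the sample repeatable
   bd.block.apply(fuel.frame, FUN=center.code, one.block=T,
     sample=T, sample.size=30, seed=17)

If the same function were run with one.block=F on data spanning several blocks, each block would be centered on its own block mean rather than on the overall mean.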

This leads to a different style of programming than most S-PLUS programmers are used to. Rather than gathering all of the data in one data structure and then processing it, the script must process the data in pieces. Some operations may require scanning through the input data multiple times, using facilities described below. While it may be necessary to reorganize existing S-PLUS code, the advantage is that it is possible to process very large data sets.

Input List Elements:

The following is a list of the named list elements passed into the script function in the list IM:

in1
A data frame or bdFrame with the data from the first script input.

in1.pos
The number of the first row in in1, counting from the beginning of the input stream. For example, if the data was being processed 1000 rows at a time, this would be 1, 1001, 2001, etc., as the script was called multiple times. This can be used to create a new row.number column, with a script like:

   row.nums <- seq(1,len=nrow(IM$in1))+IM$in1.pos-1
   data.frame(ROW.NUM=row.nums, IM$in1)

This value can also be used to trigger a computation to occur at the beginning of the data scan. For example, the following simple script prints out the column names of the input when processing the first block:

   if (IM$in1.pos==1) print(names(IM$in1))
   IM$in1

While it is possible to specify the starting positions using in1.release or in1.pos (and similarly for in2), it is not possible to specify the number of rows to bring in for each input.

in1.last
This is TRUE if the current input data block is the last one from the input stream. Note that the last input block may have zero rows. This value can be used to trigger a computation to be performed at the end of the data scan, such as:

   if (IM$in1.pos==1) cat("first block\n")
   if (IM$in1.last) cat("last block\n")
   IM$in1

If the function argument one.block is TRUE, the value of IM$in1.pos will always be equal to 1, and IM$in1.last will always equal TRUE.

in1.total.rows
This is the total number of rows in the input data stream, if it is known. If it is not known, it is -1. This is generally only known after the data has been scanned once, but it is possible to request that it be available on the first pass by specifying the in1.requirements output value described below.

in2, in2.pos, ...
If the script has more than one input, elements in2, in2.pos, etc. contain the values for the second input, elements in3, in3.pos, etc. contain the values for the third input, and so on. If in1 and in2 have the same number of rows, matching rows are always passed to FUN. If the inputs have different numbers of rows, then matching rows are passed for both in1 and in2 until the shorter input runs out of rows; after that, the additional rows for the longer input are passed, while the shorter input shows zero rows.
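
As a rough sketch of a two-input script, the following passes a list of two data frames as the data argument and combines matching rows. The data frames, the column names A and B, and the function name add.code are invented for this illustration, and the script assumes that both inputs have the same number of rows:

   # two hypothetical inputs with the same number of rows
   first <- data.frame(A=1:5)
   second <- data.frame(B=11:15)
   add.code <- function(IM) {
     # IM$in1 and IM$in2 hold matching rows of the two inputs
     data.frame(A=IM$in1$A, B=IM$in2$B, SUM=IM$in1$A+IM$in2$B)
   }
   bd.block.apply(list(first, second), FUN=add.code)

If the inputs could differ in length, the script would also need to handle blocks in which one input has zero rows.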

num.inputs, num.outputs
These values give the number of inputs and outputs of the script. The number of inputs is specified by the number of elements in the data argument to the bd.block.apply function. The number of rows per block does not vary by input; rather, it is determined by the block.size option or the max.block.mb option of bd.options. By default, the first block starts with row 1 for all inputs. The starting row can be set to a different value using in1.pos, in2.pos, and so on. The number of outputs is specified by the num.outputs argument.

max.rows
This gives the maximum number of rows possible for any of the input data frames (unless the argument one.block is TRUE).

args
The S-PLUS object passed to bd.block.apply as its args argument. The same value is passed every time the script function is executed.

temp
This can be used to maintain state between different executions of the script. The first time the script is executed, this has a value of NULL. If the temp output element is set to an S-PLUS object (as described below), this object is available as the value of the temp element the next time the script is executed. For example, here is a script that computes and outputs the running sums of the column ABC:

   if (is.null(IM$temp)) { IM$temp <- 0 }
   current.sum <- sum(IM$in1$ABC)+IM$temp
   out1 <- data.frame(ABC.SUM=current.sum)
   list(out1=out1, temp=current.sum)

For each input block, it outputs one row containing the cumulative total for all of the ABC values so far. The temp value is used to track the cumulative total so far. Note how the test "if (is.null(IM$temp))" is used to initialize the temp value.

test
This is TRUE if the script is being executed on dummy data before being applied to the real input data. This is only done if the argument test is TRUE. This test execution allows the script to specify information such as the in1.requirements output value that determines how the script should be executed on the real data. For example, the following script uses the test execution to specify that meta-data about IM$in1 should be passed in as elements of IM (by including "meta.data" in in1.requirements), and that the script should be able to randomly access the input data (by including "random.access"):

   if (IM$test)
     return(list(in1.requirements=c("meta.data","random.access")))
   IM$in1[IM$in1$ABC>=10, , drop=F]

in1.column.string.widths
The value of this list element is a named vector of integers. The length of this vector is the same as the number of columns in the element in1, and the element names are the column names. Each integer is the string width of the input column (for string columns) or NA (for other columns).

in1.column.min
The value of this list element is a named vector of doubles giving the minimum value for each column in the whole input data set. The vector values are NA for non-continuous columns. (This element is only present if the in1.requirements output element contains "meta.data", described below).

in1.column.max
The value of this list element is a named vector of doubles giving the maximum value for each column in the whole input data set. The vector values are NA for non-continuous columns. (This element is only present if the in1.requirements output element contains "meta.data", described below).

in1.column.mean
The value of this list element is a named vector of doubles giving the mean value for each column in the whole input data set. The vector values are NA for non-continuous columns. (This element is only present if the in1.requirements output element contains "meta.data", described below).

in1.column.stdev
The value of this list element is a named vector of doubles giving the standard deviation for each column in the whole input data set. The vector values are NA for non-continuous columns. (This element is only present if the in1.requirements output element contains "meta.data", described below).

in1.column.count.missing
The value of this list element is a named vector of doubles giving the number of missing values for each column in the whole input data set. (This element is only present if the in1.requirements output element contains "meta.data", described below).

in1.column.level.counts
The value of this list element is a list giving the number of times each categorical level appears in each categorical column in the whole input data set. The length of this list is the same as the number of columns in the input IM$in1, and the list element names are the column names. For each categorical column, the corresponding element in this list is a named vector of level counts. The names are the level names, and the values are the counts for each of the categorical levels. For non-categorical columns, the corresponding element of this list is NULL. (This element is only present if the in1.requirements output element contains "level.counts", described below).
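
The meta-data elements above can be used for transformations that depend on whole-data statistics. The following is a hedged sketch that standardizes a column named ABC using in1.column.mean and in1.column.stdev. It assumes that test=T is needed so that the in1.requirements value is read, and that returning a placeholder out1 with the final column layout during the test pass is acceptable; the names my.data, scale.code, and ABC.Z are made up for illustration:

   scale.code <- function(IM) {
     if (IM$test)
       # test pass: request meta-data and show the output layout
       return(list(out1=data.frame(IM$in1,
           ABC.Z=rep(0, nrow(IM$in1))),
         in1.requirements="meta.data"))
     # real pass: statistics for the whole input data set
     m <- as.vector(IM$in1.column.mean["ABC"])
     s <- as.vector(IM$in1.column.stdev["ABC"])
     data.frame(IM$in1, ABC.Z=(IM$in1$ABC-m)/s)
   }
   my.data <- data.frame(ABC=c(5, 12, 8, 20))
   bd.block.apply(my.data, FUN=scale.code, test=T)

Because the mean and standard deviation describe the whole input rather than the current block, every block is scaled consistently.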

Output List Elements:

An S-PLUS script can output one of two things: a data frame or a list. In most of the example scripts provided so far, the script returns a data frame, which is output as the first output of the script. Returning the data frame df is exactly the same as returning list(out1=df). If the script returns a list, it may contain any of the list element names described below.

The following list elements can be used to send output data to multiple outputs and to control the continued processing of the script:

out1
This value should be a data frame specifying the data to be output from the first output. If the value is specified as NULL, this means that no rows are output at this time, the same as if a data frame with zero rows was specified.

out2, out3, ...
If the script has more than one output, element out2 contains the value for the second output, element out3 contains the value for the third output, and so on.
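
As an illustrative sketch of multiple outputs, the following script routes rows to one of two outputs depending on the value of a column named ABC. The data frame my.data and the function name split.code are hypothetical, and the script assumes that ABC contains no missing values:

   my.data <- data.frame(ABC=c(5, 12, 8, 20))
   split.code <- function(IM) {
     big <- IM$in1$ABC >= 10
     # rows with ABC >= 10 go to the first output, the rest
     # go to the second output
     list(out1=IM$in1[big, , drop=F],
       out2=IM$in1[!big, , drop=F])
   }
   result <- bd.block.apply(my.data, FUN=split.code, num.outputs=2)
   result[[1]]
   result[[2]]

With num.outputs=2, the value returned by bd.block.apply is a list of the two output datasets, so each one is extracted here by position.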

in1.release, in1.release.all, in1.pos
These values determine which data is read from the first input the next time the script is called. If none of these are specified, then the default is to read in the next data block following the current one. Only one of in1.release, in1.release.all, and in1.pos should be specified at once. If the function argument one.block is TRUE, all of these values are ignored: the entire input data set is always available as IM$in1.

in1.release
This value is used to release fewer than the full number of input rows from the current data block. This can be used to process a sliding window on the input data. For example, assuming a block size of 1000 rows, the following script produces the sum of column ABC for rows 1:1000, then 101:1100, 201:1200, etc. The call to the function min handles the case where the input data frame doesn't have as many rows as expected, which may occur at the end of the data.

   list(out1=data.frame(POS=IM$in1.pos, LEN=nrow(IM$in1),
     WINDOW.SUM=sum(IM$in1$ABC)),
     in1.release=min(100,nrow(IM$in1)))

in1.release.all
Setting in1.release.all=T specifies that the script is done with the input data. If the script continues processing data from other inputs, in1 always has zero rows. This is much more efficient than just reading and ignoring the rest of the data. For example, the following script outputs the first 10 rows from a data stream:

   list(out1=IM$in1[1:10, , drop=F], in1.release.all=T)

in1.pos
This value is used to reposition the input data stream for the next read. Specifying in1.pos=1 repositions it to the beginning. Specifying another value allows random access within the input data stream. It can be moved ahead to skip values, or backwards. This is very powerful, but it can be rather tricky to use. It is helpful to use the temp value to keep track of what you are doing. For example, the following is a simple script that outputs two copies of its input data. During the first pass, the temp value is set to the string "first.pass". When processing the last block during this pass, in1.pos is set to 1, and the temp value is set to "second.pass". The value in1.requirements (described below) guarantees that setting in1.pos to 1 works.

   if (IM$test)
     return(list(out1=IM$in1, in1.requirements="multi.pass"))
   if (is.null(IM$temp))
     IM$temp <- "first.pass"
   if (IM$in1.last && IM$temp=="first.pass")
     return(list(out1=IM$in1, in1.pos=1, temp="second.pass"))
   list(out1=IM$in1, temp=IM$temp)

in2.release, in2.release.all, in2.pos, ...inN.release, inN.release.all, inN.pos
The corresponding list elements for controlling the second through Nth inputs.

temp
If this is specified, its value is an S-PLUS object that is passed as the input temp value the next time the script is executed. If it is not specified, it is the same as specifying temp=NULL.

done
This is used to specify whether the script is done executing. If this is not given, the script is finished if all of its inputs have been totally consumed, and none of the in1.pos, in2.pos, etc. output values are specified. This should be used with caution; for example, the following simple script never completes processing:

   # warning: this script will never finish!!
   list(done=F)

error
If this is specified, it should be a vector of strings. Each of these strings is printed as an error message. If any errors are specified, the script stops executing. For example, the following script prints an error and stops if more than 2000 rows are processed:

   if (IM$in1.pos>2000)
     return(list(error="too many rows"))
   return(list(out1=NULL))

warning
If this is specified, it should be a vector of strings. Each of these strings is printed as a warning message. Printing warnings does not stop the script from executing.
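
For instance, the following sketch copies its input to its output and issues a warning for any block that contains negative values in a column named ABC. The column name and the message text are assumptions made for this illustration:

   # count negative values, ignoring missing values
   n.neg <- sum(IM$in1$ABC<0, na.rm=T)
   if (n.neg>0)
     return(list(out1=IM$in1,
       warning=paste(n.neg, "negative ABC values in block",
         "starting at row", IM$in1.pos)))
   IM$in1

Unlike the error element, this lets processing continue, so the output still contains every input row.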

in1.requirements
This list element (and in2.requirements, etc.) is only read when IM$test=T. If this is set, it should be a vector of strings specifying input requirements for the corresponding input. Each of the script inputs can specify different requirements.

By specifying input requirements during the IM$test=T test, a script can tell the execution engine to guarantee that the input data stream has certain features. If these requirements are not specified, these features may or may not be available in certain situations, but it is safer to specify them. For example, if one is going to set in1.pos=1 to reset the input data stream to the beginning, it is good practice to set the "multi.pass" input requirement.

The possible strings that may appear in the in1.requirements string vector include the following:

"multi.pass"
If specified, the input block can have its position reset to the beginning with in1.pos=1. If it is not specified, resetting it may cause an error.

"random.access"
If specified, the input block position can be reset to any position within the input data stream with in1.pos=NEWPOS. If it is not specified, resetting the block position may cause an error. Note that it is always possible to set in1.pos forward to skip ahead rows.

"total.rows"
If specified, the IM$in1.total.rows input variable contains the correct total number of rows the first time that the script is executed. Otherwise, it may default to -1 until the last row is processed.

"factor.levels"
If specified, any factor columns in the input data frame will contain all of the factor levels in the whole input data stream. Otherwise, the set of factor levels may increase as more blocks are read.

"meta.data"
If specified, the IM list contains certain meta-data (min, max, mean, etc.) about the input data set, in the input list elements in1.column.min, in1.column.max, etc.

"level.counts"
If specified, the IM list contains information about the categorical level counts, in the input list element in1.column.level.counts.

out1.column.string.widths
This element is only read during the test pass through the dummy data (i.e., when IM$test=T), or when outputting the first non-NULL data frame. If this is given, it should be a vector of integers whose length equals the number of columns in the output list element out1. Each vector element should be the desired output string width for the corresponding output string column.

out.object
This element is used to return an arbitrary S-PLUS object from bd.block.apply. If this element is given, the element object is saved, and returned as the "out.object" attribute of the bd.block.apply result. Only one "out.object" object is saved: if this element is given in the output list for two blocks, the latter object is used instead of the former one. Thus, this element is normally only specified when processing the last block.

Debugging Hints:

Debugging S-PLUS scripts can be difficult. It is strongly suggested that new scripts be tested and thoroughly debugged on small test sets, before turning them loose on large data sets. In particular, be very careful with scripts that scan through the data multiple times, since it is easy to write code such that it never stops executing.

The script can contain calls to the S-PLUS cat and print functions. For example, the following script copies its input to its output while printing the position and number of rows of each block. This is very useful, particularly when debugging scripts that set in1.pos to skip around the input data stream.

   cat("pos=",IM$in1.pos,"nrow=",nrow(IM$in1),"\n")
   IM$in1

It is also possible to copy intermediate values to permanent S-PLUS variables, using the assign function, as in the following script:

   assign("in1.sav", IM$in1, where=1, imm=T)
   IM$in1

If this script is not executing properly, the saved variable in1.sav can be accessed from S-PLUS.

EXAMPLES:

#Example 1:
# Script: Producing a QQ-Plot for a Column Within Each Data Chunk
# Inputs: 1
# Outputs: 0
# Create a normal qq-plot of the first column for each chunk.
# Avoid creating plots during the test pass through the data
#     by using the condition if(!IM$test).
# Include a reference line.
splus.code <- function(IM) {
  if(!IM$test) {
    qqnorm(IM$in1[,1])
    qqline(IM$in1[,1])
  }
}
data <- fuel.frame
bd.block.apply(data, FUN = splus.code, num.outputs = 0)

#Example 2:
# Script: Fit and Use a Generalized Additive Model
# Inputs: 1
# Outputs: 1
# Fit, print, and plot a gam model.
# Include output columns for residuals and fitted values.
splus.code <- function(IM) {
  if(IM$test) {
    out <- data.frame(IM$in1,
              GAM.fit = rep(0, nrow(IM$in1)),
              GAM.resid=rep(0, nrow(IM$in1)))
  } else {
    assign("temp.df", IM$in1, where = 1, immediate = T)
    form <- as.formula(paste(names(temp.df)[1], "~ ."))
    assign("temp.form", form, where = 1, immediate = T)
    fit <- gam(temp.form, data = temp.df)
    out <- data.frame(temp.df, GAM.fit = fitted(fit), GAM.resid = resid(fit))
    cat("\n\t**** GAM Model for Rows ", IM$in1.pos, " to ",
        IM$in1.pos + nrow(temp.df) - 1, " ****\n")
    print(summary(fit))
    cat("\n")
    java.graph()
    plot(fit)
    remove("temp.df", where=1)
    remove("temp.form", where=1)
  }
  list(out1 = out)
}
data <- kyphosis
bd.block.apply(data, FUN = splus.code)

#Example 3:
# Script: Replace Missing Values in First Column with Average of Other Columns
# Inputs: 1
# Outputs: 1
# Replace missing values with the average of two other columns
splus.code <- function(IM) {
  inds <- is.na(IM$in1[,1])
  if(any(inds))
    IM$in1[inds, 1] <- (IM$in1[inds,2] + IM$in1[inds,3])/2
  list(out1 = IM$in1)
}
data <- kyphosis
bd.block.apply(data, FUN = splus.code)

#Example 4:
# Script: Use the "missing" library from a full version of S-PLUS.
# Inputs: 1
# Outputs: 0
splus.code <- function(IM) {
  if(!IM$test) {
    if(IM$in1.pos == 1) {
      library(missing)
      java.graph()
    }
    plot(miss(IM$in1))
  }
}
data <- seriesData(djia)
bd.block.apply(data, FUN = splus.code, num.outputs = 0)

#Example 5:
# Script: Select columns specified via 'args' argument
# Inputs: 1
# Outputs: 1
splus.code <- function(IM){
     cols <- IM$args
     IM$in1[,cols,drop=F]
}
bd.block.apply(fuel.frame, splus.code, args=1:3)

#Example 6, using 'out.object' element:
# Script: Collect total of Weight column, return as out.object.
# Inputs: 1
# Outputs: 0
splus.code <- function(IM) {
     total <- if (is.null(IM$temp)) 0 else IM$temp
     total <- total + sum(IM$in1$Weight)
     if (IM$in1.last) {
       # on last block, output total
       list(out.object=total)
     } else {
       # on other blocks, keep running total in temp
       list(temp=total)
     }
}
attr(bd.block.apply(fuel.frame, splus.code, num.outputs=0), "out.object")