Bootstrap Prediction

DESCRIPTION:

Computes bootstrap estimates of prediction error: the function repeatedly fits a given model to bootstrap samples and uses the fitted models to estimate prediction error. These estimates are generally more accurate than the apparent error (the prediction error of the model fitted to, and evaluated on, the original data set), which tends to be optimistic. The bootstrapValidation function is generic; method functions can be written to handle specific classes of data. Classes which already have methods for this function include:
formula

USAGE:

bootstrapValidation(x, <<y or data>>, 
         modelFit, B, group = NULL, subject = NULL, 
         args.modelFit = NULL, 
         predFun = <<see below>>, args.predFun = NULL, 
         passOldData.predFun = F, 
         errFun = <<see below>>, args.errFun = NULL, 
         seed = .Random.seed,  
         label, 
         trace = resampleOptions()$trace, assign.frame1 = F, 
         save.indices = F, 
         save.group = <<see below>>, save.subject = <<see below>>, 
         save.errors = F) 
bootstrapValidation.default(x, y, 
         <<modelFit and subsequent arguments>>) 
bootstrapValidation.formula(x, data, 
         <<modelFit and subsequent arguments>>) 

REQUIRED ARGUMENTS:

x
For bootstrapValidation.default, a data frame or matrix containing the explanatory variables. For bootstrapValidation.formula, a formula object that specifies the model, with the response on the left of a ~ operator and the explanatory terms, separated by + operators, on the right.
y
the response variable.
data
data frame used to fit the model.
modelFit
function that fits the model under consideration.

For bootstrapValidation.formula: the function must accept a formula as its first argument, and have a data argument; e.g. modelFit(x, data=data).

For bootstrapValidation.default: this function must take arguments x and y, not necessarily in that order.

B
number of bootstrap samples used.

OPTIONAL ARGUMENTS:

group
vector of length equal to the number of observations in data, for stratified sampling or multiple-sample problems. Sampling is done separately for each group (determined by unique values of this vector). If data is a data frame, this may be a variable in the data frame, or an expression involving such variables.
subject
vector of length equal to the number of observations in data; if present then subjects (determined by unique values of this vector) are resampled rather than individual observations. If data is a data frame, this may be a variable in the data frame, or an expression involving such variables. If group is also present, subject must be nested within group (each subject must be in only one group).
args.modelFit
list of arguments to pass to modelFit when fitting the model.
predFun
function that returns predicted values for a given model object and new data values. The first two arguments to this function are the model object and new data (except see Details below). The default is a version of predict.
args.predFun
list of additional arguments to pass to predFun when calculating predicted values.
passOldData.predFun
logical flag indicating if the prediction algorithm refits the original model. If so, the training data must be passed to the prediction function; this is done with an assignment to frame 1.
errFun
function that computes a measure of error, based on actual values of the response variable and fitted values. The first two arguments to this function are the actual and fitted values (except see Details below). The default computes squared error.
args.errFun
list of arguments to pass to the function given in errFun when calculating the prediction error.
seed
seed for generating resampling indices. May be a legal random number seed or an integer between 0 and 1023 which is passed to set.seed.
label
character string; if supplied, it is used when printing.
trace
logical flag indicating whether the algorithm should print a message indicating which bootstrap sample is currently being processed. The default is set by resampleOptions.
assign.frame1
logical flag indicating whether the resampled data should be assigned to frame 1 before fitting the model. Try assign.frame1=T if all estimates are identical (this is slower).
save.indices
logical flag indicating whether to save the indices. See return component indices below.
save.group, save.subject
logical flags; if TRUE, the group and subject vectors, respectively, are saved in the returned object. Both default to TRUE if n<=10000, where n is the number of observations.
save.errors
logical flag; if TRUE, the matrix of errors is saved in the returned object.

VALUE:

an object of class bootstrapValidation, with the following components:
call
the call to bootstrapValidation, but with all the arguments explicitly named.
B
the number of bootstrap samples used.
apparent.error
the average prediction error of the original model used to predict the original data.
optimism
the average decrease in error due to overfitting. Each model built on a bootstrap sample is used to predict both that bootstrap sample and the original data; the optimism is the average difference between the two errors.
err632
the prediction error estimate as calculated by the .632 method.
err632plus
the prediction error estimate as calculated by the .632+ method.
seed.start
the initial value of the random seed for generating the resampling indices, in the same format as .Random.seed.
seed.end
the final value of the random seed, after the bootstrap samples have all been generated, in the same format as .Random.seed.
parent.frame
the frame of the caller of bootstrapValidation.
label
optionally, a label to be used for printing.
defaultLabel
a default label constructed from the call, may be used for printing.
group
optionally, the group vector.
subject
optionally, the subject vector.
indices
optionally, a matrix with n rows and B columns, indicating which observations were assigned to each bootstrap sample.
errors
optionally, a matrix with n rows and B columns, containing the errors (as measured by errFun) for each observation and bootstrap sample.
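
As a rough illustration of how the apparent.error, optimism, err632 and err632plus components relate to one another, the following Python sketch computes them for a simple least-squares line with squared error. It follows one common formulation of the .632 and .632+ estimators from the bootstrap literature; it is not the S-PLUS implementation, and whether bootstrapValidation computes exactly these quantities is an assumption. All names (fit_line, predict, sq_err, bootstrap_validation, err1) are hypothetical.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def predict(model, xs):
    a, b = model
    return [a + b * x for x in xs]

def sq_err(y, fitted):
    return (y - fitted) ** 2

def bootstrap_validation(xs, ys, B, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    # Apparent error: model fitted to, and evaluated on, the original data.
    full_pred = predict(fit_line(xs, ys), xs)
    apparent = mean(sq_err(y, p) for y, p in zip(ys, full_pred))

    optimism_terms = []
    oob_sum, oob_count = [0.0] * n, [0] * n
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        model = fit_line(bx, by)
        err_boot = mean(sq_err(y, p) for y, p in zip(by, predict(model, bx)))
        errs_orig = [sq_err(y, p) for y, p in zip(ys, predict(model, xs))]
        # Optimism: error on the original data minus error on the bootstrap data.
        optimism_terms.append(mean(errs_orig) - err_boot)
        # Accumulate errors for observations left out of this bootstrap sample.
        drawn = set(idx)
        for i in range(n):
            if i not in drawn:
                oob_sum[i] += errs_orig[i]
                oob_count[i] += 1

    # Leave-one-out bootstrap error: average over observations of their
    # mean error in the samples that did not contain them.
    err1 = mean(s / c for s, c in zip(oob_sum, oob_count) if c > 0)
    err632 = 0.368 * apparent + 0.632 * err1

    # .632+: the weight grows with the relative overfitting rate r, measured
    # against the no-information error gamma (all response/prediction pairs).
    gamma = mean(sq_err(y, p) for y in ys for p in full_pred)
    err1c = min(err1, gamma)
    if err1c > apparent and gamma > apparent:
        r = (err1c - apparent) / (gamma - apparent)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    err632plus = (1.0 - w) * apparent + w * err1c

    return {"apparent_error": apparent, "optimism": mean(optimism_terms),
            "err1": err1, "err632": err632, "err632plus": err632plus}

# Made-up data for illustration only.
noise = random.Random(1)
xs = [i / 10 for i in range(30)]
ys = [1.0 + 2.0 * x + noise.gauss(0.0, 0.5) for x in xs]
result = bootstrap_validation(xs, ys, B=50, seed=1)
print(result)
```

The sketch reuses a fixed seed so that repeated calls give identical estimates, which mirrors the role of the seed argument and the seed.start component.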

SIDE EFFECTS:

To avoid scoping problems, the model fitting function is assigned to frame 1 using the name modelFit. If passOldData.predFun=T, the prediction algorithm assigns the training data to frame 1 using the name oldData. (Note that passOldData.predFun gets set to T automatically when gam is used.) If assign.frame1=T, the data is assigned to frame 1 using the name of the data frame or the name data. You must be sure that these assignments to frame 1 do not overwrite some quantity of interest stored in frame 1.

DETAILS:

Computes bootstrap estimates of prediction error for a wide range of models. The algorithm samples by selecting certain rows of a data frame, so this function is not generally applicable to grouped-data problems that use modeling functions like lme and nlme, unless you use the subject argument.
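
The row-selection scheme with group and subject can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the function name resample_indices is hypothetical, indices are 0-based, and it assumes that each group contributes as many resampled units as it originally had, which the text above does not spell out.

```python
import random
from collections import defaultdict

def resample_indices(n, group=None, subject=None, rng=None):
    """Return one bootstrap sample of row indices (0-based).

    group:   sampling is done separately within each group.
    subject: whole subjects are resampled instead of single rows.
    """
    rng = rng or random.Random()
    group = group if group is not None else [0] * n
    # The sampling unit is the subject if given, otherwise the row itself.
    units = subject if subject is not None else list(range(n))

    group_of_unit = {}
    rows_of_unit = defaultdict(list)
    for i in range(n):
        u, g = units[i], group[i]
        # Each subject may belong to only one group.
        if group_of_unit.setdefault(u, g) != g:
            raise ValueError("subject must be nested within group")
        rows_of_unit[u].append(i)

    units_in_group = defaultdict(list)
    for u, g in group_of_unit.items():
        units_in_group[g].append(u)

    indices = []
    for g, us in units_in_group.items():
        # Draw as many units as the group has, with replacement,
        # and take all rows of each drawn unit.
        for _ in range(len(us)):
            indices.extend(rows_of_unit[rng.choice(us)])
    return indices

# Made-up example: three subjects with two rows each.
subject = [1, 1, 2, 2, 3, 3]
print(resample_indices(6, subject=subject, rng=random.Random(0)))
```

Because all rows of a drawn subject enter the sample together, within-subject dependence is preserved in each bootstrap sample.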

Normally the first two arguments to predFun are the model object and new data. Most methods for predict (the default predFun) satisfy this. However, the first four arguments of predict.censorReg are currently object, p, q, newdata. To use it, you could either write your own predFun which calls predict.censorReg with the arguments in a different order, or supply args.predFun = list(p=c(.1,.5,.9),q=NULL); this results in internal calls of the form predict(model object, new data, p=c(.1,.5,.9), q=NULL). Because named arguments (p and q) take precedence, the new data will end up being used as the fourth argument to predict.censorReg, as desired.

Similarly, the first two arguments to errFun are normally the actual and fitted values of the response variable, but these may be displaced to later positions by named arguments in args.errFun.
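
The displacement of positional arguments by named ones, as described above for predFun and errFun, can be mimicked with a small Python sketch. This is a simplified model of the matching rule (exact names only, no partial matching and no "..." handling), and match_args is a hypothetical helper, not part of any library.

```python
def match_args(formals, positional, named):
    """Bind named arguments first; positional arguments then fill the
    remaining formal arguments in order."""
    unknown = [name for name in named if name not in formals]
    if unknown:
        raise TypeError(f"unused arguments: {unknown}")
    bound = dict(named)
    free = [f for f in formals if f not in bound]
    if len(positional) > len(free):
        raise TypeError("too many positional arguments")
    bound.update(zip(free, positional))
    return bound

# First four arguments of predict.censorReg, as listed in the text.
formals = ["object", "p", "q", "newdata"]
bound = match_args(formals,
                   ["<model object>", "<new data>"],
                   {"p": [0.1, 0.5, 0.9], "q": None})
print(bound["object"], bound["newdata"])
```

With p and q taken by name, the two positional values land in object and newdata, which is the behavior the args.predFun approach relies on.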

The combination of predFun and errFun, and their arguments, should be appropriate for your model. For example, in a logistic regression (glm with family=binomial), args.predFun=list(type="response") puts predictions on the probability scale, and errFun could compute a weighted sum of squares. The defaults are appropriate for the usual linear least-squares regression.
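
To make the logistic-regression case concrete, here is a minimal Python sketch of the two pieces involved: converting a linear predictor to the probability scale (what type="response" does for the default logit link) and a per-observation weighted squared error. The names to_probability and weighted_sq_err are hypothetical, and the choice of weights is left to the user.

```python
import math

def to_probability(link_value):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-link_value))

def weighted_sq_err(y, fitted, weights=None):
    """Per-observation weighted squared error w_i * (y_i - p_i)^2."""
    if weights is None:
        weights = [1.0] * len(y)
    return [w * (yi - pi) ** 2 for yi, pi, w in zip(y, fitted, weights)]

# Made-up responses and linear predictors.
y = [1, 0, 1]
fitted = [to_probability(v) for v in (0.0, -2.0, 3.0)]
print(weighted_sq_err(y, fitted, weights=[1.0, 2.0, 1.0]))
```

Comparing 0/1 responses with link-scale predictions instead would make the squared errors meaningless, which is why the prediction scale and the error function must be chosen together.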

REFERENCES:

Efron, B. and Tibshirani, R.J. (1995), "Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule," Technical Report (see http://www-stat.stanford.edu/~tibs/research.html).

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

EXAMPLES:

bootstrapValidation(ozone ~ ., air, lm, B = 40) 
bootstrapValidation(skips ~ ., data = solder.balance, glm, 
  B = 30, args.modelFit = list(family = poisson)) 
 
# stratified sampling 
bootstrapValidation(skips ~ ., data = solder.balance, glm, 
  B = 30, group = Solder, args.modelFit = list(family = poisson)) 
 
# bootstrapValidation.default method 
bootstrapValidation(air$wind, air$ozone, smooth.spline, B=30, predFun = 
  function(object, newdata) predict(object, x = newdata)$y) 
 
# model selection with smooth.spline 
attach(air) 
plot(ozone,temperature) 
tempErr <- rep(NA, 11) 
for(i in 1:11){ 
  cat("model", i, "\n") 
  res <- bootstrapValidation(ozone, temperature, smooth.spline, 
    args.modelFit = list(df = i+1), predFun = 
    function(object, newdata){predict(object, x = newdata)$y}, 
    B = 30) 
  tempErr[i] <- res$err632plus 
  } 
argminErr <- which(tempErr == min(tempErr))[1] + 1 
lines(smooth.spline(ozone,temperature, df = argminErr)) 
# note: this simple example ignores the variability 
# in the bootstrapValidation estimates, and just picks the 
# minimum error as the "winner" 
 
# local regression model 
bootstrapValidation(NOx ~ C * E, data = ethanol, loess, B = 30, 
  args.modelFit = list(span = 1/2, degree = 2, 
  parametric = "C", drop.square = "C", 
  control = loess.control("direct"))) 
 
# Check that two equivalent specifications give matching results: 
# 1. supply the prediction function 
bootp1 <- bootstrapValidation(ozone ~ ., air, lm, B = 40, predFun = 
  function(object, newdata, se.fit) predict.lm(object, 
  newdata, se.fit = T)$fit) 
# 2. supply the error function and args.errFun 
#    while still doing the same model 
bootp2 <- bootstrapValidation(ozone ~ ., air, lm, B = 40, errFun = 
  function(y, fitted, dim) ((y - fitted)^dim), 
  args.errFun = list(dim = 2), seed = bootp1$seed.start) 
all.equal(bootp1[-1], bootp2[-1]) 
# match except for calls