General Nonparametric Bootstrapping

DESCRIPTION:

Performs nonparametric bootstrapping for a wide range of statistics and sampling procedures, and summarizes the bootstrap distribution. The bootstrap function is generic (see Methods); method functions can be written to handle specific classes of data. Classes that already have methods for this function include:
, , .

USAGE:

bootstrap(data, statistic, B = 1000, args.stat, 
          group, subject, 
          sampler = samp.bootstrap, seed = .Random.seed, 
          sampler.prob, 
          sampler.args, sampler.args.group, 
          resampleColumns, 
          label, statisticNames, 
          block.size = min(100,B), 
          trace = resampleOptions()$trace, assign.frame1 = F, 
          save.indices, save.group, save.subject, 
          statistic.is.random, 
          group.order.matters = T, 
          order.matters, 
          seed.statistic = 500, 
          L, model.mat, argumentList,  
          observed.indices = 1:n, ...) 

See for further details of arguments marked with "*" (including important capabilities not described here), and for a description of arguments not described below.

REQUIRED ARGUMENTS:

data*
data to be bootstrapped. May be a vector, matrix, data frame, or output from a modeling function like .
statistic*
statistic to be bootstrapped; a function or expression that returns a vector or matrix. It may be a function which accepts data as the first unmatched argument; other arguments may be passed using args.stat.
Or it may be an expression such as mean(x,trim=.2). If data is given by name (e.g. data=x) then use that name in the expression, otherwise (e.g. data=air[,4]) use the name data in the expression. If data is a data frame, the expression may involve variables in the data frame.
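
As a rough illustration of what it means to evaluate statistic on resampled data, the sketch below draws simple bootstrap resamples by hand and computes a trimmed mean on each one. It uses only base S functions (set.seed, sample, mean, numeric) and is not the implementation used by bootstrap; the names x, n, B, indices, replicates, and observed are chosen here for illustration only.

```r
# Hand-rolled sketch of a simple bootstrap of a 20% trimmed mean.
# Assumption: plain resampling with replacement, one resample per iteration.
set.seed(0)
x <- stack.loss
n <- length(x)
B <- 1000
replicates <- numeric(B)
for(i in 1:B) {
  indices <- sample(1:n, n, replace = T)        # resampling indices
  replicates[i] <- mean(x[indices], trim = .2)  # statistic on the resample
}
observed <- mean(x, trim = .2)                  # statistic on the original data
```

Here the trimmed mean plays the role of statistic, and trim = .2 is the kind of extra argument that would otherwise be supplied through args.stat.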

OPTIONAL ARGUMENTS:

B*
number of bootstrap resamples to be drawn. This may be a vector, whose sum is the total number of resamples.
args.stat*
if statistic is a function, a list of other arguments, if any, to pass to statistic when calculating the statistic on the resamples, e.g. list(trim=.2). If statistic is an expression, then a list of objects to include in the frame where the expression is evaluated.
group*
vector of length equal to the number of observations in data, for stratified sampling or multiple-sample problems. Sampling is done separately for each group (determined by unique values of this vector). If data is a data frame, this may be a variable in the data frame, or an expression involving such variables.
subject*
vector of length equal to the number of observations in data; if present then subjects (determined by unique values of this vector) are resampled rather than individual observations. If data is a data frame, this may be a variable in the data frame, or an expression involving such variables. If group is also present, subject must be nested within group (each subject must be in only one group).
Under certain conditions bootstrap makes resampled subjects unique before calling the statistic.
sampler
function which generates resampling indices. The default, samp.bootstrap, generates simple bootstrap resamples. See for other existing samplers and details. May also be an expression such as samp.bootstrap(size = 100) for setting optional arguments to the sampler. See also argument sampler.args, described in .
seed*
seed for generating resampling indices; a legal seed, e.g. an integer between 0 and 1023.
sampler.prob*
list of vectors of probabilities to be used for importance sampling.
label
character string; if supplied, it is used when printing and as the main title for plotting.
statisticNames
character vector of length equal to the number of statistics calculated; if supplied, it is used as the statistic names for printing and plotting.
trace
logical flag indicating whether to print messages indicating progress. The default is determined by .
save.indices*
logical flag indicating whether to return the matrix of resampling indices, or the value 2 to return compressed indices; by default the choice is based on the sample size and B.
argumentList*
list of arguments to bootstrap. All arguments except data, statistic, group, and subject may be specified in this list, and their values override the values set by their regular placement in the argument list. See for examples.
...
other arguments described in or additional arguments which are passed to methods for bootstrap.
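
The group and subject arguments change which indices are drawn rather than how the statistic is computed. The sketch below shows one way such indices could be generated with base S functions only. It is an assumption about the general idea, not the code used by the samplers in this package, and it does not reproduce the step in which bootstrap may make resampled subjects unique. The names group, subject, members, ids, chosen, indices, and subj.indices are illustrative.

```r
set.seed(0)

# Stratified resampling: draw indices separately within each group,
# so every group keeps its original size.
group <- c("a", "a", "a", "b", "b")
n <- length(group)
indices <- numeric(n)
for(g in unique(group)) {
  members <- (1:n)[group == g]
  k <- length(members)
  # sample positions 1..k, then map them back to the group's row numbers
  indices[members] <- members[sample(k, k, replace = T)]
}

# Resampling by subject: draw whole subjects with replacement, then
# keep every observation belonging to each chosen subject.
subject <- rep(101:103, c(4, 5, 6))
ids <- unique(subject)
chosen <- ids[sample(length(ids), length(ids), replace = T)]
subj.indices <- unlist(lapply(chosen,
  function(id, s) (1:length(s))[s == id], s = subject))
```

With subject sampling the number of observations in a resample can vary (here between 12 and 18), because subjects contribute different numbers of rows.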

VALUE:

an object of class bootstrap which inherits from resamp. This has components call, observed, replicates, estimate, B, n (the number of observations or subjects), dim.obs, seed.start, and seed.end. Components which may be present include B.missing, weights (see sampler.prob), group, subject, label, defaultLabel, parent.frame (the frame of the caller of bootstrap), indices, compressedIndices, L, Lstar, and others. The data frame estimate has three columns containing the bootstrap estimates of Bias, Mean, and SE. See or for further details.
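
For orientation, the three columns of estimate can be read as simple summaries of the replicates. The sketch below computes them by hand for the mean of stack.loss, assuming the usual definitions (Mean is the average of the replicates, Bias is that average minus the observed value, and SE is the standard deviation of the replicates). These definitions are an assumption about the summaries, not a transcript of the package code, and the variable names are illustrative.

```r
# Hand computation of Bias, Mean, and SE from bootstrap replicates.
set.seed(0)
x <- stack.loss
B <- 1000
replicates <- numeric(B)
for(i in 1:B)
  replicates[i] <- mean(sample(x, length(x), replace = T))
observed <- mean(x)
boot.mean <- mean(replicates)
estimate <- data.frame(Bias = boot.mean - observed,  # bootstrap bias estimate
                       Mean = boot.mean,             # mean of the replicates
                       SE = sqrt(var(replicates)))   # bootstrap standard error
estimate
```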

If the function is interrupted it saves current results (all complete sets of block.size replicates) to .bootstrap.partial.results. This object is nearly the same as if bootstrap were called with a smaller value of B, so many functions that expect an object of class bootstrap will operate correctly. An exception is ; see the help file for a work-around.

The function bootstrap creates the dataset .Random.seed if it does not already exist; otherwise it updates its value.

DETAILS:

See other help files and for details.

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

A number of technical reports on aspects of the resampling code are found at www.insightful.com/Hesterberg/bootstrap

BUGS:

See .

SEE ALSO:

More details on arguments, including those not described here: , (describes different sampling options).

Bootstrap and other objects: , , .

Print, summarize, plot: , , , ,

Description of a "bootstrap" object, extract parts: , , , , .

Diagnostics: , .

Confidence intervals: , , , , .

Modify a "bootstrap" object: , , , .

For an annotated list of functions in the package, including other high-level resampling functions, see: .

EXAMPLES:

# Bootstrap a mean; demonstrate summary(), plot(), qqnorm() 
bootstrap(stack.loss, mean) 
temp <- bootstrap(stack.loss, mean) 
temp 
summary(temp) 
plot(temp) 
qqnorm(temp) 
 
# Percentiles 
limits.percentile(temp) 
 
# Confidence intervals 
limits.tilt(temp) 
limits.bca(temp) 
limits.bca(temp,detail=T) 
 
# Here the "statistic" argument is an expression, not a function. 
stack <- cbind(stack.loss, stack.x) 
bootstrap(stack, l1fit(stack[,-1], stack[,1])$coef, seed=0) 
 
# Again, but if the data is created on the fly, then 
# use the name "data" in the statistic expression: 
bootstrap(cbind(stack.loss, stack.x), 
          l1fit(data[,-1], data[,1])$coef, seed=0) 
temp <- bootstrap(stack, var)  # Here "statistic" is a function. 
parallel(~ temp$replicates)     # Interesting trellis plot. 
 
# Demonstrate the args.stat argument 
#   without args.stat: 
bootstrap(stack.loss, mean(stack.loss, trim=.2)) 
 
#   statistic is a function: 
bootstrap(stack.loss, mean, args.stat = list(trim=.2)) 
 
#   statistic is an expression, object "h" defined in args.stat 
bootstrap(stack.loss, mean(stack.loss, trim=h), 
          args.stat = list(h=.2)) 
 
# Bootstrap regression coefficients (in 3 equivalent ways). 
fit.lm <- lm(Mileage ~ Weight, fuel.frame) 
bootstrap(fuel.frame, coef(lm(Mileage ~ Weight, fuel.frame)), B = 250, 
          seed = 0) 
bootstrap(fuel.frame, coef(eval(fit.lm$call)), B = 250, seed = 0) 
bootstrap(fit.lm, coef, B = 250, seed = 0) 
 
# Bootstrap a nonlinear least squares analysis 
fit.nls <- nls(vel ~ (Vm * conc)/(K + conc), Puromycin, 
               start = list(Vm = 200, K = 0.1)) 
temp.nls <- bootstrap(Puromycin, coef(eval(fit.nls$call))) 
pairs(temp.nls$rep) 
plot(temp.nls$rep[,1], temp.nls$rep[,2]) 
contour(hist2d(temp.nls$rep[,1], temp.nls$rep[,2])) 
image(hist2d(temp.nls$rep[,1], temp.nls$rep[,2])) 
 
# Jackknife after bootstrap 
jackknifeAfterBootstrap(temp.nls) 
jackknifeAfterBootstrap(temp.nls, stdev) 
 
# Bootstrap the calculation of a covariance matrix 
my.x <- runif(2000) 
my.dat <- cbind(x=my.x, y=my.x+0.5*rnorm(2000)) 
bootstrap(my.dat, var) 
 
# Perform a jackknife analysis. 
jackknife(stack.loss, mean) 
 
## Two-sample problems 
 
# Bootstrap the distribution of the difference of two group means 
#  (group sizes vary across bootstrap samples) 
West <- (as.character(state.region) == "West") 
Income <- state.x77[,"Income"] 
bootstrap(data.frame(Income, West), 
          mean(data[ data[,"West"],"Income"]) - 
          mean(data[!data[,"West"],"Income"])) 
 
# Stratified bootstrapping for difference of group means 
# (resampling is done separately within "West" and "not West", so 
#  group sizes are constant across bootstrap samples) 
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West) 
 
# Different sampling mechanisms 
# Permutation distribution for the difference in two group means, 
#  under the hypothesis of one population. 
# Note that either the group or response variable is permuted, not 
# both. 
bootObj <- bootstrap(Income, sampler = samp.permute, 
                     mean(Income[West])-mean(Income[!West])) 
1 - mean(bootObj$replicates < bootObj$observed)  # one-sided p-value 
 
# Balanced bootstrap 
bootstrap(stack.loss, mean, sampler=samp.boot.bal) 
 
# Bootstrapping unadjusted residuals in lm (2 equivalent ways) 
fit.lm <- lm(Mileage~Weight, fuel.frame) 
resids <- resid(fit.lm) 
preds  <- predict(fit.lm) 
bootstrap(resids, lm(resids+preds~fuel.frame$Weight)$coef, B=250, seed=0) 
bootstrap(fit.lm, coef, lmsampler="resid", B=250, seed=0) 
 
# Bootstrapping other fitted models: gam 
fit.gam <- gam(Kyphosis ~ s(Age,4) + Number, family = binomial, 
              data = kyphosis) 
bootstrap(fit.gam, coef, B=100) 
 
# Bootstrap when patients have varying number of cases: 
# sampling by subject 
DF <- data.frame(ID=rep(101:103, c(4,5,6)), x=1:15) 
DF  # Patient 101 has 4 cases, 102 has 5, 103 has 6. 
bootstrap(DF, mean(x), subject=ID) 
 
## Bootstrap bagging: a classification tree 
# The first column of data set kyphosis is the 
# response variable Kyphosis, with values "present" or "absent" 
kyph.pred <- predict(tree(kyphosis, minsize = 5)) 
# The apparent misclassification rate 
n <- numRows(kyphosis) 
mean(kyph.pred[cbind(1:n, kyphosis$Kyphosis)] < .5)  # 0.02469136 
# bootstrap to get an averaged tree and predict on the original data 
my.kyphosis <- kyphosis 
kyph.pred.boot <- bootstrap(kyphosis, predict(tree(kyphosis, 
    minsize = 5), newdata = my.kyphosis), B = 100, seed = 10) 
# The row names for the replicates are made using the row names of the 
# original data and the abbreviated response values. 
rows <- dimnames(kyphosis)[[1]] 
kyph.names <- paste(rows, abbreviate(kyphosis$Kyphosis,5), sep = ".") 
# The apparent misclassification rate for the averaged tree is 
# higher, but more realistic as a measure of predictive error. 
mean(kyph.pred.boot$estimate[kyph.names, "Mean"] < .5)  # 0.03703704 
 
## Run in background 
For(1, temp <- bootstrap(stack.loss, mean, B=1000), wait=F)