General Nonparametric Bootstrapping

DESCRIPTION:

Performs bootstrap resampling of observations from specified data, for specified statistics, and summarizes the bootstrap distribution.

USAGE:

bootstrap(data, statistic, B=1000, args.stat=NULL, 
          group=NULL, sampler=samp.boot.mc, seed=.Random.seed,  
          sampler.setup, sampler.wrapup, block.size=min(100,B),  
          trace=T, assign.frame1=F, save.indices=F,  
          statistic.is.random, seed.statistic=500) 

REQUIRED ARGUMENTS:

data
data to be bootstrapped. May be a vector, matrix, or data frame.
statistic
statistic to be bootstrapped: a function or expression that returns a vector or matrix. It may be a function that accepts data as the first unnamed argument; other arguments may be passed to the function through args.stat. Or it may be an expression such as mean(x, trim=.2), where x is the name of the object passed as the data argument. If the data object has a name (e.g. data=x), use that name in the expression. If the data argument is constructed within the call to bootstrap (e.g. data=df$y), refer to the data as data in the expression, e.g. mean(data, trim=.2). See examples below.

OPTIONAL ARGUMENTS:

B
number of bootstrap resamples to be drawn. We recommend at least 250 to estimate standard errors and 1000 to estimate percentiles. This may be a vector, whose sum is the total number of resamples.
args.stat
list of other arguments, if any, passed to statistic when calculating the statistic on the resamples.
group
allows stratified sampling and bootstrapping multi-sample problems. The unique values of this vector determine groups. For each resample, a bootstrap sample is drawn separately for each group, and the observations are combined to give the full resample. The statistic is calculated for the resample as a whole.
sampler
function which generates resampling indices. The samp.boot.mc function generates simple Monte Carlo resamples. The samp.boot.bal function performs balanced bootstrapping. The user may write additional functions.
seed
seed for generating resampling indices. May be a legal random number seed or an integer between 0 and 1000 which will be passed to set.seed.
sampler.setup
function which performs initialization before calling sampler(). By default, sets the random number seed.
sampler.wrapup
function which performs wrapup after calling sampler(). By default, records the ending value of the random number seed.
block.size
control variable specifying the number of resamples to calculate at once. bootstrap uses an lapply() within a for() loop (within two nested for() loops if B is a vector). For small sample sizes, a single lapply() is reasonable, while for large sample sizes, a series of separate lapply()s is more efficient.
trace
logical flag indicating whether the algorithm should print a message indicating which set of replicates is currently being drawn.
assign.frame1
logical flag indicating whether the resampled data should be assigned to frame 1 before evaluating the statistic. This may be necessary if the statistic is reevaluating the call of a model object. If all bootstrap estimates are identical, try setting assign.frame1=T. Note that this will slow down the algorithm.
save.indices
logical flag indicating whether to save the matrix of resampling indices.
statistic.is.random
logical flag indicating whether the statistic itself performs randomization, in which case we need to keep track of two parallel seeds, one for the sampling and one for the statistic. If this argument is missing, the algorithm will attempt to determine if the statistic involves randomization by evaluating it and checking whether the random seed has changed.
seed.statistic
random number seed to be used for the statistic if it uses randomization.
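
The stratified sampling described under group can be illustrated with a minimal sketch in base S/R syntax. This is not the library's implementation: strat.indices is a hypothetical helper, and the n-by-B layout of the index matrix is an assumption made for illustration. It draws indices with replacement separately within each group, so every resample keeps the original group sizes.

```r
# Sketch only: stratified resampling indices.
# strat.indices is a hypothetical helper, not part of the bootstrap library.
strat.indices <- function(group, B) {
  n <- length(group)
  idx <- matrix(0, nrow = n, ncol = B)   # one column per resample (assumed layout)
  for (g in unique(group)) {
    rows <- which(group == g)            # positions belonging to this group
    for (b in 1:B) {
      # draw with replacement from this group's positions only
      pick <- sample(length(rows), length(rows), replace = TRUE)
      idx[rows, b] <- rows[pick]
    }
  }
  idx
}

set.seed(1)
group <- c("a", "a", "a", "b", "b")
idx <- strat.indices(group, B = 4)
```

Each column of idx indexes one full resample; the statistic would then be evaluated on the combined observations, as described for the group argument.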

VALUE:

an object of class bootstrap which inherits from resamp. This has components call, observed, replicates, estimate, B, n, dim.obs, group, seed.start, and seed.end. The data frame estimate has three columns containing the bootstrap estimates of Bias, Mean, and SE.
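
As a rough illustration of what the three columns of estimate represent, the sketch below computes them from a set of replicates using the conventional definitions (Mean is the average of the replicates, Bias is Mean minus the observed value, SE is the standard deviation of the replicates). The helper boot.estimate is hypothetical, and the exact formulas used by bootstrap (for example the variance divisor) are an assumption here.

```r
# Sketch only: conventional bootstrap summaries of Bias, Mean, and SE.
# boot.estimate is a hypothetical helper; column names follow the text above.
boot.estimate <- function(observed, replicates) {
  replicates <- as.matrix(replicates)      # B rows, one column per statistic
  Mean <- apply(replicates, 2, mean)       # average of the replicates
  SE <- sqrt(apply(replicates, 2, var))    # standard deviation of the replicates
  data.frame(Bias = Mean - observed, Mean = Mean, SE = SE)
}

set.seed(2)
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
# 500 simple Monte Carlo resamples of the mean
reps <- sapply(1:500, function(i) mean(sample(x, replace = TRUE)))
est <- boot.estimate(mean(x), reps)
```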

SIDE EFFECTS:

If assign.frame1=T, the user must be sure that this assignment does not overwrite some quantity of interest stored in frame 1.

If the function is interrupted it will save current results (all complete sets of block.size replicates) to .bootstrap.partial.results. This object is nearly the same as if bootstrap were called with a smaller value of B, so many functions that expect an object of class bootstrap will operate correctly. An exception is update; see the help file for update.bootstrap for a workaround.

The function bootstrap creates the dataset .Random.seed if it does not already exist; otherwise it updates its value.

DETAILS:

Performs nonparametric bootstrapping of observations for a wide range of statistics and expressions. Multisample bootstrapping is supported through the group argument. With balanced bootstrapping (sampler=samp.boot.bal), balancing is done separately within each group of resamples (each set of block.size replicates). The balanced bootstrap is biased, of order O(1/block.size). It is useful for estimating the bias of a statistic, but should be avoided for estimating standard errors or confidence limits.
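
The idea behind balanced resampling can be sketched as follows, assuming the usual construction in which B copies of the indices 1:n are pooled and randomly permuted. This is an illustration only, not the code of samp.boot.bal; balanced.indices is a hypothetical helper and the n-by-B layout is an assumption.

```r
# Sketch only: balanced bootstrap indices for one set of B resamples.
# balanced.indices is a hypothetical helper, not samp.boot.bal itself.
balanced.indices <- function(n, B) {
  pool <- rep(1:n, B)                        # every observation appears B times
  matrix(sample(pool), nrow = n, ncol = B)   # permute; one column per resample
}

set.seed(3)
idx <- balanced.indices(n = 5, B = 8)
counts <- table(idx)   # how often each observation is used across all resamples
```

Across the whole set, each observation is used exactly B times, although individual resamples still vary. Applying such a construction once per set of block.size resamples would balance within each set rather than across all B resamples.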

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and Their Application. Cambridge University Press.

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. San Francisco: Chapman & Hall.

Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer-Verlag.

EXAMPLES:

# Bootstrap a mean; demonstrate summary(), plot(), qqnorm() 
bootstrap(stack.loss, mean) 
temp <- bootstrap(stack.loss, mean) 
temp 
summary(temp) 
plot(temp) 
qqnorm(temp) 

# Confidence intervals 
limits.emp(temp) 
limits.bca(temp) 
limits.bca(temp,detail=T) 

# Here statistic argument is a call, not a function. 
stack <- cbind(stack.loss,stack.x) 
bootstrap(stack,l1fit(stack[,-1],stack[,1])$coef, seed=0) 

# Again, but construct the data in the call 
bootstrap(cbind(stack.loss,stack.x), 
          l1fit(data[,-1],data[,1])$coef, seed=0) 

# Here statistic argument is a function. 
temp <- bootstrap(stack,var) 
parallel(~ temp$rep)     # Interesting trellis plot. 

# Bootstrap regression coefficients (in 2 different ways). 
fit.lm <- lm(Mileage~Weight,fuel.frame) 
bootstrap(fuel.frame, coef(lm(Mileage~Weight,fuel.frame))) 
bootstrap(fuel.frame, coef(eval(fit.lm$call))) 

# Bootstrap a nonlinear least squares analysis 
fit.nls <- nls(vel ~ (Vm * conc)/(K + conc), Puromycin,  
               start = list(Vm = 200, K = 0.1)) 
temp.nls <- bootstrap(Puromycin,coef(eval(fit.nls$call)), B=1000) 
pairs(temp.nls$rep) 
plot(temp.nls$rep[,1],temp.nls$rep[,2]) 
contour(hist2d(temp.nls$rep[,1],temp.nls$rep[,2])) 
image(hist2d(temp.nls$rep[,1],temp.nls$rep[,2])) 

# Jackknife after bootstrap 
jack.after.bootstrap(temp.nls) 
jack.after.bootstrap(temp.nls, stdev) 

# Bootstrap the calculation of a covariance matrix 
my.x <- runif(2000) 
my.dat <- cbind(x=my.x,y=my.x+0.5*rnorm(2000)) 
bootstrap(my.dat,var,B=1000) 

# Perform a jackknife analysis. 
jackknife(stack.loss,mean) 

## Two-sample problems 
# Bootstrap the distribution of the difference of two group means 
#  (group sizes will vary across bootstrap samples) 
West <- (as.character(state.region) == "West") 
Income <- state.x77[,"Income"] 
bootstrap(cbind(Income, West), 
          mean(data[ data[,"West"],"Income"]) - 
          mean(data[!data[,"West"],"Income"])) 

# Stratified bootstrapping for difference of group means 
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West) 

## Different sampling mechanisms 
# Permutation distribution for the difference in two group means, 
#  under the hypothesis of one population. 
# Note that either the group or response variable is permuted, not both. 
bootObj <- bootstrap(Income, sampler = samp.permute, 
                     mean(Income[West])-mean(Income[!West])) 
1 - mean(bootObj$replicates < bootObj$observed)  # one-sided p-value 

# Balanced bootstrap 
bootstrap(stack.loss, mean, sampler=samp.boot.bal) 

# Bootstrapping unadjusted residuals in lm. 
fit.lm <- lm(Mileage~Weight, fuel.frame) 
resids <- resid(fit.lm) 
preds  <- predict(fit.lm) 
bootstrap(resids, lm(resids+preds~fuel.frame$Weight)$coef) 

# Bootstrap when patients have varying number of cases.
DF <- data.frame(ID=rep(101:103, c(4,5,6)), x=1:15)
DF  # Patient 101 has 4 cases, 102 has 5, 103 has 6.
index.list <- split(1:nrow(DF), DF$ID)
# The "data" argument to bootstrap is index.list; each element
# of this list corresponds to one patient.
#
# If statistic is a function, it must take (a resampled version of)
# index.list as its first argument, then extract the corresponding
# rows of the "real" data DF. Here myRealFunction is a placeholder
# for the user's own function of a data frame:
stat <- function(index.list) myRealFunction(DF[unlist(index.list),])
bootstrap(index.list, stat)
#
# If statistic is an expression, it should use index.list
bootstrap(index.list, myRealFunction(DF[unlist(index.list),]))

## Run in background 
For(1, temp <- bootstrap(stack.loss, mean, B=1000), wait=F)