Bootstrap arguments

DESCRIPTION:

Detailed descriptions of arguments to bootstrap; the same arguments are used in many other resampling functions. A shorter description, omitting some arguments, is found in .

USAGE:

bootstrap(data, statistic, B = 1000, args.stat, 
          group, subject, 
          sampler = samp.bootstrap, seed = .Random.seed, 
          sampler.prob, 
          sampler.args, sampler.args.group, 
          resampleColumns, 
          label, 
          statisticNames, 
          block.size = min(100,B), 
          trace = resampleOptions()$trace, 
          assign.frame1 = F,  
          save.indices = <<see below>>, 
          save.group = <<see below>>, 
          save.subject = <<see below>>, 
          statistic.is.random, 
          group.order.matters = T, 
          order.matters, 
          seed.statistic = 500, 
          L = NULL, model.mat, argumentList,  
          observed.indices = 1:n, ...) 

REQUIRED ARGUMENTS:

data
data to be bootstrapped. May be a vector, matrix, or data frame. Bootstrapping is generally faster if data is an ordinary matrix or vector rather than a data frame.

May also be the output of a modeling function like lm; see Details below.

statistic
statistic to be bootstrapped: a function or expression that returns a vector or matrix (not a data frame).

The statistic may be a function (e.g. mean) which accepts data as the first unmatched argument; other arguments may be passed to the function through args.stat, e.g. args.stat=list(trim=.2).

By "unmatched", we mean the first argument that is not given by name in args.stat. E.g. if your function is f(a,b,...) and you specify args.stat=list(a=3) then the data is passed as the b argument to your function.
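
The matching rule can be illustrated without calling bootstrap at all. The following is a minimal sketch, assuming a hypothetical function f and using plain do.call to mimic how the data would land in the first argument not named in args.stat; it illustrates the rule only and is not bootstrap's internal code.

```r
# Hypothetical statistic with two named arguments (not part of the library).
f <- function(a, b, ...) a * mean(b)

x <- c(2, 4, 6)
args.stat <- list(a = 3)

# The data is supplied positionally next to the named list, so it is
# matched to the first argument that is not already named: here, b.
result <- do.call("f", c(list(x), args.stat))
result   # 3 * mean(c(2, 4, 6))
```

If args.stat were empty, the same call would pass the data as a instead.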

Or the statistic may be an expression such as mean(x, trim=.2). If the data object has a simple name (e.g. data=x), use that name ("x") in the expression; otherwise (e.g. data=df$y) use the name "data" in the expression, e.g. mean(data, trim=.2).

If data is a data frame, the expression may involve variables in the data frame, e.g. data=air,statistic=mean(ozone/wind).

The following types of expressions are not allowed: an expression that returns a function or a function name (e.g. statistic = object$fun, where object$fun contains the function function(x) mean(x) or the name mean); or an expression that returns an expression (e.g. statistic = object$stat, where object$stat contains the expression mean(x, trim=.2)). On the other hand, statistic may be the name of an object that contains a function or an expression. For example, statistic = fun, where fun contains function(x) mean(x), or statistic = stat.expr, where stat.expr contains mean(x, trim=.2).

OPTIONAL ARGUMENTS:

B
number of bootstrap resamples to be drawn. We recommend at least 250 to estimate standard errors and 1000 to estimate percentiles, or at least 2000 for BCa confidence intervals (the last figure is a topic of current research and may change).
This may be a vector, whose sum is the total number of resamples. In this case the first B[1] samples are generated, then the next B[2], and so on; also see the sampler.prob argument.
args.stat
if statistic is a function, a list of other arguments, if any, to pass to statistic when calculating the statistic on the resamples. The names of the list are used as argument names.

If statistic is an expression, then args.stat is a list of objects which should be included in the frame where the expression is evaluated; names of the list are used as object names. For example, statistic=mean(x,trim=alpha),args.stat=list(alpha=alphaVector[i]) indicates that alpha is given the value of alphaVector[i] in a place where it can be found when the statistic is evaluated.

group
vector of length equal to the number of observations in data, for stratified sampling or multiple-sample problems. Sampling is done separately for each group (determined by unique values of this vector), and indices are combined to create a full resample. The statistic is calculated for the resample as a whole.

If data is a data frame, this may be a variable in the data frame, or an expression involving such variables, e.g. data=lung, group=sex or data=lung, group=age<50.
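
A minimal sketch of stratified resampling, using base functions only: indices are drawn with replacement within each group and then combined into one resample. The toy group vector is an assumption for illustration, and this is not bootstrap's own implementation.

```r
# Toy data: six observations in two groups (illustrative values).
group <- c("a", "a", "a", "b", "b", "b")
n <- length(group)

# Row indices belonging to each group.
byGroup <- split(1:n, group)

# Resample indices with replacement within each group, then combine.
set.seed(0)
resampled <- lapply(byGroup,
                    function(idx) idx[sample(length(idx), replace = TRUE)])
indices <- unlist(resampled)

# Each group keeps its original size in the combined resample.
table(group[indices])
```

The statistic would then be evaluated once on the data subscripted by indices, not separately per group.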

subject
vector of length equal to the number of observations in data; if present then subjects (determined by unique values of this vector) are resampled rather than individual observations. If data is a data frame, this may be a variable in the data frame, or an expression involving such variables. If group is also present this must be nested within group (a single subject may not be present in multiple groups).

If subject is the name of a variable in the data frame (for example data=Orthodont, subject=Subject), then bootstrap makes resampled subjects unique; that is, duplicated subjects in a given resample are assigned distinct subject values in the resampled data frame before the statistic is evaluated; this is useful for longitudinal and other modeling where the statistic expects subjects to have unique values.

Unique subject values are not assigned if subject is not a variable in the data frame, or if the subject variable is not referred to solely by name (e.g. subject=Orthodont$Subject, subject=Orthodont[,3], or subject=Orthodont[,"Subject"]).
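
The idea of resampling subjects rather than rows can be sketched as follows. The toy subject vector and the relabeling scheme (1, 2, ... per draw) are assumptions for illustration; the labels bootstrap actually assigns may differ.

```r
# Toy longitudinal data: three subjects with 2, 3 and 1 rows.
subject <- c(101, 101, 102, 102, 102, 103)
ids <- sort(unique(subject))

# Draw subjects, not rows, with replacement.
set.seed(0)
drawn <- ids[sample(length(ids), replace = TRUE)]

# Expand each drawn subject to all of its rows.
rows <- unlist(lapply(drawn,
                      function(id) (1:length(subject))[subject == id]))

# Give every draw its own label, so a subject drawn twice appears as
# two distinct subjects in the resampled data.
counts <- sapply(drawn, function(id) sum(subject == id))
newSubject <- rep(1:length(drawn), counts)
```

Note that the number of rows in a resample can vary, because subjects contribute different numbers of rows.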

sampler
function which generates resampling indices. The default, samp.bootstrap, generates simple bootstrap resamples. See for other existing samplers, and for details on writing your own sampler.

sampler may also be an expression such as samp.bootstrap(size = 100) for setting optional arguments to the sampler. Arguments set in this way override those set by sampler.args. If you do this, do not include the n and B arguments to the sampler; they are generated automatically.

seed
seed for generating resampling indices; an integer between 0 and 1023, or other legal input to .
sampler.prob
NULL, vector of probabilities of length n (the number of observations or subjects), or list of the same length as B, each of whose elements is NULL or a vector of length n; the jth element of this list is used for B[j] samples.

This argument is used for importance sampling. Sampling is done with the specified probabilities, but bootstrap also creates a vector of weights, which is used when computing estimates (mean, bias, quantiles, etc.) to counteract the importance sampling bias. The result is that all estimates are for a target distribution of sampling with equal probabilities. In the long run you get the same results using importance sampling as with equal-probability sampling; in the short term there may be less Monte Carlo variability, with appropriately chosen probabilities.
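
The reweighting idea can be sketched in a few lines. This assumes the usual likelihood-ratio form of importance weights (equal-probability design divided by the design actually used) and made-up data and probabilities; the weights bootstrap stores may be scaled differently.

```r
x <- c(1, 2, 3, 4, 10)
n <- length(x)
p <- c(0.1, 0.1, 0.2, 0.2, 0.4)   # sampling probabilities (assumed)

set.seed(0)
B <- 2000
replicates <- numeric(B)
w <- numeric(B)
for(b in 1:B) {
  idx <- sample(n, n, replace = TRUE, prob = p)
  replicates[b] <- mean(x[idx])
  # Likelihood ratio: equal-probability sampling versus the design used.
  w[b] <- prod((1/n) / p[idx])
}

# The weighted average of the replicates targets the equal-probability
# bootstrap, whose mean for this statistic is mean(x).
est <- sum(w * replicates) / sum(w)
```

The unweighted mean(replicates) is pulled toward the heavily sampled observation, while est should land close to mean(x).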

To get estimates for other target distributions (if you want bootstrap distributions that correspond to weighted empirical distributions) use , as a post-processing step; this may be done whether or not you specified sampler.prob.

sampler.args
list of additional arguments to pass to sampler. An alternative to passing sampler.args is to give the arguments when calling sampler, see above.
sampler.args.group
list of length equal to the number of groups. Each component is a list (possibly NULL), containing additional arguments to pass to sampler for that group. The list sampler.args.group may be named, in which case the names must match the unique values of argument group. Otherwise the list is assumed to be ordered with respect to the sorted, unique values of group. Arguments sampler.args and sampler.args.group may be used simultaneously, in which case the values from sampler.args.group take precedence. This is ignored if not sampling by group.

Suppose you are doing stratified sampling, say with strata sizes 50 and 70, and that you want bootstrap samples of size 49 and 69 (to avoid downward bias in standard errors). You may do this using sampler.args.group = list(list(size=49), list(size=69)). Alternately, you may use sampler.args=list(reduceSize=1).

resampleColumns
numerical, logical, or character, for subscripting columns of the data. If supplied, then only those columns of the data are resampled. This is useful for permutation tests; for example, for a permutation test of the correlation between two variables, permute only one of them.
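
A minimal sketch of the permutation idea, written as an explicit loop with base functions: only one column is permuted, which breaks any association between the two variables while leaving each marginal distribution intact. The toy matrix is an assumption for illustration.

```r
set.seed(0)
dat <- cbind(u = 1:10, v = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9))
observed <- cor(dat[, "u"], dat[, "v"])

B <- 500
replicates <- numeric(B)
for(b in 1:B) {
  perm <- dat
  # Permute only column v; column u keeps its original order.
  perm[, "v"] <- dat[sample(nrow(dat)), "v"]
  replicates[b] <- cor(perm[, "u"], perm[, "v"])
}

# One-sided Monte Carlo p-value for positive correlation.
pval <- mean(replicates >= observed)
```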
label
character, if supplied is used when printing, and as the main title for plotting. Otherwise a default label is used when plotting, and no label is used when printing.
statisticNames
character vector of length equal to the number of statistics calculated; if supplied is used as the statistic names for printing and plotting.
block.size
control variable specifying the number of resamples to calculate at once. bootstrap uses nested for() loops: an outermost loop if B is a vector; a loop over blocks, in which all indices for block.size resamples are generated simultaneously; and an inner loop in which the statistic is calculated for each resample. The tradeoff is that if block.size*n is too large then the matrix of resampling indices may be large, while if block.size is small then random number generators are called more often, which entails extra overhead.

The block.size argument also affects the quality of some samplers. For example, balanced bootstrapping balances separately within each block of resamples. This is biased, with bias of order O(1/block.size), so increasing block.size reduces the bias.
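
The block structure can be sketched as follows for a scalar B. This is an illustration only, with a while loop standing in for the loop over blocks and toy data assumed; it is not bootstrap's actual code.

```r
x <- c(5, 3, 8, 1, 9, 2)
n <- length(x)
B <- 10
block.size <- 4

set.seed(0)
replicates <- numeric(B)
done <- 0
while(done < B) {
  nb <- min(block.size, B - done)
  # One call generates indices for a whole block: n rows, nb columns.
  inds <- matrix(sample(n, n * nb, replace = TRUE), nrow = n)
  # Inner loop: evaluate the statistic one resample at a time.
  for(j in 1:nb) replicates[done + j] <- mean(x[inds[, j]])
  done <- done + nb
}
```

Here the index matrix never holds more than n*block.size entries, and the random number generator is called once per block rather than once per resample.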

trace
logical flag indicating whether the algorithm should print a message indicating which set of replicates is currently being drawn. The default is determined by resampleOptions.
assign.frame1
logical flag indicating whether the resampled data should be assigned to frame 1 before evaluating the statistic. This may be necessary if the statistic is reevaluating the call of a model object. If all bootstrap estimates are identical, try setting assign.frame1=T. For examples where this is necessary see . Note that this slows down the algorithm, and may cause memory use to grow.
save.indices
either a logical flag indicating whether to return the matrix of resampling indices, the integer 2 indicating to return a compressed version of the indices, or NULL (the default) indicating to decide based on n and B. If not saved these can generally be recreated, by . Saving them speeds up some later calculations such as for and that need these indices. By default the indices are saved if n*B <= 20,000, and a compressed version (about 16 times smaller) if n*B <= 500,000. The compressed version is based on frequencies and loses information about the order that observations appear in bootstrap samples, so should be avoided if your statistic depends on the order of the data.
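
Why a frequency-based representation loses ordering can be seen in a small sketch. The storage format bootstrap actually uses is not described here; the code below only assumes that counts per observation are what is kept.

```r
n <- 5
set.seed(0)
inds <- sample(n, n, replace = TRUE)   # one resample, order preserved

# Frequency form: how many times each observation was drawn.
freq <- tabulate(inds, n)

# Indices rebuilt from the frequencies come back sorted, so the
# original order of the draws cannot be recovered.
rebuilt <- rep(1:n, freq)
```

A statistic such as mean gives the same value on x[inds] and x[rebuilt]; a statistic that depends on order (for example a lag-1 autocorrelation) generally does not.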
save.group
logical flag indicating whether to return the group vector. Default is TRUE if number of observations is <= 10000, FALSE otherwise. If not saved these can generally be recreated if needed, by .
save.subject
logical flag indicating whether to return the subject vector. Default is TRUE if number of observations is <= 10000, FALSE otherwise. If not saved these can generally be recreated if needed, by .
statistic.is.random
logical flag indicating whether the statistic itself performs randomization, in which case we need to keep track of two parallel seeds, one for the sampling and one for the statistic. If this argument is missing, the algorithm attempts to determine if the statistic involves randomization by evaluating it and checking whether the random seed has changed.
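
The detection idea (evaluate the statistic and see whether the seed moved) can be sketched as below. The helper name usesRandomness is hypothetical, and the check bootstrap performs internally may differ in detail.

```r
set.seed(0)   # ensures .Random.seed exists before it is read

# Hypothetical helper: TRUE if evaluating stat(data) changes the seed.
usesRandomness <- function(stat, data) {
  before <- .Random.seed
  stat(data)
  after <- .Random.seed
  any(before != after)
}

deterministic <- usesRandomness(mean, 1:10)
randomized <- usesRandomness(function(x) mean(sample(x)), 1:10)
```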
group.order.matters
indicates whether to maintain the order of groups during resampling. For example, if the data for one group occupy rows 51-100 and 110-115 of data and group.order.matters = T, then data for that group occupy those rows in each resample. Note that if you want group sample sizes different from those of the original sizes you need group.order.matters = F (see Examples, below). Ignored if not sampling by group.
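
A minimal sketch of what maintaining group order means: the resampled rows for each group are written back into that group's original row positions. The toy group vector is an assumption, and the code is an illustration rather than bootstrap's implementation.

```r
# Groups are interleaved in the original data.
group <- c("a", "b", "a", "b", "a")
n <- length(group)

set.seed(0)
indices <- integer(n)
for(g in unique(group)) {
  rows <- (1:n)[group == g]
  # Resampled rows for group g go back into g's own positions.
  indices[rows] <- rows[sample(length(rows), replace = TRUE)]
}

# The group layout of the resample matches the original layout.
group[indices]
```

This layout is only possible when each group's resample has the same size as the original group, which is why different group sample sizes require group.order.matters = F.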
order.matters
this should be NULL or FALSE for the ordinary bootstrap. If TRUE or character such as "resampling residuals", then the order of observations matters, and some functions such as and that are only for the ordinary bootstrap are disabled; the character string is printed. This is set by when resampling residuals.
seed.statistic
random number seed to be used for the statistic if it uses randomization.
L
empirical influence values. This may be a matrix with n rows (number of observations or subjects in data) and p columns (length of the returned statistic). Or it may be a string, one of "jackknife", "influence", "regression", "ace", or "choose"; the influence function values are then calculated using the corresponding method; see and . L="choose" corresponds to calling with method=NULL. The default L=NULL corresponds to not computing influence values. Influence values are used by a variety of downstream functions, and can be created as needed if not stored initially.

If subject is supplied and L is numerical, the rows of L should correspond to the sorted unique values of subject.

For bootstrap, if L is NULL it is set equal to the data when all of the following hold: the statistic is either mean or colMeans, there are no additional arguments (such as na.rm or trim), and subject is not used.
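
For the mean, the data and the jackknife influence values differ only by a constant. The sketch below assumes one common jackknife definition, (n-1) times the difference between the full-sample statistic and the leave-one-out statistic; the library's own scaling and centering conventions may differ.

```r
x <- c(2, 4, 4, 7, 13)
n <- length(x)

# Leave-one-out means.
looMeans <- sapply(1:n, function(i) mean(x[-i]))

# A common jackknife definition of the empirical influence values.
L.jack <- (n - 1) * (mean(x) - looMeans)

# For the mean these are exactly the centered data, so the
# differences below are zero up to rounding.
L.jack - (x - mean(x))
```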

model.mat
matrix with n rows, one for each observation or subject, or a formula that defines such a model matrix. If supplied, and L is "regression" or "ace", then this is used when calculating influence function values; see .
argumentList
list of arguments to bootstrap. All arguments except data, statistic, group, and subject may be specified in this list, and their values override the values set by their regular placement in the argument list.
observed.indices
vector of indices; the observed value of the statistic will be computed using these rows of data. For hierarchical data, the indices are applied to the sorted values of subject. The default is to use all observations or subjects.
...
additional arguments which are passed to methods for bootstrap. Currently only the lm method has an extra argument, lmsampler.

SIDE EFFECTS:

If assign.frame1=T, you must be sure that this assignment does not overwrite some quantity of interest stored in frame 1.

The function causes creation of the dataset .Random.seed if it does not already exist, otherwise its value is updated.

DETAILS:

If statistic is an expression, then bootstrap does eval(statistic, local=list(dataName = (resampled data))), where dataName is either "data" or the name of the original data object. If args.stat is supplied, it should be a list, and eval is called with local=c(list(dataName = (resampled data)), args.stat).
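
The evaluation step can be mimicked with base functions. In this sketch the frame list is passed to eval positionally as its second argument (the argument called local above), and the data, the expression and alpha are made-up values for illustration.

```r
x <- c(1, 2, 3, 4, 100)
statistic <- expression(mean(x, trim = alpha))
args.stat <- list(alpha = 0.2)

set.seed(0)
resampled <- x[sample(length(x), replace = TRUE)]

# The resampled data is bound to the original data name ("x"), and the
# args.stat objects are made visible alongside it.
frame <- c(list(x = resampled), args.stat)
value <- eval(statistic, frame)
```

Because x inside the expression resolves to the resampled data in frame, the original x is never modified.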

Special methods are used when data is the output from a modeling function like lm. In this case, the special syntax
bootstrap(fit, statistic)
is allowed, where fit is the output of a modeling function. For example
fit.lm <- lm(Mileage~Weight, data=fuel.frame)
bootstrap(fit.lm, coef, B=500, seed=0)
The results are identical to
bootstrap(fuel.frame, coef(lm(Mileage~Weight, data=fuel.frame)), B=500, seed=0)
The former invokes the lm method for bootstrap, which is faster. Other modeling methods for bootstrap include and . See , etc. for more details. For consistency the above syntax is allowed for other model fit objects (which have a call component or attribute whose call contains a data argument). See Examples below.

EXAMPLES:

# Bootstrap a mean; demonstrate summary(), plot(), qqnorm() 
bootstrap(stack.loss, mean) 
temp <- bootstrap(stack.loss, mean) 
temp 
summary(temp) 
plot(temp) 
qqnorm(temp) 
 
# Percentiles of the distribution 
limits.percentile(temp) 
 
# Confidence intervals 
limits.bca(temp) 
limits.bca(temp,detail=T) 
limits.tilt(temp) 
 
# Here the "statistic" argument is an expression, not a function. 
stack <- cbind(stack.loss, stack.x) 
bootstrap(stack, l1fit(stack[,-1], stack[,1])$coef, seed=0) 
 
# Again, but if the data is created on the fly, then 
# use the name "data" in the statistic expression: 
bootstrap(cbind(stack.loss, stack.x), 
          l1fit(data[,-1], data[,1])$coef, seed=0) 
temp <- bootstrap(stack, var)  # Here "statistic" is a function. 
parallel(~ temp$replicates)     # Interesting trellis plot. 
 
# Demonstrate the args.stat argument 
#   without args.stat: 
bootstrap(stack.loss, mean(stack.loss, trim=.2)) 
 
#   statistic is a function: 
bootstrap(stack.loss, mean, args.stat = list(trim=.2)) 
 
#   statistic is an expression, object "h" defined in args.stat 
bootstrap(stack.loss, mean(stack.loss, trim=h), 
          args.stat = list(h=.2)) 
 
# Bootstrap regression coefficients (in 3 equivalent ways). 
fit.lm <- lm(Mileage ~ Weight, fuel.frame) 
bootstrap(fuel.frame, coef(lm(Mileage ~ Weight, fuel.frame))) 
bootstrap(fuel.frame, coef(eval(fit.lm$call))) 
bootstrap(fit.lm, coef) 
 
# Bootstrap a nonlinear least squares analysis 
fit.nls <- nls(vel ~ (Vm * conc)/(K + conc), Puromycin, 
               start = list(Vm = 200, K = 0.1)) 
temp.nls <- bootstrap(Puromycin, coef(eval(fit.nls$call)), B=1000) 
pairs(temp.nls$rep) 
plot(temp.nls$rep[,1], temp.nls$rep[,2]) 
contour(hist2d(temp.nls$rep[,1], temp.nls$rep[,2])) 
image(hist2d(temp.nls$rep[,1], temp.nls$rep[,2])) 
 
# Jackknife after bootstrap 
jackknifeAfterBootstrap(temp.nls) 
jackknifeAfterBootstrap(temp.nls, stdev) 
 
# Bootstrap the calculation of a covariance matrix 
my.x <- runif(2000) 
my.dat <- cbind(x=my.x, y=my.x+0.5*rnorm(2000)) 
bootstrap(my.dat, var, B=1000) 
 
# Perform a jackknife analysis. 
jackknife(stack.loss, mean) 
 
# Two-sample problems 
# Bootstrap the distribution of the difference of two group means 
#  (group sizes vary across bootstrap samples) 
West <- (as.character(state.region) == "West") 
Income <- state.x77[,"Income"] 
bootstrap(data.frame(Income, West), 
          mean(data[ data[,"West"],"Income"]) - 
          mean(data[!data[,"West"],"Income"])) 
 
# Two-sample problem, using the group argument 
# (resampling is done separately within "West" and "not West", so 
#  group sizes are constant across bootstrap samples) 
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West) 
 
# Passing arguments to the sampler (same argument for every group) 
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West, 
          sampler = samp.bootstrap(size = 100), group.order.matters = F) 
 
# Passing arguments to the sampler (arguments vary by group) 
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West, 
          sampler.args.group = list("TRUE"=list(size = 100), 
                                    "FALSE"=list(size = 50)), 
          group.order.matters = F) 
 
 
# Different sampling mechanisms
 
# Permutation distribution for the difference in two group means, 
#  under the hypothesis of one population. 
# Note that either the group or response variable is permuted, not 
# both. 
bootObj <- bootstrap(Income, sampler = samp.permute, 
                     mean(Income[West])-mean(Income[!West])) 
1 - mean(bootObj$replicates < bootObj$observed)  # one-sided p-value 
 
# Balanced bootstrap 
bootstrap(stack.loss, mean, sampler=samp.boot.bal) 
 
# Bootstrapping unadjusted residuals in lm (2 equivalent ways) 
fit.lm <- lm(Mileage~Weight, fuel.frame) 
resids <- resid(fit.lm) 
preds  <- predict(fit.lm) 
bootstrap(resids, lm(resids+preds~fuel.frame$Weight)$coef, B=500, seed=0) 
bootstrap(fit.lm, coef, lmsampler="resid", B=500, seed=0) 
 
# Bootstrapping other model fit objects: gam 
fit.gam <-gam(Kyphosis ~ s(Age,4) + Number, family = binomial, 
              data = kyphosis) 
bootstrap(fit.gam, coef, B=100) 
 
# Bootstrap when patients have varying number of cases. 
DF <- data.frame(ID=rep(101:103, c(4,5,6)), x=1:15) 
DF  # Patient 101 has 4 cases, 102 has 5, 103 has 6. 
bootstrap(DF, mean(DF$x), subject=ID) 
 
## Importance sampling 
# 
importanceSampling <- function(data, statfun, B=1000){ 
  # Returns a list of arguments (B, sampler.prob and L)  
  # suitable for using importance sampling during bootstrap. 
  # 20% of samples with equal probabilities and 40% each from the  
  # left and right-tilted distributions.  The tilted distributions 
  # are centered at the .025 and .975 quantiles of the original data.  
  L <- influence(data, statfun)$L 
  tau <- saddlepointPSolve(probs=c(.025, .975), L) 
  weights1 <- tiltWeights(tau[1], L) 
  weights2 <- tiltWeights(tau[2], L) 
  list(B = c(.2, .4, .4)*B,  
       sampler.prob = list(NULL, weights1, weights2),  
       L = L) 
} 
set.seed(3) 
x <- rmvnorm(40, d=2, rho=.5) 
statfun <- function(x, weights = NULL)  
  cor(x[,1], x[,2], weights = weights) 
bootstrap(x, statfun, argumentList = importanceSampling(x, statfun)) 
   
 
## Run in background 
For(1, temp <- bootstrap(stack.loss, mean, B=1000), wait=F)