bootstrap
;
the same arguments are used in many other resampling functions.
A shorter description, omitting some arguments, is found in
.
bootstrap(data, statistic, B = 1000, args.stat, group, subject, sampler = samp.bootstrap, seed = .Random.seed, sampler.prob, sampler.args, sampler.args.group, resampleColumns, label, statisticNames, block.size = min(100,B), trace = resampleOptions()$trace, assign.frame1 = F, save.indices = <<see below>>, save.group = <<see below>>, save.subject = <<see below>>, statistic.is.random, group.order.matters = T, order.matters, seed.statistic = 500, L = NULL, model.mat, argumentList, observed.indices = 1:n, ...)
data
is an ordinary matrix
or vector rather than a data frame.
May also be the output of a modeling function like
lm
;
see Details below.
The statistic may be a function (e.g.
mean
) which
accepts data as the first unmatched argument; other arguments may
be passed to the function through
args.stat
, e.g.
args.stat=list(trim=.2)
.
By "unmatched", we mean the first argument that is not given by
name in
args.stat
. E.g. if your function is
f(a,b,...)
and you specify
args.stat=list(a=3)
then the data is passed as the
b
argument to your function.
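The "first unmatched argument" rule can be sketched with ordinary argument matching. This is a minimal illustration in R-compatible S, not the library's code; the helper name callStatistic is hypothetical.

```r
# Hypothetical helper (not part of the resampling library): call `statistic`
# with the data as the first argument that args.stat does not name.
callStatistic <- function(statistic, data, args.stat = list()) {
  # An unnamed argument is matched to the first formal argument
  # that has not already been matched by name.
  do.call(statistic, c(list(data), args.stat))
}

f <- function(a, b, ...) list(a = a, b = b)
res <- callStatistic(f, data = c(1, 2, 3), args.stat = list(a = 3))
# res$a comes from args.stat; res$b receives the data.

# The same mechanism passes trim through to mean().
trimmed <- callStatistic(mean, c(1, 2, 3, 100), args.stat = list(trim = 0.25))
```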
Or the statistic may be an expression such as
mean(x, trim=.2)
.
If the
data
object has a simple name (e.g.
data=x
)
then use that name (
"x"
)
in the expression,
otherwise (e.g.
data=df$y
) use the name
"data"
in the expression,
e.g.
mean(data, trim=.2)
.
If
data
is a data frame, the expression may involve variables
in the data frame,
e.g.
data=air,statistic=mean(ozone/wind)
.
The following types of expressions are not allowed: an expression
that returns a function or a function name (e.g.
statistic = object$fun
, where
object$fun
contains the function
function(x) mean(x)
or the name
mean
); an expression
that returns an expression (e.g.
statistic = object$stat
, where
object$stat
contains the expression
mean(x, trim=.2)
). On the other
hand,
statistic
may be the name of a function or an expression. For
example,
statistic = fun
, where
fun
contains
function(x) mean(x)
, or
statistic = stat.expr
, where
stat.expr
contains
mean(x, trim=.2)
.
B[1]
samples are generated, then the next
B[2]
, and so on; also see the
sampler.prob
argument.
If
statistic
is a function, a
list of other arguments, if any, to pass to
statistic
when calculating
the statistic on the resamples.
The names of the list are used as argument names.
If
statistic
is an expression, then
args.stat
is a list of objects that should
be included in the frame where the expression is evaluated;
names of the list are used as object names.
For example,
statistic=mean(x,trim=alpha),args.stat=list(alpha=alphaVector[i])
indicates that
alpha
is given the value of
alphaVector[i]
in a place it can be found when the statistic is evaluated.
data
, for
stratified sampling or multiple-sample problems.
Sampling is done separately for each group
(determined by unique values of this vector),
and indices are combined to create a full resample.
The statistic is calculated for the resample as a whole.
If
data
is a data frame, this may be a variable in the data frame,
or an expression involving such variables,
e.g.
data=lung, group=sex
or
data=lung, group=age<50
.
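Stratified resampling as described here (sample within each group, then combine the indices) might look like the following sketch in R-compatible S. The helper stratifiedIndices is hypothetical and uses only base functions; it is not the library's sampler.

```r
# Hypothetical helper: draw one stratified bootstrap resample of row indices.
stratifiedIndices <- function(group) {
  idx <- split(seq_along(group), group)   # row indices for each group
  # Resample within each group, keeping each group's original size.
  resampled <- lapply(idx, function(i) i[sample.int(length(i), replace = TRUE)])
  unlist(resampled, use.names = FALSE)    # combine into one full resample
}

set.seed(1)
group <- c("a", "a", "a", "b", "b")
x <- c(10, 11, 12, 50, 60)
ind <- stratifiedIndices(group)

# The statistic is computed on the resample as a whole.
xs <- x[ind]
gs <- group[ind]
stat <- mean(xs[gs == "a"]) - mean(xs[gs == "b"])
```

Because each group is resampled separately, the group sizes in every resample equal the original group sizes.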
data
;
if present then subjects
(determined by unique values of this vector) are resampled rather than
individual observations.
If
data
is a data frame, this may be a variable in the data frame,
or an expression involving such variables.
If
group
is also present this must be nested within
group
(a single subject may not be present in multiple groups).
If
subject
is the name
of a variable in the data frame
(for example
data=Orthodont, subject=Subject
),
then
bootstrap
makes
resampled subjects unique; that is, duplicated subjects in a
given resample are assigned distinct
subject
values in the resampled
data frame before the
statistic is evaluated; this is useful for longitudinal and other
modeling where the statistic expects subjects to have unique values.
Unique subject values are not assigned if
subject
is not a variable in the
data frame, or if the
subject
variable is not referred to solely by name
(e.g.
subject=Orthodont$Subject
,
subject=Orthodont[,3]
, or
subject=Orthodont[,"Subject"]
).
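A rough sketch of resampling by subject, with duplicated subjects relabeled so that they are distinct, is given below in R-compatible S. The helper resampleSubjects and its labeling scheme (1, 2, 3, ...) are assumptions made for illustration; the labels the library assigns may differ.

```r
# Hypothetical helper: resample whole subjects and give each draw a
# distinct subject label, even when the same subject is drawn twice.
resampleSubjects <- function(df, subjectName) {
  subj <- df[[subjectName]]
  ids <- sort(unique(subj))
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(k) {
    piece <- df[subj == drawn[k], , drop = FALSE]  # all rows for that subject
    piece[[subjectName]] <- k                      # distinct label per draw
    piece
  })
  do.call(rbind, pieces)
}

# Patient 101 has 4 cases, 102 has 5, 103 has 6.
DF <- data.frame(ID = rep(101:103, c(4, 5, 6)), x = 1:15)
set.seed(2)
res <- resampleSubjects(DF, "ID")
```

The number of rows varies between resamples because subjects contribute different numbers of observations.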
sampler
may also be an expression such as
samp.bootstrap(size = 100)
for setting optional arguments to the sampler. Arguments set in
this way override those set by
sampler.args
.
If you do this, do not include the
n
and
B
arguments to the sampler;
they are generated automatically.
NULL
, vector of probabilities of length
n
(the number of
observations or subjects), or
list of the same length as
B
, each of whose elements
is
NULL
or a vector of length
n
; the
j
th element of
this list is used for
B[j]
samples.
This argument is used to do importance sampling.
Sampling is done with specified probabilities,
but
bootstrap
will also create a vector
weights
which is used when computing estimates (mean, bias,
estimates, quantiles, etc.) to counteract
the importance sampling bias.
The result is that all estimates are for a
target distribution of equal-probability sampling.
In the long run you'll get the same results using importance sampling
as with equal-probability sampling; in the short term there may
be less Monte Carlo variability, with appropriately chosen probabilities.
To get estimates for other target distributions (if you want bootstrap
distributions that correspond to weighted empirical distributions) use
,
as a post-processing step; this may be done whether or not
you specified
sampler.prob
.
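The idea behind the weights can be sketched with the standard importance sampling likelihood ratio: each resample is weighted by the probability it would have under equal-probability sampling divided by its probability under the sampling actually used. The code below is R-compatible S with base functions only; the probabilities p are made up for illustration, and the library may compute or normalize its weights differently.

```r
set.seed(3)
x <- c(2, 4, 6, 8, 10)
n <- length(x)
p <- c(0.1, 0.1, 0.2, 0.3, 0.3)   # hypothetical sampling probabilities
B <- 2000

# Each column holds the indices of one resample drawn with probabilities p.
ind <- matrix(sample.int(n, n * B, replace = TRUE, prob = p), nrow = n)
replicates <- colMeans(matrix(x[ind], nrow = n))

# Weight = product over the n draws of (1/n) / p[i]: the ratio of the
# equal-probability target to the design that generated the resample.
w <- exp(colSums(matrix(log((1 / n) / p[ind]), nrow = n)))

# The weighted estimate targets equal-probability sampling;
# the unweighted average is pulled toward the heavily sampled values.
weightedMean <- sum(w * replicates) / sum(w)
unweightedMean <- mean(replicates)
```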
sampler
.
An alternative to passing
sampler.args
is to give the arguments
when calling
sampler
; see above.
sampler
for that group. The
list
sampler.args.group
may be named, in which case the names must
match the unique values of argument
group
. Otherwise the list is
assumed to be ordered with respect to the sorted, unique values of
group
. Arguments
sampler.args
and
sampler.args.group
may be
used simultaneously, in which case the values from
sampler.args.group
take precedence.
This is ignored if not sampling by
group
.
Suppose you are doing stratified sampling, say with strata sizes 50
and 70, and that you want bootstrap samples of size 49 and 69
(to avoid downward bias in standard errors). You may do this
using
sampler.args.group = list(list(size=49), list(size=69))
.
Alternately, you may use
sampler.args=list(reduceSize=1)
.
bootstrap
uses nested
for()
loops:
an outermost loop if
B
is a vector,
a loop over blocks,
in which all indices for
block.size
resamples are generated simultaneously,
and an inner loop in which the statistic is calculated for each resample.
The tradeoff is that if
block.size*n
is too large then the matrix
of resampling indices may be large, while if
block.size
is small
then random number generators are called more often, which entails
extra overhead.
The
block.size
argument also affects the quality of some samplers.
For example,
with balanced bootstrapping,
balancing is done separately within each block of resamples.
This is biased, of order O(1/
block.size
), so increasing
block.size
reduces the bias.
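The block structure can be pictured with the following sketch in R-compatible S. The function blockedBootstrap is a hypothetical stand-in that handles only a scalar B and a vector of data; it is meant to show the trade-off, not to reproduce the library's loops.

```r
# Hypothetical sketch of the loop over blocks and the inner loop over resamples.
blockedBootstrap <- function(x, statistic, B = 1000, block.size = min(100, B)) {
  n <- length(x)
  replicates <- numeric(B)
  done <- 0
  while (done < B) {                       # loop over blocks
    nb <- min(block.size, B - done)
    # One call to the random number generator yields an n x nb matrix
    # of indices: larger blocks mean fewer calls but a bigger matrix.
    ind <- matrix(sample.int(n, n * nb, replace = TRUE), nrow = n)
    for (j in seq_len(nb)) {               # inner loop over resamples
      replicates[done + j] <- statistic(x[ind[, j]])
    }
    done <- done + nb
  }
  replicates
}

set.seed(4)
reps <- blockedBootstrap(c(3, 1, 4, 1, 5, 9, 2, 6), mean, B = 250, block.size = 100)
```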
bootstrap
estimates are identical, try setting
assign.frame1=T
.
For examples where this is necessary see
.
Note that this slows down the algorithm, and may cause memory use to grow.
2
indicating to return a compressed version of the indices,
or
NULL
(the default) indicating to decide based on
n
and
B
.
If not saved these can generally be recreated, by
.
Saving them speeds up some later calculations such as for
and
that need these indices.
By default the indices are saved if
n*B <= 20,000
, and a compressed
version (about 16 times smaller) if
n*B <= 500,000
.
The compressed version is based on frequencies and loses information
about the order that observations appear in bootstrap samples, so
should be avoided if your statistic depends on the order of the data.
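The default rule and the effect of compression can be sketched as follows in R-compatible S. Both helpers are hypothetical, and the labels "full", "compressed", and "none" are illustrative rather than values the library uses; the actual compressed storage format is not shown here.

```r
# Hypothetical helper: which form of the indices the default rule would keep.
indexStorage <- function(n, B) {
  if (n * B <= 20000) "full"
  else if (n * B <= 500000) "compressed"
  else "none"
}

# Frequency-based compression: count how often each observation
# appears in one resample.
compressIndices <- function(ind, n) tabulate(ind, nbins = n)

a <- c(3, 1, 3, 2, 3)
b <- c(3, 3, 3, 1, 2)   # same observations as a, in a different order
freqA <- compressIndices(a, 5)
freqB <- compressIndices(b, 5)
```

The two resamples compress to the same frequencies, which is why the order of observations cannot be recovered from the compressed form.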
group
vector. Default
is
TRUE
if number of observations is
<= 10000
,
FALSE
otherwise.
If not saved these can generally be recreated if needed, by
.
subject
vector. Default
is
TRUE
if number of observations is
<= 10000
,
FALSE
otherwise.
If not saved these can generally be recreated if needed, by
.
data
and
group.order.matters = T
, then
data for that group occupy those rows in each resample.
Note that if you want group sample sizes that differ from the
original group sizes, you need
group.order.matters = F
(see Examples, below).
Ignored if not sampling by
group
.
NULL
or
FALSE
for the ordinary bootstrap.
If
TRUE
or character such as
"resampling residuals"
,
then the order of observations matters,
and some functions such as
and
that are only for the ordinary bootstrap are disabled;
the character string is printed.
This is set by
when resampling residuals.
n
rows (number of observations or subjects in
data
)
and
p
columns (length of the returned statistic).
Or it may be a string, one of
"jackknife"
,
"influence"
,
"regression"
,
"ace"
, or
"choose"
; the influence function values
are then calculated using the corresponding method; see
and
.
L="choose"
corresponds to calling
with
method=NULL
.
The default
L=NULL
corresponds to not computing influence values.
Influence values are used by
a variety of downstream functions, and can be created as needed if not
stored initially.
If
subject
is supplied and
L
is numerical, the rows of
L
should
correspond to the sorted unique values of
subject
.
In the case of
bootstrap
when the statistic is either
mean
or
colMeans
, there are no additional arguments (like
na.rm
or
trim
),
and
subject
is not used,
if
L
is NULL it is set equal to the data.
n
rows, one for each observation or subject,
or a formula that defines such a model matrix.
If supplied, and
L
is either
"regression"
or
"ace"
,
then this is used when calculating influence function values;
see
.
data
,
statistic
,
group
, and
subject
may be specified in this list, and
their values override the values set by their regular placement in the
argument list.
data
. For hierarchical data, the
indices are applied to the sorted values of
subject
. The default is
to use all observations or subjects.
bootstrap
.
Currently only the
lm
method has an extra argument,
lmSampler
.
assign.frame1=T
, you must be sure that this assignment does not
overwrite some quantity of interest stored in frame 1.
The function causes creation of the dataset
.Random.seed
if it does
not already exist; otherwise its value is updated.
If
statistic
is an expression, then
bootstrap
does
eval(statistic, local=list(dataName = (resampled data)))
where
dataName
is either
"data"
or the name of the original data object.
If
args.stat
is supplied, it should be a list, and
eval
is called with
local=c(list(dataName = (resampled data)), args.stat)
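The evaluation step above can be imitated in R-compatible S as follows. The helper evalStatistic is hypothetical. It passes the list as the second argument of eval by position, because the argument name differs between dialects (local in the text above, envir in R).

```r
# Hypothetical helper: evaluate a statistic expression with the resampled
# data bound to dataName and the elements of args.stat bound to their names.
evalStatistic <- function(statistic, dataName, resampled, args.stat = list()) {
  frame <- c(structure(list(resampled), names = dataName), args.stat)
  eval(statistic, frame)
}

set.seed(5)
x <- c(1, 2, 3, 4, 100)
resampled <- x[sample.int(length(x), replace = TRUE)]

# The data object has a simple name, so the expression refers to "x";
# alpha is supplied through args.stat.
value <- evalStatistic(quote(mean(x, trim = alpha)), "x", resampled,
                       args.stat = list(alpha = 0.2))

# Data created on the fly are referred to as "data" in the expression.
value2 <- evalStatistic(quote(mean(data)), "data", c(1, 2, 3))
```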
Special methods are used when
data
is the output from a modeling
function like
.
In this case, the special syntax
bootstrap(fit, statistic)
is allowed, where
fit
is the output of a modeling function. For example
fit.lm <- lm(Mileage~Weight, data=fuel.frame)
bootstrap(fit.lm, coef, B=500, seed=0)
The results are identical to
bootstrap(fuel.frame, coef(lm(Mileage~Weight, data=fuel.frame)), B=500, seed=0)
The former invokes the
method for
bootstrap
, which is faster.
Other modeling methods for bootstrap include
and
.
See
,
etc. for more details.
For consistency the above syntax is
allowed for other model fit objects (which have
a
call
component or attribute
whose call contains a
data
argument). See Examples below.
# Bootstrap a mean; demonstrate summary(), plot(), qqnorm()
bootstrap(stack.loss, mean)
temp <- bootstrap(stack.loss, mean)
temp
summary(temp)
plot(temp)
qqnorm(temp)

# Percentiles of the distribution
limits.percentile(temp)

# Confidence intervals
limits.bca(temp)
limits.bca(temp,detail=T)
limits.tilt(temp)

# Here the "statistic" argument is an expression, not a function.
stack <- cbind(stack.loss, stack.x)
bootstrap(stack, l1fit(stack[,-1], stack[,1])$coef, seed=0)

# Again, but if the data is created on the fly, then
# use the name "data" in the statistic expression:
bootstrap(cbind(stack.loss, stack.x), l1fit(data[,-1], data[,1])$coef, seed=0)

temp <- bootstrap(stack, var)  # Here "statistic" is a function.
parallel(~ temp$replicates)    # Interesting trellis plot.

# Demonstrate the args.stat argument
# without args.stat:
bootstrap(stack.loss, mean(stack.loss, trim=.2))
# statistic is a function:
bootstrap(stack.loss, mean, args.stat = list(trim=.2))
# statistic is an expression, object "h" defined in args.stat
bootstrap(stack.loss, mean(stack.loss, trim=h), args.stat = list(h=.2))

# Bootstrap regression coefficients (in 3 equivalent ways).
fit.lm <- lm(Mileage ~ Weight, fuel.frame)
bootstrap(fuel.frame, coef(lm(Mileage ~ Weight, fuel.frame)))
bootstrap(fuel.frame, coef(eval(fit.lm$call)))
bootstrap(fit.lm, coef)

# Bootstrap a nonlinear least squares analysis
fit.nls <- nls(vel ~ (Vm * conc)/(K + conc), Puromycin,
               start = list(Vm = 200, K = 0.1))
temp.nls <- bootstrap(Puromycin, coef(eval(fit.nls$call)), B=1000)
pairs(temp.nls$rep)
plot(temp.nls$rep[,1], temp.nls$rep[,2])
contour(hist2d(temp.nls$rep[,1], temp.nls$rep[,2]))
image(hist2d(temp.nls$rep[,1], temp.nls$rep[,2]))

# Jackknife after bootstrap
jackknifeAfterBootstrap(temp.nls)
jackknifeAfterBootstrap(temp.nls, stdev)

# Bootstrap the calculation of a covariance matrix
my.x <- runif(2000)
my.dat <- cbind(x=my.x, y=my.x+0.5*rnorm(2000))
bootstrap(my.dat, var, B=1000)

# Perform a jackknife analysis.
jackknife(stack.loss, mean)

# Two-sample problems
# Bootstrap the distribution of the difference of two group means
# (group sizes vary across bootstrap samples)
West <- (as.character(state.region) == "West")
Income <- state.x77[,"Income"]
bootstrap(data.frame(Income, West),
          mean(data[ data[,"West"],"Income"]) -
          mean(data[!data[,"West"],"Income"]))

# Two-sample problem, using the group argument
# (resampling is done separately within "West" and "not West", so
# group sizes are constant across bootstrap samples)
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West)

# Passing arguments to the sampler (same argument for every group)
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West,
          sampler = samp.bootstrap(size = 100), group.order.matters = F)

# Passing arguments to the sampler (arguments vary by group)
bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West,
          sampler.args.group = list("TRUE"=list(size = 100),
                                    "FALSE"=list(size = 50)),
          group.order.matters = F)

# Different sampling mechanisms
# Permutation distribution for the difference in two group means,
# under the hypothesis of one population.
# Note that either the group or response variable is permuted, not
# both.
bootObj <- bootstrap(Income, sampler = samp.permute,
                     mean(Income[West])-mean(Income[!West]))
1 - mean(bootObj$replicates < bootObj$observed)  # one-sided p-value

# Balanced bootstrap
bootstrap(stack.loss, mean, sampler=samp.boot.bal)

# Bootstrapping unadjusted residuals in lm (2 equivalent ways)
fit.lm <- lm(Mileage~Weight, fuel.frame)
resids <- resid(fit.lm)
preds <- predict(fit.lm)
bootstrap(resids, lm(resids+preds~fuel.frame$Weight)$coef, B=500, seed=0)
bootstrap(fit.lm, coef, lmsampler="resid", B=500, seed=0)

# Bootstrapping other model fit objects: gam
fit.gam <- gam(Kyphosis ~ s(Age,4) + Number, family = binomial,
               data = kyphosis)
bootstrap(fit.gam, coef, B=100)

# Bootstrap when patients have varying number of cases.
DF <- data.frame(ID=rep(101:103, c(4,5,6)), x=1:15)
DF  # Patient 101 has 4 cases, 102 has 5, 103 has 6.
bootstrap(DF, mean(DF$x), subject=ID)

## Importance sampling
#
importanceSampling <- function(data, statfun, B=1000){
  # Returns a list of arguments (B, sampler.prob and L)
  # suitable for using importance sampling during bootstrap.
  # 20% of samples with equal probabilities and 40% each from the
  # left and right-tilted distributions. The tilted distributions
  # are centered at the .025 and .975 quantiles of the original data.
  L <- influence(data, statfun)$L
  tau <- saddlepointPSolve(probs=c(.025, .975), L)
  weights1 <- tiltWeights(tau[1], L)
  weights2 <- tiltWeights(tau[2], L)
  list(B = c(.2, .4, .4)*B,
       sampler.prob = list(NULL, weights1, weights2),
       L = L)
}
set.seed(3)
x <- rmvnorm(40, d=2, rho=.5)
statfun <- function(x, weights = NULL) cor(x[,1], x[,2], weights = weights)
bootstrap(x, statfun, argumentList = importanceSampling(x, statfun))

## Run in background
For(1, temp <- bootstrap(stack.loss, mean, B=1000), wait=F)