Sections in this file:
Randomness
Non-functional statistics
Mismatch in number of observations
Scoping problems (this has a number of subsections)
Intermittent Statistic Failure (two subsections)
Randomness

Bootstrap results are random, and the results depend on the random number
seed used, the number of blocks used, the order of the data, and the names
of groups and subjects (the sorted names determine the order in which
sample indices are drawn). These factors normally cause only small
differences in bootstrap results, which you can reduce further by
increasing the number B of bootstrap replications.
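As an illustration, fixing the seed makes the resamples reproducible. This
is a sketch, assuming an arbitrary numeric vector x and the replicates
component of the returned object:

```
x <- 1:20
boot1 <- bootstrap(x, mean, B = 1000, seed = 0)
boot2 <- bootstrap(x, mean, B = 1000, seed = 0)
# Same seed, so the same resamples are drawn:
all.equal(boot1$replicates, boot2$replicates)   # TRUE
```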
Non-functional statistics

Many of the resampling functions implicitly assume that the statistic is
"functional" -- that it depends only on the empirical distribution
(assuming equal probabilities on all observations), not on additional
information such as the sample size. A functional statistic would return
the same value if all observations were repeated the same number of times.
Examples of statistics that are not functional include modeling functions
that use smoothing parameters that depend on n, and var() when called
without weights and with unbiased=T.

You should exercise care when calling a non-functional statistic, as the
assumptions underlying the resampling methods may be violated. Also note
that some functions assign weights, which may change the behavior of the
statistic. For example, var() is normally not functional, but can be made
so by specifying unbiased=F or by supplying weights:

    var(1:5)                              # 2.5 -- not functional by default
    var(rep(1:5,2))                       # 2.22222
    var(1:5, unbiased=F)                  # 2 -- the functional version
    var(rep(1:5,2), unbiased=F)           # 2
    var(1:5, weights=rep(1/5,5))          # 2 -- weights force functional version
    var(rep(1:5,2), weights=rep(1/10,10)) # 2

Some resampling functions add weights when calling the statistic (e.g.
influence); others do not (bootstrap and jackknife).
    bootstrap(1:5, var, B=3)$observed   # 2.5
    jackknife(1:5, var, B=3)$observed   # 2.5
    influence(1:5, var)$observed        # 2 -- calculated with weights

The results are self-consistent within each function, because both the
observed value and all replicates are computed the same way (with or
without weights). However, results differ across functions, both the
observed values and other output quantities -- jackknife indicates that
var is unbiased (which is true for the default calculation for var
without weights, if the data are independent and identically distributed),
while influence indicates that it is biased (which is true for the
functional form of var).
Mismatch in number of observations

This problem affects jackknife and functions that call jackknife,
including limits.bca and summary. If some but not all vectors used by the
statistic are contained in the data argument, then only the vectors
included in data have observations omitted.
    West <- (state.region == "West")
    Income <- state.x77[,"Income"]
    # jackknife(Income, mean(Income[West])-mean(Income[!West]))   # fails

That jackknife call fails. bootstrap works, but limits.bca or summary will
fail:
    bs <- bootstrap(Income, mean(Income[West])-mean(Income[!West]),
                    group = West)
    # limits.bca(bs)   # fails
    # summary(bs)      # fails

One workaround is to include all vectors in the data. For example:

    myData <- data.frame(West = (state.region == "West"),
                         Income = state.x77[,"Income"])
    jackknife(myData, mean(Income[West])-mean(Income[!West]))

Alternately, to make summary and limits.bca work after bootstrapping, you
may avoid jackknife by using another method to calculate L (which is used
to calculate acceleration). Incidentally, this also makes bootstrapping
the difference in two means much easier.
Scoping problems

This refers to the general problem of a function not finding data because
the data is not in the search path for that function. The problems below
are illustrated for bootstrap, but also apply to jackknife and possibly
other resampling functions.
Bootstrap or jackknife may fail when called from within another function. Here is an example.
    # Scoping problem when bootstrap is called from a function,
    # and statistic is an expression that uses objects defined in
    # the function.
    fun1 <- function(){
      x <- 1:20
      fun2 <- function(x) mean(x, trim=.2)
      bootstrap(data = x, statistic = fun2(x))
    }
    fun1()

This fails with the message that function fun2 (which is defined in the
frame of fun1) could not be found. One workaround is to assign fun2 to
frame 1 before calling bootstrap:
    fun1 <- function() {
      x <- 1:20
      fun2 <- function(x) mean(x, trim=.2)
      assign("fun2", fun2, frame = 1)   # workaround
      bootstrap(data = x, statistic = fun2(x))
    }
    fun1()

One additional caution -- if you don't use the workaround, and fun2 is
defined on a permanent database, then the permanent copy is used (not the
copy inside fun1).
This scoping problem occurs when the argument statistic is an expression.
When statistic is a name or a function, it is not necessary to assign to
frame 1 in most cases. An alternate workaround is therefore to replace
"statistic = fun2(x)" by "statistic = fun2".
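Applied to the first example above, that alternate workaround looks like
this (a sketch; the only change is that statistic is now a name, not a
call):

```
fun1 <- function() {
  x <- 1:20
  fun2 <- function(x) mean(x, trim = .2)
  bootstrap(data = x, statistic = fun2)   # a name, not an expression
}
fun1()
```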
An exception occurs when bootstrapping an lm object:

    fun1 <- function() {
      x <- 1:20
      set.seed(0)
      y <- sort(runif(20))
      fit <- lm(y~x)
      fun2 <- function(x) coef(x)
      bootstrap(data = fit, statistic = fun2)
    }
    fun1()

This also fails, even though statistic is a name. The hidden behavior here
is that the lm method for bootstrap converts statistic to an expression
before calling the default method for bootstrap. Assigning fun2 to frame 1
overcomes the error.
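For the lm example, the frame-1 workaround is the same as before (a
sketch):

```
fun1 <- function() {
  x <- 1:20
  set.seed(0)
  y <- sort(runif(20))
  fit <- lm(y ~ x)
  fun2 <- function(x) coef(x)
  assign("fun2", fun2, frame = 1)   # workaround
  bootstrap(data = fit, statistic = fun2)
}
fun1()
```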
A similar scoping problem can occur with the data argument, when data is a
fitted model object defined in a frame other than the one containing the
call to bootstrap. Here is an example using lm.
    fun1 <- function() {
      x <- 1:20
      set.seed(0)
      y <- sort(runif(20))
      fit <- lm(y~x)
      fun2 <- function(fit) bootstrap(fit, coef, B=100)
      fun2(fit)
    }
    fun1()

Here bootstrap is called from fun2, but the objects x and y are defined in
the frame of fun1. While they do not appear directly in the call to
bootstrap, they are used indirectly.
So again one workaround is to assign x and y to frame 1 before calling
fun2.
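That workaround might look like this (a sketch):

```
fun1 <- function() {
  x <- 1:20
  set.seed(0)
  y <- sort(runif(20))
  fit <- lm(y ~ x)
  assign("x", x, frame = 1)   # workaround: make x and y visible
  assign("y", y, frame = 1)
  fun2 <- function(fit) bootstrap(fit, coef, B = 100)
  fun2(fit)
}
fun1()
```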
Another solution is to pass x and y to fun2:

    fun1 <- function() {
      x <- 1:20
      set.seed(0)
      y <- sort(runif(20))
      fit <- lm(y~x)
      fun2 <- function(fit, x, y){
        # make sure that copies of x and y are defined in fun2
        x <- x; y <- y
        bootstrap(fit, coef, B=100)
      }
      fun2(fit, x, y)
    }
    fun1()
Consider the following example using jackknife.
    jackknife(data = kyphosis,
              statistic = coef(glm(Kyphosis ~ kyphosis[[3]],
                                   data = kyphosis)))

This causes the error

    Problem in eval(statistic, c(list(kyphosis = data), ..:
    Length of kyphosis[[3]] (variable 2) is 81 != length of others (80)

This known bug in Spotfire S+ occurs during the evaluation of the
statistic by jackknife. There are two different types of references to the
data in the formula argument to glm. The first variable, Kyphosis, is the
name of the first variable in the kyphosis data set. The second variable,
kyphosis[[3]], references the third column of kyphosis by index rather
than name. It is the latter which is mishandled -- the third column of the
full dataset is used rather than that of the jackknifed (one-row-deleted)
data.

There are several workarounds. One is to use variable names rather than
column indices: the third column of kyphosis is "Number", so we can obtain
the desired results using

    jackknife(data = kyphosis,
              statistic = coef(glm(Kyphosis ~ Number, data = kyphosis)))
Another workaround is to avoid using the data argument to glm (it is
redundant, since the data argument to jackknife specifies the data to be
used).

    jackknife(data = kyphosis,
              statistic = coef(glm(Kyphosis ~ kyphosis[[3]])))

Another workaround is to use assign.frame1 = T, to put the jackknifed
version of the data on frame 1, where it hides the original.

    jackknife(data = kyphosis, assign.frame1 = T,
              statistic = coef(glm(Kyphosis ~ kyphosis[[3]],
                                   data = kyphosis)))

A final workaround is to use the new glm method for jackknife, which has
the additional benefit of generally executing faster.
    jackknife(data = glm(Kyphosis ~ kyphosis[[3]], data = kyphosis),
              statistic = coef)

The same problem occurs with bootstrap, though it is more insidious. The
data in the call

    boot.obj <- bootstrap(data = kyphosis,
                          statistic = coef(glm(Kyphosis ~ kyphosis[[3]],
                                               data = kyphosis)),
                          B = 100, seed = 0)

is mishandled in the same way as the original jackknife example. But there
is no error message since, for bootstrap, the original data and the
resampled data are both the same size. The results, however, are wrong. An
attempt to use summary on boot.obj,

    summary(boot.obj)

produces the same jackknife-type error message we got before, because
jackknife is called by summary to evaluate BCa limits. It is important to
remember, however, that boot.obj contains incorrect results, independent
of any future call to summary. Any of the above workarounds for jackknife
can be used to get correct results for bootstrap.
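For example, applying the first workaround (variable names rather than
column indices) to the bootstrap call gives correct results:

```
boot.obj <- bootstrap(data = kyphosis,
                      statistic = coef(glm(Kyphosis ~ Number,
                                           data = kyphosis)),
                      B = 100, seed = 0)
summary(boot.obj)   # now works
```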
In influence, resampGetL, limits.tilt, and limits.abc
This is another manifestation of the known Spotfire S+ scoping bug
mentioned above. For example,
    xy <- data.frame(x = 1:10, y = sort(runif(10)))
    influence(xy, coef(lm(y~x, data = xy)))   # fails

This fails with a message about not finding weights. As with the jackknife
bugs in the previous section, the data argument to lm is redundant here,
and the error disappears if we get rid of it.

    influence(xy, coef(lm(y~x)))   # works

Using assign.frame1 = T also gets around the error.

    influence(xy, coef(lm(y~x)), assign.frame1 = T)   # works
"influence"
method of
resampGetL
,
which invokes
influence
. Thus the following fails.
bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B=20, seed = 0) resampGetL(bfit, method="influence") # failsWe can workaround this error by getting rid of the
data
argument.
bfit <- bootstrap(xy, coef(lm(y~x)), B=20, seed = 0) resampGetL(bfit, method="influence") # worksOr we can use the
lm
method for
bootstrap
.
bfit <- bootstrap(lm(y~x, data = xy), coef, B=20, seed = 0) resampGetL(bfit, method="influence") # nope, works but all zerosNote that the other methods for
resampGetL
(
"jackknife"
,
"ace"
, etc.) do not have these problems.
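For instance, a sketch using the "jackknife" method on the object that
failed with method="influence":

```
bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B = 20, seed = 0)
resampGetL(bfit, method = "jackknife")   # works
```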
The same scoping problem manifests itself in limits.tilt,
tiltAfterBootstrap, and limits.abc when the statistic involves modeling
functions. For example,

    bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B=20, seed = 0)
    limits.tilt(bfit)                                           # fails
    tiltAfterBootstrap(bfit)                                    # fails
    limits.abc(fuel.frame, coef(lm(cform, data = fuel.frame)))  # fails

The same cures work for these functions.
Intermittent Statistic Failure

Here is an example where bootstrap fails because the statistic sometimes
fails.

    # fit Michaelis and Menten's original data
    conc <- c(0.3330, 0.1670, 0.0833, 0.0416, 0.0208, 0.0104, 0.0052)
    vel <- c(3.636, 3.636, 3.236, 2.666, 2.114, 1.466, 0.866)
    Micmen <- data.frame(conc=conc, vel=vel)
    param(Micmen,"K") <- 0.02; param(Micmen,"Vm") <- 3.7
    fit <- nls(vel~Vm*conc/(K+conc), data = Micmen)
    set.seed(0)
    bootstrap(Micmen, coef(nls(vel~Vm*conc/(K+conc), data = Micmen)),
              B=200, seed=0)
    # fails, step factor below minimum on replication 144

One workaround is to construct a new statistic that can detect failures in
nls and simply return NA for those samples. The function try can be used
for this purpose.

    # Use try(). If the result is of class "Error", then return
    # rep(NA, same length as other replications).
    try.expr <- Quote({
      result <- try(coef(nls(vel~Vm*conc/(K+conc), data = Micmen)))
      if(is(result, "Error")) rep(NA, 2) else result
    })
    bootstrap(Micmen, try.expr, B=200, seed=0)

The error still occurs for the bad sample, but the remaining samples can
be processed.
The same type of problem can occur with other modeling functions. One
example occurs with lm using factor data. If one of the factor categories
is relatively rare, some bootstrap samples will not contain that category.

    abc <- data.frame(x=1:20, y=sort(runif(20)),
                      abc=factor(c(rep("a",8), rep("b",9), rep("c",3))))
    bootstrap(lm(y~x+abc, data=abc), coef, B=100, seed = 0)

An error occurs in lm due to the singular fit. It does not help to specify
singular.ok=T in the call to lm:

    bootstrap(lm(y~x+abc, data=abc, singular.ok = T), coef,
              B=100, seed = 0)

Now bootstrap complains that the statistic returns results with varying
length, because coef returns fewer coefficients on those samples lacking
category "c". One solution is to use singular.ok=F and to handle failures
using try, as above.
    try.expr <- Quote({
      result <- try(coef(data))
      if(is(result, "Error")) rep(NA, 4) else result
    })
    bootstrap(lm(y~x+abc, data=abc), try.expr, B=100, seed = 0)

Another solution is to use singular.ok = T and coef.default rather than
coef.

    bootstrap(lm(y~x+abc, data=abc, singular.ok = T), coef.default,
              B=100, seed = 0)

This works because coef, when dispatched on an lm object, only returns
coefficients corresponding to the non-singular part of the model, while
coef.default returns all coefficients, including NA for the singular
parts. As an aside, note that coef.default simply extracts the
coefficients component from the lm object. Therefore an equivalent
workaround is

    bootstrap(lm(y~x+abc, data=abc, singular.ok = T), data$coefficients,
              B=100, seed = 0)
If you create a numeric object named mean, then this fails:

    bootstrap(x, mean)

The solution is to remove your object, which is masking the function mean.
Similarly for other functions. You can check for masked names using
masked().
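A minimal sketch of the problem and its cure (x is any numeric vector;
rm() removes the masking object):

```
x <- 1:20
mean <- 5              # numeric object masking the function mean()
# bootstrap(x, mean)   # fails -- mean is not a function here
masked()               # check which names are masked
rm(mean)               # remove the masking object
bootstrap(x, mean)     # works
```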
Some methods for bootstrap (such as the lm method) create objects with a
modified call component, which prevents update from working properly. For
example:

    fit <- lm(Fuel ~ Weight, data = fuel.frame)
    boot1 <- bootstrap(fit, coef, B=30, seed=0)
    boot2 <- update(boot1, lmsampler = "residuals")

That gives a warning about unrecognized arguments. The reason is that the
lm method for bootstrap modifies the call:

    boot1$call
    # bootstrap(data = lm(formula = Fuel ~ Weight, data = fuel.frame,
    #     method = "model.list"), statistic = coef.default(lm(data)),
    #     B = 30, seed = 0)

The first workaround is to give the command by hand:
    boot2b <- bootstrap(fit, coef, B=30, seed = 0, lmsampler = "residuals")

The second is to replace the call component with the actual call, then
update:

    boot1b <- boot1   # work with a copy in case you need boot1 later
    boot1b$call <- boot1$actual.calls[[1]]
    boot2c <- update(boot1b, lmsampler = "residuals")

The two methods yield the same results (if using the same seed):

    all.equal(boot2b, boot2c)   # TRUE