USAGE:
resampGetL(x, ...)
resampGetL.bootstrap(x, method = <<see below>>, ..., model.mat,
                     formula, data, frame.eval)
resampGetL.jackknife(x, method = <<see below>>, ..., frame.eval)
resampGetL.influence(x)
resampGetL.bootstrap2(x, ..., frame.eval)
ARGUMENTS:
x: a resamp object (e.g. created by bootstrap, jackknife, or influence), with components data and statistic.
method: character string indicating which method to use. For resamp objects the possible values are "jackknife" and "influence". For bootstrap objects additional choices are "ace" and "regression". Default values depend on sample sizes, the number B of bootstrap replications present, and whether sampling was by group (stratified): if model.mat is present, or if B > 2*n+100, the default is "ace". Otherwise "influence" or "jackknife" is used: if stratified sampling was not used, the default is "jackknife"; if stratified sampling was used, the method is "influence" if the statistic can be modified to include weights (see below), and "jackknife" otherwise. The sketch after this argument list shows how to check which method was selected.
"ace"
and
"regression"
, unless
formula
is supplied.
formula: a formula without a response, containing the ~ operator and the terms, separated by + operators, on the right. Used to create the model matrix.
data: data frame in which formula is evaluated (optional).
...: additional arguments passed to the underlying calculations, e.g. epsilon for the "influence" method and df for the "ace" method.
frame.eval: frame where the data and other objects used when creating x can be found. You need to specify this if objects can't be found by their original names, or have changed.
VALUE:
a matrix with n rows, where n is the original number of observations or subjects, and p columns, where the statistic is p-valued. When sampling by subject, the row names of the result are the sorted unique values of the subject argument taken from the call that created x.
The results are normalized to sum to zero (by group, if sampling by group; see below).
The result has a "method" attribute giving the method. For the two regression methods, the result has a "correlation" attribute giving the multiple correlation between (transformed) bootstrap replicates and the linear approximation. For the "influence" method, the result has an "epsilon" attribute (see below).
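For instance, the shape and normalization described above can be checked directly (a minimal sketch):

L <- resampGetL(jackknife(stack.loss, mean))
nrow(L)             # n rows, one per observation
sum(L)              # approximately zero after normalization
attr(L, "method")   # the method actually used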
DETAILS:
The "influence" method calculations are carried out by influence, using functional differentiation with a finite value of epsilon. The statistic must accept a weights argument, or be an expression involving one or more functions that accept a weights argument.
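For example, mean accepts a weights argument in S-PLUS, so a statistic built from it can be used directly with this method (a sketch):

bfit <- bootstrap(stack.loss, mean)
L <- resampGetL(bfit, method = "influence")
attr(L, "epsilon")   # the finite-difference epsilon stored on the result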
The "jackknife" method gives the ordinary jackknife estimate of the empirical influence values. Calculations are carried out by resampGetL.jackknife, which in turn may call jackknife.
The "regression" and "ace" methods perform regression with bootstrap replicates as the response variable. They call linearApproxReg for the calculations. These methods run faster if the indices are saved in the bootstrap object.
The number of explanatory variables is n, so these methods should only be used if the number of bootstrap replications B is large enough to estimate that many parameters, say B > 2*n+100.
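A rough pre-check along these lines (a sketch, assuming the object carries the B and n components that bootstrap objects here store):

bfit <- bootstrap(stack.loss, mean, save.indices = T)
bfit$B > 2 * bfit$n + 100   # if T, "regression" and "ace" are reasonable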
The "ace" variant performs an initial regression, transforms the response to improve linearity, then performs a final regression.
The model.mat matrix should have one row for each observation (or for each subject). An initial column of 1's is optional (it is added if not present). It should contain columns which together have a high "multiple correlation" with the statistic of interest. For example, if the statistic is var(x), then cbind(x, x^2) or cbind(x, (x-mean(x))^2) would be suitable; these could also be specified by formulae, ~x + x^2 or ~poly(x,2), respectively.
Here "multiple correlation" is between the original bootstrapped statistic (replicates) and the (multivariate) bootstrapped sample means of the model matrix, using the same bootstrap indices. In other words, you can view each column of the model matrix as a set of data whose sample mean is bootstrapped; these sample means should have high multiple correlation with the actual statistics in order for the resulting linear approximations to be accurate.
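Continuing the var(x) illustration (a runnable sketch, using stack.loss as x):

x <- stack.loss
bfitv <- bootstrap(x, var(x), save.indices = T)
Lm <- resampGetL(bfitv, model.mat = cbind(x, (x - mean(x))^2))
Lf <- resampGetL(bfitv, formula = ~ poly(x, 2))   # equivalent formula form
attr(Lm, "correlation")   # near 1 for an accurate linear approximation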
If model.mat has k columns, the number of bootstrap replications B need only be large enough to estimate that many parameters, say B > 2*k+100.
Similarly, for "regression" and "ace" the multiple correlation should be high for linear approximations to be accurate. The estimated multiple correlation is given as an attribute of the result. This is not adjusted for degrees of freedom, or for the transformation used by the "ace" method.
Sampling by group (stratified sampling) and by subject are supported by all methods. However, in the group case the "jackknife" method should not be used for some statistics. If the statistic would give a different value if all observations in one group were repeated twice, this indicates that the statistic does not normalize weights by group, and the jackknife estimates will be mildly or badly inaccurate.
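One way to apply this diagnostic (a sketch; dat, g, and myStat are hypothetical placeholders for the data, the group variable, and the statistic):

# Repeat all observations in the first group twice:
doubled <- rbind(dat, dat[g == unique(g)[1], , drop = F])
myStat(dat)       # original value
myStat(doubled)   # if this differs, the statistic does not normalize
                  # weights by group; avoid method "jackknife"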
Sampling by subject can also cause problems for the "influence" method, because statistics vary in how the weight for a subject should be divided among the corresponding observations. Currently the weights for a subject are replicated to each observation, but this is subject to change.
For correct results with the "influence" method, all functions in the expression that depend on the data should accept a weights argument. For example, suppose the original statistic is mean(x)-median(x), where mean has a weights argument but median does not. The internal calculations create a new expression, in which weights are added to every function that accepts them: mean(x, weights=Splus.resamp.weights) - median(x). Here the results are incorrect, because weighted medians are not calculated when they should be.
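One hedged workaround is to substitute a function that does accept weights; wtd.median below is illustrative only, not part of the library:

# Weighted (lower) median with a weights argument, so the internal
# rewrite can pass Splus.resamp.weights to it:
wtd.median <- function(x, weights = rep(1, length(x))) {
  ord <- order(x)
  cw <- cumsum(weights[ord]) / sum(weights[ord])
  x[ord][min(which(cw >= 0.5))]
}
bfitw <- bootstrap(stack.loss, mean(stack.loss) - wtd.median(stack.loss))
Lw <- resampGetL(bfitw, method = "influence")
# Note: the median is still non-smooth; see the discussion below.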
median and other non-smooth functions also cause problems for methods that depend on smoothness, including "jackknife" and "influence" with a small value of epsilon; these finite-difference derivative methods are not suitable for non-smooth statistics. For such statistics use the regression methods, or "influence" with a large epsilon, e.g. epsilon=1/sqrt(n) (the "butcher knife").
REFERENCES:
Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge: Cambridge University Press.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.
Hesterberg, T.C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations," Journal of Computational and Graphical Statistics, 4, 113-133.
Hesterberg, T.C. and Ellis, S.J. (1999), "Linear Approximations for Functional Statistics in Large-Sample Applications," Technical Report No. 86, http://www.insightful.com/Hesterberg
BUGS:
resampGetL can fail when method = "influence" if the statistic in x calls a modeling function like lm.
EXAMPLES:
bfit <- bootstrap(stack.loss, mean)
L1 <- resampGetL(bfit, "jackknife")

# Same result using jackknife object
jfit <- jackknife(stack.loss, mean)
L2 <- resampGetL(jfit)
all.equal(L1, L2)

### Example: correlation for bivariate data
set.seed(1); x <- rmvnorm(100, d=2, rho=.5)
bfit2 <- bootstrap(x, cor(x[,1], x[,2]), save.indices=T)
L1 <- resampGetL(bfit2)                          # "ace" method
L2 <- resampGetL(bfit2, model.mat = cbind(x, x^2, x[,1]*x[,2]))
L2b <- resampGetL(bfit2, formula = ~poly(x,2))   # equivalent to previous
L3 <- resampGetL(bfit2, method="jackknife")
L4 <- resampGetL(bfit2, method="influence")
L5 <- influence(x, cor(x[,1], x[,2]), returnL=T)
plot(x[,1], x[,2])
contour(interp(x[,1], x[,2], L4), add=T)
# points in top right and lower left have positive influence on correlation
contour(interp(x[,1], x[,2], L1), add=T, col=2)  # more random variation
contour(interp(x[,1], x[,2], L2), add=T, col=3)  # less random variation
all.equal(L2, L2b)           # identical
all.equal(L4, L5)            # identical
cor(cbind(L1, L2, L3, L4))   # high correlation

# Accuracy for linear approximation:
plot(indexMeans(L1, bfit2$indices) + bfit2$observed, bfit2$replicates,
     xlab = "Linear approximation", ylab = "Actual bootstrap values")
abline(0, 1, col=2)
cor(indexMeans(L1, bfit2$indices), bfit2$replicates)
# correlation .989 between bootstrap replicates and linear approximation
attr(L1, "correlation")      # .989

### Example: sampling by subject
bfit3 <- bootstrap(fuel.frame, mean(Fuel), subject = Type, save.indices = T)
L1 <- resampGetL(bfit3, method = "ace")
means <- groupMeans(fuel.frame$Fuel, fuel.frame$Type)
counts <- table(fuel.frame$Type)
L2 <- resampGetL(bfit3, model.mat = cbind(means, counts, means*counts))
L3 <- resampGetL(bfit3, method="jackknife")
L4 <- resampGetL(bfit3, method="influence")
L5 <- resampGetL(bfit3, model.mat = cbind(means))
cor(cbind(L1, L2, L3, L4, L5))   # high correlation, except for L5
# The model.mat for L5 did not provide a suitable basis
# for predicting the bootstrap statistics (which correspond to
# means of resampled subject means, weighted by resampled subject counts)