resampGetL(x, ...)
resampGetL.bootstrap(x, method = <<see below>>, ...,
model.mat, formula, data, frame.eval)
resampGetL.jackknife(x, method = <<see below>>, ..., frame.eval)
resampGetL.influence(x)
resampGetL.bootstrap2(x, ..., frame.eval)
x: a resamp object (e.g. produced by bootstrap, jackknife, or
influence), or another object with components data and
statistic.
method: character string specifying the method. For resamp objects
the possible values are "jackknife" and "influence". For bootstrap
objects additional choices are "ace" and "regression". The default
depends on the sample size n, the number B of bootstrap
replications present, and whether sampling was by group
(stratified): if model.mat is present, or if B>2*n+100, the
default is "ace". Otherwise "influence" or "jackknife" is used: if
stratified sampling was not used, the default is "jackknife"; if
stratified sampling was used, the method is "influence" if the
statistic can be modified to include weights (see below), and
"jackknife" otherwise. A sketch of this selection logic follows
the argument descriptions.
"ace" and
"regression", unless
formula is supplied.
~
operator, and the terms, separated by
+ operators, on the
right. Used to create the model matrix.
epsilon for
and
df for the "ace" methods, see
.
x can be found.
You need to specify this if objects can't be found by their
original names, or have changed; see
.
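A hedged sketch of the default-method selection described above
(chooseMethod and its flag arguments are illustrative only, not
part of the package):

chooseMethod <- function(B, n, have.model.mat = F, stratified = F,
                         statistic.accepts.weights = T)
{
  if(have.model.mat || B > 2*n + 100) "ace"
  else if(!stratified) "jackknife"
  else if(statistic.accepts.weights) "influence"
  else "jackknife"
}
chooseMethod(B = 1000, n = 21)  # "ace", since 1000 > 2*21 + 100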
A matrix with
n rows, where n is the original number of observations
or subjects, and p columns, where the
statistic is p-valued. When sampling by subject, the
row names of the result are the sorted unique values of the
subject argument taken from the call to bootstrap or jackknife.
The results are normalized to sum to zero
(by group, if sampling by
group; see below).
The result has a
"method" attribute giving the method.
For the two regression methods, the result has a
"correlation" attribute giving the multiple correlation between
(transformed) bootstrap replicates and the linear approximation.
For the
"influence" method, the result has an
"epsilon"
component (see below).
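For example (a minimal sketch; stack.loss has n=21 observations):

bfit <- bootstrap(stack.loss, mean)
L <- resampGetL(bfit)
dim(L)             # 21 x 1: n = 21 observations, p = 1
sum(L)             # approximately zero (normalized)
attr(L, "method")  # the method actually used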
The
"influence" method calculations are carried out by
,
using functional differentiation with a
finite value of
epsilon.
The
statistic must accept a
weights argument, or be an expression
involving one or more functions that accept a
weights argument.
The
"jackknife" method gives the ordinary jackknife estimate for the
empirical influence values.
Calculations are carried out by
resampGetL.jackknife, which in turn may call
jackknife.
The
"regression" and
"ace" methods perform regression
with bootstrap replicates as the response
variable. They call an internal regression routine for the
calculations.
These methods run faster if the
indices are saved in
the bootstrap object.
The number of explanatory variables is
n, so these
methods should only be used if the number of bootstrap
replications
B is large enough to estimate that many parameters,
say
B>2*n+100.
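For example, a small sketch of this check (assuming the object
returned by bootstrap stores B and n as components):

bfit <- bootstrap(stack.loss, mean, B = 1000, save.indices = T)
bfit$B > 2*bfit$n + 100  # T: B is large enough, "ace" is the default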
The
"ace" variant perform an initial regression, transforms the
response to improve linearity, then performs a final regression.
The
model.mat matrix should have one row for each observation
(or for each subject).
An initial column of 1's is optional (it is added if not present).
It should contain columns which together
have a high "multiple correlation" with the statistic of interest.
For example, if the statistic is
var(x), then
cbind(x, x^2)
or
cbind(x, (x-mean(x))^2) would be suitable;
these could also be specified by formulae,
~x + x^2 or
~poly(x,2), respectively.
Here "multiple correlation" is between
the original bootstrapped statistic (
replicates)
and the (multivariate) bootstrapped sample means of the model matrix,
using the same bootstrap indices.
In other words, you can view each column of the model matrix as a set
of data whose sample mean is bootstrapped; these sample means
should have high multiple correlation with the actual statistics
in order for the resulting linear approximations to be accurate.
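Continuing the var(x) example, a brief sketch of this check
(assuming indices were saved):

set.seed(0)
x <- rnorm(50)
vfit <- bootstrap(x, var(x), save.indices = T)
Lv <- resampGetL(vfit, model.mat = cbind(x, x^2))
attr(Lv, "correlation")  # multiple correlation; should be near 1
# per-column correlations of the resampled model-matrix means
# with the replicates, using the same bootstrap indices:
cor(indexMeans(cbind(x, x^2), vfit$indices), vfit$replicates)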
If model.mat has k columns, the number of bootstrap replications
B should be large enough to estimate that many parameters,
say B>2*k+100.
Similarly, for
"regression" and
"ace" the multiple correlation
should be high for linear approximations to be accurate.
The estimated multiple correlation is given as an attribute to the
result. This is not adjusted for
degrees of freedom, or for the transformation used by the
"ace" method.
Sampling by
group (stratified sampling) and by
subject
are supported by all methods.
However, in the group case the
"jackknife" method should
not be used for some statistics. If the statistic would give a different
value if all observations in one group were repeated twice, this
indicates that the statistic does not normalize weights by group,
and the jackknife estimates will be mildly or badly inaccurate.
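A hedged diagnostic sketch (myStat is illustrative): double the
observations in one group and check whether the statistic changes:

x <- runif(20); g <- rep(1:2, c(10, 10))
x2 <- c(x, x[g == 1]); g2 <- c(g, g[g == 1])
myStat <- function(x, g) mean(groupMeans(x, g))  # normalizes by group
all.equal(myStat(x, g), myStat(x2, g2))  # T: safe for "jackknife"
mean(x) == mean(x2)  # F: the pooled mean does not normalize by group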
Sampling by
subject can also cause problems for the
"influence"
method, because statistics vary in how the weight for a subject
should be divided among the corresponding observations.
Currently the weights for a subject are replicated to each observation,
but this is subject to change.
For correct results with the
influence method,
all functions in the expression that
depend on the data should accept a
weights
argument. For example, suppose the original statistic is
mean(x)-median(x), where
mean has a
weights argument but
median
does not. The internal calculations create a new expression,
in which weights are added to every function that accepts them:
mean(x,weights=Splus.resamp.weights)-median(x).
Results are incorrect, because weighted medians are not calculated
when they should be.
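For example, one could wrap a statistic so that every function
involved accepts weights (wtdMean here is illustrative, not part
of the package):

wtdMean <- function(x, weights = rep(1, length(x)))
  sum(weights * x)/sum(weights)
bfit <- bootstrap(stack.loss, wtdMean(stack.loss))
L <- resampGetL(bfit, method = "influence")
# internally this evaluates
# wtdMean(stack.loss, weights = Splus.resamp.weights)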
median and other non-smooth functions also cause problems
for methods that depend on smoothness, including
"jackknife"
and
"influence" with a small value of
epsilon;
these finite-difference derivative methods are not suitable for non-smooth
statistics.
For such statistics
use the regression methods, or
"influence" with a large
epsilon,
e.g.
epsilon=1/sqrt(n) (the "butcher knife").
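For instance, for the median one might use a regression method
(a hedged sketch):

mfit <- bootstrap(stack.loss, median, save.indices = T)
Lm <- resampGetL(mfit, method = "ace")  # smoothness not required
# for a statistic whose functions accept weights, such as wtdMean
# above, the butcher knife would be (here n = 21):
# L <- resampGetL(bfit, method = "influence", epsilon = 1/sqrt(21))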
Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.
Hesterberg, T.C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations," Journal of Computational and Graphical Statistics, 4, 113-133.
Hesterberg, T.C. and Ellis, S.J. (1999), "Linear Approximations for Functional Statistics in Large-Sample Applications," Technical Report No. 86, http://www.insightful.com/Hesterberg
resampGetL can fail when
method = "influence" if the statistic in
x calls a modeling function like
lm.
bfit <- bootstrap(stack.loss, mean)
L1 <- resampGetL(bfit, "jackknife")
# Same result using jackknife object
jfit <- jackknife(stack.loss, mean)
L2 <- resampGetL(jfit)
all.equal(L1, L2)
### Example: correlation for bivariate data
set.seed(1); x <- rmvnorm(100, d=2, rho=.5)
bfit2 <- bootstrap(x, cor(x[,1], x[,2]), save.indices=T)
L1 <- resampGetL(bfit2) # "ace" method
L2 <- resampGetL(bfit2, model.mat = cbind(x, x^2, x[,1]*x[,2]))
L2b <- resampGetL(bfit2, formula = ~poly(x,2)) # equivalent to previous
L3 <- resampGetL(bfit2, method="jackknife")
L4 <- resampGetL(bfit2, method="influence")
L5 <- influence(x, cor(x[,1], x[,2]), returnL=T)
plot(x[,1], x[,2])
contour(interp(x[,1], x[,2], L4), add=T)
# points in top right and lower left have positive influence on correlation
contour(interp(x[,1], x[,2], L1), add=T, col=2) # more random variation
contour(interp(x[,1], x[,2], L2), add=T, col=3) # less random variation
all.equal(L2, L2b) # identical
all.equal(L4, L5) # identical
cor(cbind(L1, L2, L3, L4)) # high correlation
# Accuracy for linear approximation:
plot(indexMeans(L1, bfit2$indices) + bfit2$observed, bfit2$replicates,
xlab = "Linear approximation", ylab="Actual bootstrap values")
abline(0,1,col=2)
cor(indexMeans(L1, bfit2$indices), bfit2$replicates)
# correlation .989 between bootstrap replicates and linear approximation
attr(L1, "correlation") # .989
### Example: sampling by subject
bfit3 <- bootstrap(fuel.frame, mean(Fuel), subject = Type,
save.indices = T)
L1 <- resampGetL(bfit3, method = "ace")
means <- groupMeans(fuel.frame$Fuel, fuel.frame$Type)
counts <- table(fuel.frame$Type)
L2 <- resampGetL(bfit3, model.mat = cbind(means, counts, means*counts))
L3 <- resampGetL(bfit3, method="jackknife")
L4 <- resampGetL(bfit3, method="influence")
L5 <- resampGetL(bfit3, model.mat = cbind(means))
cor(cbind(L1, L2, L3, L4, L5)) # high correlation, except for L5
# The model.mat for L5 did not provide a suitable basis
# for predicting the bootstrap statistics (which correspond to
# means of resampled subject means, weighted by resampled subject counts)