Compute Linear Approximations for Resample Objects

DESCRIPTION:

Calculate linear approximations for a (linear or nonlinear) statistic. The function is generic (see Methods), with methods for bootstrap, jackknife, influence, and bootstrap2 objects; the default method handles other objects.

USAGE:

resampGetL(x, ...) 
resampGetL.bootstrap(x, method = <<see below>>, ..., 
                     model.mat, formula, data, frame.eval) 
resampGetL.jackknife(x, method = <<see below>>, ..., frame.eval) 
resampGetL.influence(x) 
resampGetL.bootstrap2(x, ..., frame.eval) 

REQUIRED ARGUMENTS:

x
object of class bootstrap, jackknife, influence, or bootstrap2; another resamp object; or a function call with arguments data and statistic.

OPTIONAL ARGUMENTS:

method
a character string determining the method used to compute the L-statistic values (overriding the default is sketched at the end of this section). For most resamp objects the possible values are "jackknife" and "influence"; for bootstrap objects the additional choices "ace" and "regression" are available. The default depends on the sample size n, the number B of bootstrap replications present, and whether sampling was by group (stratified). If model.mat is supplied, or if B > 2*n+100, the default is "ace". Otherwise "influence" or "jackknife" is used: without stratified sampling the default is "jackknife"; with stratified sampling it is "influence" if the statistic can be modified to include weights (see below), and "jackknife" otherwise.
model.mat
model matrix used in the linear model fit, required for methods "ace" and "regression", unless formula is supplied.
formula
a formula object, with the response on the left of a ~ operator, and the terms, separated by + operators, on the right. Used to create the model matrix.
data
a data.frame in which to interpret the variables named in the formula. By default the original data used in bootstrapping is used if it is a data frame.
...
Other arguments which may affect calculations, e.g. epsilon for the "influence" method and df for the "ace" method.
frame.eval
frame where the data and other objects used when creating x can be found. You need to specify this if objects cannot be found by their original names, or have changed.
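
For example (a minimal sketch; it assumes a bootstrap object created as in the EXAMPLES section, and frame.eval = sys.parent() is the usual choice when calling from inside another function):

bfit <- bootstrap(stack.loss, mean, save.indices = T)
L <- resampGetL(bfit, method = "regression")      # override the default method
L <- resampGetL(bfit, frame.eval = sys.parent())  # when objects moved or changed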

VALUE:

vector or matrix containing approximate empirical influence function values for each data point. There are n rows, where n is the original number of observations or subjects, and p columns when the statistic is p-valued. When sampling by subject, the row names of the result are the sorted unique values of the subject argument taken from the call to bootstrap or jackknife.

The results are normalized to sum to zero (by group, if sampling by group; see below).
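
For instance, a quick check (using the mean of stack.loss, as in the EXAMPLES):

L <- resampGetL(jackknife(stack.loss, mean))
sum(L)  # approximately zero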

The result has a "method" attribute giving the method used. For the two regression methods, the result also has a "correlation" attribute giving the multiple correlation between the (transformed) bootstrap replicates and the linear approximation. For the "influence" method, the result has an "epsilon" attribute (see below).
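
These can be inspected directly; a brief sketch, assuming L1 was computed by one of the regression methods as in the EXAMPLES:

attr(L1, "method")       # method actually used, e.g. "ace"
attr(L1, "correlation")  # present for the regression methods only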

DETAILS:

The "influence" method calculations are carried out by , using functional differentiation with a finite value of epsilon. The statistic must accept a weights argument, or be an expression involving one or more functions that accept a weights argument.

The "jackknife" method gives the ordinary jackknife estimate for the empirical influence values. Calculations are carried out by which in turn may call jackknife.

The "regression" and "ace" methods perform regression with bootstrap replicates as the response variable. They call for calculations. These methods run faster if the indices are saved in the bootstrap object. The number of explanatory variables is n, so these methods should only be used if the number of bootstrap replications B is large enough to estimate that many parameters, say B>2*n+100.

The "ace" variant perform an initial regression, transforms the response to improve linearity, then performs a final regression.

The model.mat matrix should have one row for each observation (or for each subject). An initial column of 1's is optional (it is added if not present). It should contain columns which together have a high "multiple correlation" with the statistic of interest. For example, if the statistic is var(x), then cbind(x, x^2) or cbind(x, (x-mean(x))^2) would be suitable; these could also be specified by the formulae ~x + x^2 or ~poly(x,2), respectively. Here "multiple correlation" means the correlation between the original bootstrapped statistic (the replicates) and the (multivariate) bootstrapped sample means of the model matrix columns, using the same bootstrap indices. In other words, each column of the model matrix can be viewed as a set of data whose sample mean is bootstrapped; these sample means should have high multiple correlation with the actual statistic in order for the resulting linear approximations to be accurate.
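
For example, for the var(x) case above (a sketch):

x <- rnorm(100)
bfit <- bootstrap(x, var(x), save.indices = T)
Lv1 <- resampGetL(bfit, model.mat = cbind(x, x^2))
Lv2 <- resampGetL(bfit, formula = ~ x + x^2)  # same model matrix via a formula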

If model.mat has k columns, the number of bootstrap replications B need only be large enough to estimate k parameters, say B > 2*k+100.

Similarly, for "regression" and "ace" the multiple correlation should be high for linear approximations to be accurate. The estimated multiple correlation is given as an attribute to the result. This is not adjusted for degrees of freedom, or for the transformation used by the "ace" method.

Sampling by group (stratified sampling) and sampling by subject are supported by all methods. However, in the group case the "jackknife" method should not be used for some statistics: if the statistic would give a different value when all observations in one group are repeated twice, the statistic does not normalize weights by group, and the jackknife estimates will be mildly or badly inaccurate (a quick diagnostic is sketched below). Sampling by subject can also cause problems for the "influence" method, because statistics vary in how the weight for a subject should be divided among the corresponding observations. Currently the weights for a subject are replicated to each observation, but this is subject to change.
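
A quick way to apply the repeated-group check (a sketch; myStat, df, and the group label "A" are placeholders for your own statistic and data frame):

theta1 <- myStat(df)                    # df has a grouping column 'group'
dd <- rbind(df, df[df$group == "A", ])  # repeat one group's rows twice
theta2 <- myStat(dd)
all.equal(theta1, theta2)  # FALSE suggests the jackknife will be inaccurate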

For correct results with the "influence" method, all functions in the expression that depend on the data should accept a weights argument. For example, suppose the original statistic is mean(x) - median(x), where mean has a weights argument but median does not. The internal calculations create a new expression in which weights are added to every function that accepts them:
mean(x, weights = Splus.resamp.weights) - median(x)
The results are then incorrect, because weighted medians are not calculated where they should be.
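
One workaround in this situation (a sketch) is to choose a method that does not rely on the weights mechanism at all:

x <- rnorm(100)
bfit <- bootstrap(x, mean(x) - median(x), save.indices = T)
L <- resampGetL(bfit, method = "regression")  # no weights argument needed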

median and other non-smooth functions also cause problems for methods that depend on smoothness, including "jackknife" and "influence" with a small value of epsilon; these finite-difference derivative methods are not suitable for non-smooth statistics. For such statistics use the regression methods, or "influence" with a large epsilon, e.g. epsilon=1/sqrt(n) (the "butcher knife").
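
For example (a sketch; epsilon is passed through the ... argument, and bfit is assumed to hold a bootstrap of a weights-accepting statistic on n observations):

L <- resampGetL(bfit, method = "influence", epsilon = 1/sqrt(n))  # "butcher knife"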

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

Hesterberg, T.C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations," Journal of Computational and Graphical Statistics, 4, 113-133.

Hesterberg, T.C. and Ellis, S.J. (1999), "Linear Approximations for Functional Statistics in Large-Sample Applications," Technical Report No. 86, http://www.insightful.com/Hesterberg

BUGS:

resampGetL can fail when method = "influence" if the statistic in x calls a modeling function like lm; see influence for details.

SEE ALSO:

bootstrap, jackknife, influence, indexMeans, groupMeans.

EXAMPLES:

bfit <- bootstrap(stack.loss, mean) 
L1 <- resampGetL(bfit, "jackknife") 
# Same result using jackknife object 
jfit <- jackknife(stack.loss, mean) 
L2 <- resampGetL(jfit) 
all.equal(L1, L2) 
 
### Example: correlation for bivariate data 
set.seed(1); x <- rmvnorm(100, d=2, rho=.5) 
bfit2 <- bootstrap(x, cor(x[,1], x[,2]), save.indices=T) 
L1 <- resampGetL(bfit2)  # "ace" method 
L2 <- resampGetL(bfit2, model.mat = cbind(x, x^2, x[,1]*x[,2])) 
L2b <- resampGetL(bfit2, formula = ~poly(x,2))  # equivalent to previous 
L3 <- resampGetL(bfit2, method="jackknife") 
L4 <- resampGetL(bfit2, method="influence") 
L5 <- influence(x, cor(x[,1], x[,2]), returnL=T) 
plot(x[,1], x[,2]) 
contour(interp(x[,1], x[,2], L4), add=T) 
# points in top right and lower left have positive influence on correlation 
contour(interp(x[,1], x[,2], L1), add=T, col=2) # more random variation 
contour(interp(x[,1], x[,2], L2), add=T, col=3) # less random variation 
all.equal(L2, L2b) # identical 
all.equal(L4, L5)  # identical 
cor(cbind(L1, L2, L3, L4))  # high correlation 
# Accuracy for linear approximation: 
plot(indexMeans(L1, bfit2$indices) + bfit2$observed, bfit2$replicates, 
     xlab = "Linear approximation", ylab="Actual bootstrap values") 
abline(0,1,col=2) 
cor(indexMeans(L1, bfit2$indices), bfit2$replicates) 
# correlation .989 between bootstrap replicates and linear approximation 
attr(L1, "correlation")  # .989 
 
### Example: sampling by subject 
bfit3 <- bootstrap(fuel.frame, mean(Fuel), subject = Type, 
                   save.indices = T) 
L1 <- resampGetL(bfit3, method = "ace") 
means <- groupMeans(fuel.frame$Fuel, fuel.frame$Type) 
counts <- table(fuel.frame$Type) 
L2 <- resampGetL(bfit3, model.mat = cbind(means, counts, means*counts)) 
L3 <- resampGetL(bfit3, method="jackknife") 
L4 <- resampGetL(bfit3, method="influence") 
L5 <- resampGetL(bfit3, model.mat = cbind(means)) 
cor(cbind(L1, L2, L3, L4, L5))  # high correlation, except for L5 
# The model.mat for L5 did not provide a suitable basis 
# for predicting the bootstrap statistics (which correspond to 
# means of resampled subject means, weighted by resampled subject counts)