Calculate linear approximations using regression on bootstrap samples

DESCRIPTION:

Calculate a regression approximation to empirical influence function values, using bootstrap (or other resampling) replicates and indices.

USAGE:

linearApproxReg(replicates, indices, n=max(indices), 
                model.mat, formula, data, 
                group, subject, weights, 
                transform=T, df=3, details=T, ...) 

REQUIRED ARGUMENTS:

replicates
matrix containing the bootstrapped statistic values, with B rows (the number of bootstrap samples) and one or more columns (for univariate or multivariate statistics).
indices
matrix containing resampling indices, with B columns, and normally with n rows (the number of observations or subjects in the original data).
n
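number of observations (or subjects, if sampling by subject). By default this is set to max(indices), but it is better to supply it, because max(indices) is smaller than n whenever the highest-numbered observations do not appear in any resample.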

OPTIONAL ARGUMENTS:

model.mat
model matrix, with one row for each observation (or subject) and columns which together will yield a high multiple correlation with the statistic of interest.
formula
a formula object, with the response on the left of a ~ operator, and the terms, separated by + operators, on the right. Used to create the model matrix; if supplied then model.mat is ignored.
data
a data.frame in which to interpret the variables named in the formula.
group
the group vector, if the original resampling was by group.
subject
the subject vector, if the original resampling was by subject.
weights
vector of length B, importance sampling weights.
transform
logical; if TRUE (the default), then after an initial regression the response variable (the replicates) is transformed to obtain a more linear relationship with the predicted values, and a second regression is performed.
df
degrees of freedom for the transformation; this is passed to the routine that fits the transformation.
details
logical; if TRUE (the default), the multiple correlation between the (transformed) replicates and the predicted values is attached as an attribute to the returned linear approximation values.
...
not currently used.

VALUE:

vector or matrix containing approximate empirical influence function values for each data point.

In the univariate case a vector L such that
replicates[i] ~= c + mean(L[indices[,i]])
where c is the statistic value for the observed data.

In the multivariate case this relationship holds for each column.
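
The relationship above can be checked by hand in a case where the influence values are known exactly. The following is a minimal sketch, not part of the function itself, written in base R-style S syntax (an assumption; minor changes may be needed in S-PLUS). It uses the sample mean as the statistic, for which L = x - mean(x) and the approximation is exact. The names c0 and linear.approx are introduced here for illustration only.

```r
# Sketch: check replicates[i] ~= c + mean(L[indices[, i]]) for the sample mean,
# where the influence values are known exactly: L = x - mean(x).
set.seed(1)
n <- 20                                   # number of observations
B <- 200                                  # number of bootstrap samples
x <- rnorm(n)

# n x B matrix of resampling indices, one column per bootstrap sample
indices <- matrix(sample(n, n * B, replace = TRUE), nrow = n, ncol = B)

# Bootstrap replicates of the statistic (here the mean of each resample)
replicates <- apply(indices, 2, function(idx) mean(x[idx]))

c0 <- mean(x)                             # statistic for the observed data
L <- x - c0                               # exact influence values for the mean

# Linear approximation to each replicate
linear.approx <- c0 + apply(indices, 2, function(idx) mean(L[idx]))

max(abs(replicates - linear.approx))      # zero up to rounding error
sum(L)                                    # zero up to rounding error
```

For a nonlinear statistic the two sides agree only approximately, which is why the values are estimated by regression.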

There are n rows, where n is the original number of observations or subjects, and p columns, where the statistic is p-valued. In the subject case, the row names of the result are the unique values of the subject argument taken from the original resampling call.

The results are normalized to sum to zero (by group, if sampling by group; see below).

If details==TRUE the result has a "correlation" attribute giving the multiple correlation between (transformed) bootstrap replicates and the linear approximation.

DETAILS:

This function is normally called by resampGetL.bootstrap, but may also be called directly.

The model.mat matrix should have one row for each observation (or for each subject). An initial column of 1's is optional (it is added if not present). The matrix should contain columns which together have a high "multiple correlation" with the statistic of interest. For example, if the statistic is var(x), then cbind(x, x^2) or cbind(x, (x-mean(x))^2) would be suitable. Here "multiple correlation" is between the original bootstrapped statistic (replicates) and the (multivariate) bootstrapped sample means of the model matrix, using the same bootstrap indices. In other words, you can view each column of the model matrix as a set of data whose sample mean is bootstrapped; these sample means should have high multiple correlation with the actual statistics in order for the resulting linear approximations to be accurate.
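
The idea in the preceding paragraph can be sketched directly. This is an illustration under stated assumptions, not the implementation used by linearApproxReg: it is written in base R-style S syntax, it omits the transformation step controlled by transform and df, and the names boot.means, centered and multiple.cor are hypothetical. The statistic is var(x), and the model matrix is cbind(x, x^2) as suggested above.

```r
# Sketch: regress bootstrap replicates of var(x) on bootstrapped column means
# of a model matrix, then convert the coefficients into influence values.
set.seed(2)
n <- 50                                   # number of observations
B <- 500                                  # number of bootstrap samples
x <- rnorm(n)

# n x B matrix of resampling indices and the corresponding replicates
indices <- matrix(sample(n, n * B, replace = TRUE), nrow = n, ncol = B)
replicates <- apply(indices, 2, function(idx) var(x[idx]))

# Model matrix: one row per observation
model.mat <- cbind(x = x, x2 = x^2)

# B x 2 matrix: column means of the model matrix within each resample
boot.means <- t(apply(indices, 2,
                      function(idx) colMeans(model.mat[idx, , drop = FALSE])))

# Least-squares regression of the replicates on the bootstrapped means
fit <- lm.fit(cbind(Intercept = 1, boot.means), replicates)
beta <- fit$coefficients[-1]              # slopes only

# Influence values: centered model matrix times the slopes (sums to zero)
centered <- sweep(model.mat, 2, colMeans(model.mat))
L <- drop(centered %*% beta)

# Multiple correlation between replicates and their linear predictions
multiple.cor <- cor(replicates, fit$fitted.values)
```

A multiple correlation close to 1 suggests that the chosen columns capture most of the variation in the statistic; a low value suggests that further columns are needed.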

The indices argument normally has n rows. However, it may have more or fewer rows when the bootstrap sample size differs from the original sample size. In permutation testing for two-sample problems, it may instead contain the indices corresponding to just one of the samples.
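
As a small sketch of the reduced-size case (base R-style S syntax, with the resample size m and the other names chosen here for illustration), the indices matrix below has m rows rather than n. The number of rows no longer reveals n, and max(indices) is only a lower bound for it, so n should be supplied explicitly in such a call.

```r
# Sketch: resampling with size m smaller than the original sample size n.
set.seed(3)
n <- 30                                   # original sample size
m <- 10                                   # size of each resample
B <- 100                                  # number of resamples
x <- rnorm(n)

# m x B matrix of indices: fewer rows than observations
indices <- matrix(sample(n, m * B, replace = TRUE), nrow = m, ncol = B)
replicates <- apply(indices, 2, function(idx) mean(x[idx]))

# The influence values still have length n, one per original observation
L <- x - mean(x)
linear.approx <- mean(x) + apply(indices, 2, function(idx) mean(L[idx]))

nrow(indices)                             # m, not n
max(indices) <= n                         # max(indices) never exceeds n
max(abs(replicates - linear.approx))      # zero up to rounding error
```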

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

Hesterberg, T.C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations," Journal of Computational and Graphical Statistics, 4, 113-133.

Hesterberg, T.C. and Ellis, S.J. (1999), "Linear Approximations for Functional Statistics in Large-Sample Applications," Technical Report No. 86, http://www.insightful.com/Hesterberg

EXAMPLES:

### Example: correlation for bivariate data 
set.seed(1)
x <- rmvnorm(100, d=2, rho=.5) 
# Bootstrap the correlation, saving the resampling indices
bfit2 <- bootstrap(x, cor(x[,1], x[,2]), save.indices=T) 
# Linear approximations obtained through resampGetL
L1 <- resampGetL(bfit2)  # "ace" method 
L2 <- resampGetL(bfit2, model.mat = cbind(x, x^2, x[,1]*x[,2])) 
# The same approximations obtained by calling linearApproxReg directly
L3 <- linearApproxReg(bfit2$replicates, bfit2$indices) 
L4 <- linearApproxReg(bfit2$replicates, bfit2$indices, 
                      model.mat = cbind(x, x^2, x[,1]*x[,2])) 
# L1 and L3 are identical; L2 and L4 are identical.