Empirical influence values

DESCRIPTION:

Calculate empirical influence values and related quantities

USAGE:

influence(data, statistic, args.stat,  
          group, subject,  
          label, statisticNames, 
          assign.frame1 = F, weights, 
          epsilon = 0.001, unbiased = F, returnL = F, 
          save.group, save.subject,  
          subjectDivide = F, modifiedStatistic) 

REQUIRED ARGUMENTS:

data
data; may be a vector, matrix, or data frame.
statistic
statistic to be calculated; a function or expression that returns a vector or matrix. Not all expressions work; see details below. It may be a function which accepts data as the first argument and has an argument named weights; other arguments may be passed using args.stat.
Or it may be an expression such as mean(x,trim=.2). If data is given by name (e.g. data=x) then use that name in the expression, otherwise (e.g. data=air[,4]) use the name data in the expression. If data is a data frame, the expression may involve variables in the data frame.

OPTIONAL ARGUMENTS:

args.stat
list of other arguments, if any, passed to statistic when calculating the statistic.
group
vector of length equal to the number of observations in data, for stratified sampling or multiple-sample problems. Sampling is done separately for each group (determined by unique values of this vector). If data is a data frame, this may be a variable in the data frame, or an expression involving such variables.
subject
vector of length equal to the number of observations in data; if present then subjects (determined by unique values of this vector) are resampled rather than individual observations. If data is a data frame, this may be a variable in the data frame, or an expression involving such variables. If group is also present, subject must be nested within group (each subject must be in only one group).
label
character string; if supplied, it is used when printing and as the main title for plotting.
statisticNames
character vector of length equal to the number of statistics calculated; if supplied, it is used as the statistic names for printing and plotting.
assign.frame1
logical flag indicating whether the resampled data should be assigned to frame 1 before evaluating the statistic. Try assign.frame1=T if all estimates are identical (this is slower).
weights
a vector of length equal to the number of observations (or subjects). The empirical influence function is calculated at the empirical distribution with these probabilities (normalized to sum to 1) on the observations or subjects. When sampling by subject these may be observation weights or subject weights. In the latter case, the vector may be named, in which case the names must correspond to the unique values of subject. Otherwise the weights are taken to be ordered with respect to the sorted values of subject. If data is a data frame, this may be a variable in the data frame, or an expression involving such variables. The default implies equal weights.
epsilon
small value used for numerical evaluation of derivatives.
unbiased
logical value; if TRUE then standard error estimates are computed using a divisor of (n-1) instead of n; then squared standard error estimates are more nearly unbiased.
returnL
logical flag; if TRUE, only the L matrix is returned, rather than the list described below.
save.group, save.subject
logical flags; if TRUE, the group and subject vectors, respectively, are saved in the returned object. Both defaults are TRUE if n<=10000.
subjectDivide
logical flag, meaningful only if sampling by subject. Internal calculations involve assigning weights to subjects; if TRUE then the weight for each subject is divided among observations for that subject before calculating the statistic; if FALSE the subject weight is replicated to observations for that subject. Also, if TRUE and weights contains observation weights, then initial subject weights will be the sums of weights for the observations.
modifiedStatistic
if your statistic is an expression that calls a function with a "hidden" weights argument, then pass this to indicate how to call your function. See below.
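
To illustrate the subjectDivide option described above, here is a minimal sketch of the two ways a subject weight can be mapped to observation weights. It uses only base functions; the variable names and the even split among observations are assumptions made for illustration, not the internal code of influence.

```r
# Hypothetical subject labels and subject weights (illustration only)
subject <- c("a", "a", "b", "c", "c", "c")
subjectWeights <- c(a = 0.2, b = 0.3, c = 0.5)

# Number of observations per subject, named by subject
nPerSubject <- tapply(rep(1, length(subject)), subject, sum)

# subjectDivide = F: each observation carries its subject's full weight
replicated <- as.vector(subjectWeights[subject])

# subjectDivide = T: each subject's weight is split among its observations
# (an even split is assumed here)
divided <- as.vector(subjectWeights[subject] / nPerSubject[subject])
```

With division, the observation weights sum to the same total as the subject weights; with replication, subjects with more observations contribute more total weight.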

VALUE:

object of class c("influence", "resamp"), with components call, observed, replicates, estimate, B, n, dim.obs, L, epsilon, defaultLabel, and perhaps (depending on whether sampling by group, subject, etc.) label, groupSizes, group, subject, modifiedStatistic, replicates2, and epsilon2. see for components not described below:
replicates
value of statistic evaluated at distance epsilon in each direction from weights. If sampling by subject, the rows are named with the unique values of subject.
L
the empirical influence function values. If sampling by subject, the rows are named with the unique values of subject. Includes attributes "method" (which is set to "influence") and "epsilon".
estimate
data frame with columns containing the mean of the replicates, and the estimated bias and standard error. In addition, if weights is missing, it includes columns containing estimates of acceleration, z0, and cq, which are used by other bootstrap procedures.
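
For orientation, the standard influence-based formulas from the references (Efron 1982; Davison and Hinkley 1997) for the standard error and the acceleration in the one-sample, equal-weights case are shown below, with L_i denoting the influence values and n the number of observations. The second expression corresponds to the (n-1) divisor described under unbiased. That influence computes exactly these expressions is an assumption; grouped and weighted cases differ.

```latex
\widehat{\mathrm{SE}} = \frac{1}{n}\sqrt{\sum_{i=1}^{n} L_i^2},
\qquad
\widehat{\mathrm{SE}}_{\mathrm{unbiased}} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n} L_i^2},
\qquad
\hat{a} = \frac{\sum_{i=1}^{n} L_i^3}{6\left(\sum_{i=1}^{n} L_i^2\right)^{3/2}}
```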

DETAILS:

The empirical influence values measure the effect on statistic of perturbing the empirical (weighted) distribution represented by data. The i'th influence value is essentially the derivative in the "direction" of the i'th observation (or subject, if sampling by subject). The derivatives are approximated with finite difference quotients by reweighting the original distribution.
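
The finite-difference idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the internal implementation of influence: the helper names empInfluence and wmean are hypothetical, equal weights are assumed, and a one-sided difference is used for simplicity.

```r
# Hypothetical helper (not part of the library): one-sided finite-difference
# approximation to empirical influence values under equal weights.
empInfluence <- function(x, stat, epsilon = 0.001) {
  n <- length(x)
  w0 <- rep(1/n, n)              # empirical distribution: equal weights
  t0 <- stat(x, w0)              # statistic at the unperturbed weights
  L <- numeric(n)
  for(i in 1:n) {
    w <- (1 - epsilon) * w0      # shrink all weights ...
    w[i] <- w[i] + epsilon       # ... and move mass epsilon to observation i
    L[i] <- (stat(x, w) - t0) / epsilon
  }
  L
}

# Weighted mean: an example statistic with an argument named weights
wmean <- function(x, weights) sum(weights * x) / sum(weights)

x <- c(1, 2, 4, 9)
empInfluence(x, wmean)   # for the mean, close to x - mean(x)
```

For a linear statistic such as the mean, the difference quotient does not depend on epsilon, which makes it a convenient check.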

The name "Splus.resamp.weights" is reserved for internal use by influence. To avoid naming conflicts, that name cannot be used as a variable name in data, if data is a data frame.

When statistic is an expression, for example mean(x), a modified expression mean(x, weights = Splus.resamp.weights) is created. Only calls to functions that have an argument named weights are modified; for example, sum(x)/length(x) would fail because sum does not have a weights argument. If your expression calls a function with a "hidden" weights argument (for example, one that accepts weights as part of its ... list), use the modifiedStatistic argument to specify the call, e.g. modifiedStatistic = myFun(x, weights = Splus.resamp.weights). An expression such as mean(y[a==1]) is converted to mean(y[a==1], weights = Splus.resamp.weights), which will fail because the weights vector is not subscripted along with y. In such cases, pass a function that performs the desired calculations, or use
modifiedStatistic = mean(y[a==1], weights = Splus.resamp.weights[a==1])
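
The alternative of passing a function can be sketched as follows. The function name subsetMean and the small data frame are hypothetical; the point is only that the weights are subscripted together with y.

```r
# Hypothetical statistic: weighted mean of y over rows with a == 1.
# The weights are subscripted along with y, which the automatic
# rewriting of mean(y[a==1]) does not do.
subsetMean <- function(data, weights) {
  keep <- data$a == 1
  sum(weights[keep] * data$y[keep]) / sum(weights[keep])
}

df <- data.frame(y = c(2, 4, 6, 8), a = c(1, 0, 1, 1))
subsetMean(df, weights = rep(1/4, 4))   # equals mean(df$y[df$a == 1])
```

Because the function accepts data as its first argument and has an argument named weights, it has the form described for the statistic argument and could presumably be supplied as influence(df, subsetMean).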

For statistics which are not smooth functions of weights, derivatives calculated using small values of epsilon will be unstable. Consider a larger value of epsilon for such statistics, e.g. epsilon=1/sqrt(n) (the "butcher knife").
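
The effect of epsilon on a non-smooth statistic can be seen in a small self-contained sketch. The weighted median definition and the helper fdInfluence are hypothetical illustrations using one-sided differences and equal weights; they are not the library's code.

```r
# Hypothetical weighted median: smallest value whose cumulative weight
# reaches one half (an illustrative definition, not the library's).
wmedian <- function(x, weights) {
  o <- order(x)
  cw <- cumsum(weights[o]) / sum(weights)
  x[o][min((1:length(x))[cw >= 0.5])]
}

# One-sided finite-difference influence values (illustrative helper)
fdInfluence <- function(x, stat, epsilon) {
  n <- length(x)
  w0 <- rep(1/n, n)
  t0 <- stat(x, w0)
  L <- numeric(n)
  for(i in 1:n) {
    w <- (1 - epsilon) * w0
    w[i] <- w[i] + epsilon
    L[i] <- (stat(x, w) - t0) / epsilon
  }
  L
}

x <- c(3, 1, 4, 1.5, 9, 2.6, 5, 3.5, 8)
n <- length(x)
Lsmall <- fdInfluence(x, wmedian, epsilon = 0.001)       # all zero here
Lbig   <- fdInfluence(x, wmedian, epsilon = 1/sqrt(n))   # informative values
```

With the small epsilon no perturbation is large enough to move the median, so every difference quotient is zero; the larger step moves it for most observations.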

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, Society for Industrial and Applied Mathematics, Philadelphia.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, San Francisco: Chapman & Hall.

Hesterberg, T.C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations," Journal of Computational and Graphical Statistics, 4, 113-133.

BUGS:

influence can fail if statistic calls a modeling function like lm. See for details.

SEE ALSO:

and do many similar calculations.

More details on many arguments, see .

Print, summarize, plot: , , , .

Description of the object, extract parts: , , .

Confidence intervals: , .

Modify an "influence" object: .

For an annotated list of functions in the package, including other high-level resampling functions, see: .

EXAMPLES:

# Influence in robust estimation 
set.seed(1); x <- rcauchy(40) 
influence.obj <- influence(x, location.m) 
plot(x, influence.obj$L)  # outliers have less influence 
 
# influence function is useful for linear approximations 
obj <- bootstrap(x, location.m, B=200, save.indices=T) 
plot(indexMeans(influence.obj$L, obj$indices), 
     obj$replicates) 
 
# Use extra quantities for BCa interval 
limits.bca(obj, acceleration = influence.obj$estimate$accel, 
           z0 = influence.obj$estimate$z0) 
 
# Sampling by subject (type of auto) 
influence(fuel.frame, mean(Fuel), subject = Type)$L