Issues, Problems and Workarounds for Resampling Functions

DESCRIPTION:

This is a collection of issues and problems (and workarounds) with resampling functions, including bootstrap, jackknife, and influence. If you encounter other problems, browse to http://spotfire.tibco.com/support and register for an account.

Sections in this file:
Randomness
Non-functional statistics
Mismatch in number of observations
Scoping problems (this has a number of subsections)
Intermittent Statistic Failure (two subsections)

RANDOMNESS:

Bootstrap results are random, and depend on the random number seed used, the number of blocks used, the order of the data, and the names of groups and subjects (the sorted names determine the order in which sample indices are drawn). These factors normally cause only small differences in bootstrap results, and you can reduce those differences by increasing the number B of bootstrap replications.
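For example, here is a minimal sketch (x is hypothetical data) showing that two seeds give slightly different results, and that a larger B makes the results more stable:
set.seed(1)
x <- rnorm(30)                           # hypothetical data
bootstrap(x, mean, B = 250, seed = 1)    # the printed summary statistics ...
bootstrap(x, mean, B = 250, seed = 2)    # ... typically differ slightly between seeds
bootstrap(x, mean, B = 10000, seed = 1)  # with larger B the two seeds
bootstrap(x, mean, B = 10000, seed = 2)  # agree much more closely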

NON-FUNCTIONAL STATISTICS:

Many of the resampling functions implicitly assume that the statistic is "functional" -- that it depends only on the empirical distribution (assuming equal probabilities on all observations), not on additional information such as the sample size. A functional statistic would return the same value if all observations were repeated the same number of times. Examples of statistics which are not functional include modeling functions that use smoothing parameters that depend on n, and var() when called without weights and with unbiased=T. Exercise care when using a non-functional statistic, as the assumptions underlying the resampling methods may be violated. Also note that some resampling functions assign weights, which may change the behavior of the statistic. For example, var() is not functional by default, but can be made so by specifying unbiased=F or by supplying weights:

var(1:5)                         # 2.5 -- not functional by default
var(rep(1:5,2))                  # 2.22222
var(1:5, unbiased=F)             # 2   -- unbiased=F gives the functional version
var(rep(1:5,2), unbiased=F)      # 2
var(1:5, weights=rep(1/5,5))     # 2   -- weights force the functional version
var(rep(1:5,2), weights=rep(1/10,10))  # 2

Some resampling functions add weights when calling the statistic (e.g. influence), others do not (bootstrap and jackknife).
bootstrap(1:5, var, B=3)$observed  # 2.5
jackknife(1:5, var)$observed       # 2.5
influence(1:5, var)$observed       # 2 -- calculated with weights

The results are self-consistent within each function, because the observed value and all replicates are computed in the same way (with or without weights). However, results differ across functions, both the observed values and derived quantities -- jackknife indicates that var is unbiased (true for the default calculation of var without weights, if the data are independent and identically distributed), while influence indicates that it is biased (true for the functional form of var).

MISMATCH IN NUMBER OF OBSERVATIONS:

This problem affects jackknife and functions that call jackknife, including limits.bca and summary. If some but not all vectors used by the statistic are contained in the data argument, then only the vectors included in data have observations omitted.

West   <- state.region == "West"
Income <- state.x77[,"Income"]  
# jackknife(Income, mean(Income[West])-mean(Income[!West]))  # fails 

That jackknife call fails. bootstrap works, but limits.bca or summary will fail:
bs <- bootstrap(Income, mean(Income[West])-mean(Income[!West]), group = West)  
# limits.bca(bs) # fails 
# summary(bs)    # fails 

One workaround is to include all vectors in the data. For example:
myData <- data.frame(West = (state.region == "West"), 
                     Income = state.x77[,"Income"]) 
jackknife(myData, mean(Income[West])-mean(Income[!West])) 

Alternatively, to make summary and limits.bca work after bootstrapping, you may avoid jackknife by using another method to calculate L (which is used to calculate the acceleration); see resampGetL.

Incidentally, bootstrap2 makes bootstrapping the difference in two means much easier.

SCOPING PROBLEMS:

This refers to the general problem of a function not finding data because the data is not in the search path for that function. The problems below are illustrated for bootstrap, but also apply to jackknife and possibly other resampling functions.

SCOPING PROBLEM - when calling from a function:

Bootstrap or jackknife may fail when called from within another function. Here is an example.

# Scoping problem when bootstrap is called from a function, 
# and statistic is an expression that uses objects defined in 
# the function. 
# 
fun1 <- function(){ 
  x <- 1:20 
  fun2 <- function(x) mean(x, trim=.2) 
  bootstrap(data = x, statistic = fun2(x)) 
 } 
fun1() 

This fails with the message that function fun2 (which is defined in the frame of fun1) could not be found. One workaround is to assign fun2 to frame 1 before calling bootstrap:
fun1 <- function() 
{ 
  x <- 1:20 
  fun2 <- function(x) mean(x, trim=.2) 
  assign("fun2", fun2, frame = 1)  # workaround 
  bootstrap(data = x, statistic = fun2(x)) 
} 
fun1() 

One additional caution -- if you don't use the workaround, and fun2 is defined on a permanent database, then the permanent copy is used (not the copy inside fun1).

This scoping problem occurs when argument statistic is an expression. When statistic is a name or a function, it is not necessary to assign to frame 1 in most cases. An alternate workaround is therefore to replace "statistic = fun2(x)" by "statistic = fun2".
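For example, this version of fun1 passes the statistic as a name, and should not require the assignment to frame 1:
fun1 <- function()
{
  x <- 1:20
  fun2 <- function(x) mean(x, trim=.2)
  bootstrap(data = x, statistic = fun2)  # statistic is a name, not an expression
}
fun1()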

An exception occurs when bootstrapping an lm object:

SCOPING PROBLEM - bootstrap.lm:

fun1 <- function() 
{ 
  x <- 1:20 
  set.seed(0) 
  y <- sort(runif(20)) 
  fit <- lm(y~x) 
  fun2 <- function(x) coef(x) 
  bootstrap(data = fit, statistic = fun2) 
} 
fun1() 

This also fails, even though statistic is a name. The hidden behavior here is that the lm method for bootstrap converts statistic to an expression before calling the default method for bootstrap. Assigning fun2 to frame 1 overcomes the error.
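A sketch of that workaround, mirroring the earlier example:
fun1 <- function()
{
  x <- 1:20
  set.seed(0)
  y <- sort(runif(20))
  fit <- lm(y~x)
  fun2 <- function(x) coef(x)
  assign("fun2", fun2, frame = 1)  # workaround
  bootstrap(data = fit, statistic = fun2)
}
fun1()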

SCOPING PROBLEM - data:

A similar scoping problem can occur with the data argument, when data is a fitted model object defined in a frame other than the one containing the call to bootstrap. Here is an example using lm.

fun1 <- function() 
{ 
  x <- 1:20 
  set.seed(0) 
  y <- sort(runif(20)) 
  fit <- lm(y~x) 
  fun2 <- function(fit) bootstrap(fit, coef, B=100) 
  fun2(fit) 
} 
fun1() 

Here bootstrap is called from fun2, but the objects x and y are defined in the frame of fun1. While they do not appear directly in the call to bootstrap, they are used indirectly (through the lm object fit). So again, one workaround is to assign x and y to frame 1 before calling fun2; a sketch of that workaround follows the next example. Another solution is to pass x and y to fun2:
fun1 <- function() 
{ 
  x <- 1:20 
  set.seed(0) 
  y <- sort(runif(20)) 
  fit <- lm(y~x) 
  fun2 <- function(fit, x, y){ 
    # make sure that copies of x and y are defined in fun2 
    x <- x; y <- y 
    bootstrap(fit, coef, B=100) 
  } 
  fun2(fit, x, y) 
} 
fun1() 
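For completeness, here is a sketch of the first workaround mentioned above, assigning x and y to frame 1 before calling fun2:
fun1 <- function()
{
  x <- 1:20
  set.seed(0)
  y <- sort(runif(20))
  fit <- lm(y~x)
  fun2 <- function(fit) bootstrap(fit, coef, B=100)
  assign("x", x, frame = 1)  # workaround
  assign("y", y, frame = 1)
  fun2(fit)
}
fun1()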

SCOPING PROBLEM - using the data argument in a modeling function:

Consider the following example using jackknife.

jackknife(data = kyphosis, 
          statistic = coef(glm(Kyphosis ~ kyphosis[[3]], data = kyphosis))) 

This causes the error
Problem in eval(statistic, c(list(kyphosis = data), ..: Length of 
kyphosis[[3]] (variable 2) is 81 != length of others (80) 

This known bug in Spotfire S+ occurs during the evaluation of the statistic by jackknife. There are two different types of references to the data in the formula argument to glm. The first variable, Kyphosis, is the name of the first variable in the kyphosis data set. The second variable, kyphosis[[3]], references the third column of kyphosis by index rather than by name. It is the latter which is mishandled -- the third column of the full dataset is used rather than that of the jackknifed (one-row-deleted) data.

There are several workarounds. One is to use variable names rather than column indices: the third column of kyphosis is "Number", so we can obtain the desired results using

jackknife(data = kyphosis, 
          statistic = coef(glm(Kyphosis ~ Number, data = kyphosis))) 

Another workaround is to avoid using the data argument to glm (it is redundant, since the data argument to jackknife specifies the data to be used).

jackknife(data = kyphosis, 
          statistic = coef(glm(Kyphosis ~ kyphosis[[3]]))) 

Another workaround is to use assign.frame1 = T, which puts the jackknifed version of the data on frame 1, where it hides the original.
jackknife(data = kyphosis, assign.frame1 = T, 
          statistic = coef(glm(Kyphosis ~ kyphosis[[3]], data = kyphosis))) 

A final workaround is to use the new glm method for jackknife, which has the additional benefit of generally executing faster.
jackknife(data = glm(Kyphosis ~ kyphosis[[3]], data = kyphosis), 
          statistic = coef) 

The same problem occurs with bootstrap, though it is more insidious. The data in the call
boot.obj <- bootstrap(data = kyphosis,
                      statistic = coef(glm(Kyphosis ~ kyphosis[[3]],
                                           data = kyphosis)),
                      B = 100, seed = 0)

is mishandled in the same way as the original jackknife example. But there is no error message since, for bootstrap, the original data and the resampled data are both the same size. The results, however, are wrong. An attempt to use summary on boot.obj,
summary(boot.obj)
causes the same jackknife-type error message we got before, because jackknife is called by summary to evaluate BCa limits. It is important to remember, however, that boot.obj contains incorrect results, independent of any future call to summary. Any of the above workarounds for jackknife can be used to get correct results for bootstrap.
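For example, using variable names rather than column indices gives correct bootstrap results:
boot.obj <- bootstrap(data = kyphosis,
                      statistic = coef(glm(Kyphosis ~ Number, data = kyphosis)),
                      B = 100, seed = 0)
summary(boot.obj)  # now works, and boot.obj contains correct results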

SCOPING PROBLEM - modeling functions:

The same scoping problem occurs in influence, resampGetL, limits.tilt, and limits.abc; it is another manifestation of the known Spotfire S+ scoping bug mentioned above. For example,

xy <- data.frame(x = 1:10, y = sort(runif(10))) 
influence(xy, coef(lm(y~x, data = xy)))  # fails 

fails with a message about not finding weights. As with the jackknife bugs in the previous section, the data argument to lm is redundant here, and the error disappears if we get rid of it.
influence(xy, coef(lm(y~x))) # works
Setting assign.frame1 = T also gets around the error.
influence(xy, coef(lm(y~x)), assign.frame1 = T) # works
Similar problems occur with the "influence" method of resampGetL, which invokes influence. Thus the following fails.
bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B=20, seed = 0) 
resampGetL(bfit, method="influence")  # fails 

We can work around this error by getting rid of the data argument.
bfit <- bootstrap(xy, coef(lm(y~x)), B=20, seed = 0) 
resampGetL(bfit, method="influence")  # works 

Or we can use the lm method for bootstrap.
bfit <- bootstrap(lm(y~x, data = xy), coef, B=20, seed = 0) 
resampGetL(bfit, method="influence")  # runs, but all values are zero

Note that the other methods for resampGetL ("jackknife", "ace", etc.) do not have these problems.
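For instance, with the first bfit above (the one whose "influence" method failed), the following should work:
bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B=20, seed = 0)
resampGetL(bfit, method="jackknife")  # works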

The same scoping problem manifests itself in limits.tilt, tiltAfterBootstrap, and limits.abc when the statistic involves modeling functions. For example,

bfit <- bootstrap(xy, coef(lm(y~x, data = xy)), B=20, seed = 0) 
limits.tilt(bfit) # fails 
tiltAfterBootstrap(bfit) # fails 
limits.abc(fuel.frame, coef(lm(cform, data = fuel.frame))) # fails 

The same cures work for these functions.
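For example, removing the redundant data argument (one of the cures above) should let the first two calls run; a sketch:
bfit <- bootstrap(xy, coef(lm(y~x)), B=20, seed = 0)
limits.tilt(bfit)         # works
tiltAfterBootstrap(bfit)  # works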

INTERMITTENT STATISTIC FAILURE - nls:

Here is an example where bootstrap fails because the statistic sometimes fails.

# fit Michaelis and Menten's original data. 
conc   <- c(0.3330, 0.1670, 0.0833, 0.0416, 0.0208, 0.0104, 0.0052) 
vel    <- c(3.636, 3.636, 3.236, 2.666, 2.114, 1.466, 0.866) 
Micmen <- data.frame(conc=conc, vel=vel) 
param(Micmen,"K")  <- 0.02; 
param(Micmen,"Vm") <- 3.7 
fit <- nls(vel~Vm*conc/(K+conc), data = Micmen) 
set.seed(0) 
bootstrap(Micmen, coef(nls(vel~Vm*conc/(K+conc), data = Micmen)), B=200, 
          seed=0) 
# fails, step factor below minimum on replication 144 

One workaround is to construct a new statistic that detects failures in nls and simply returns NA for those samples. The function try can be used for this purpose.
# Use try().  If the result is of class "Error", then return
# rep(NA, same length as other replications).
try.expr <- Quote({
  result <- try(coef(nls(vel~Vm*conc/(K+conc), data = Micmen)))
  if(is(result, "Error")) rep(NA, 2) else result
})
bootstrap(Micmen, try.expr, B=200, seed=0)

The error still occurs for the bad sample, but the remaining samples can be processed.
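If desired, you can count how many replications failed by checking the replicates component of the result (this sketch assumes the usual resamp object structure):
boot.na <- bootstrap(Micmen, try.expr, B=200, seed=0)
sum(apply(is.na(boot.na$replicates), 1, any))  # number of failed samples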

INTERMITTENT STATISTIC FAILURE:

The same type of problem can occur with other modeling functions. One example occurs with lm using factor data. If one of the factor categories is relatively rare, some bootstrap samples will not contain that category.

abc <- data.frame(x=1:20, y=sort(runif(20)), 
                abc=factor(c(rep("a",8), rep("b",9), rep("c",3)))) 
bootstrap(lm(y~x+abc, data=abc), coef, B=100, seed = 0) 

An error occurs in lm due to the singular fit. It does not help to specify singular.ok=T in the call to lm:
bootstrap(lm(y~x+abc, data=abc, singular.ok = T), coef, B=100, 
          seed = 0) 

Now bootstrap complains that the statistic returns results with varying length, because coef returns fewer coefficients on those samples lacking category "c". One solution is to use the default singular.ok=F and to handle failures using try, as above.
try.expr <- Quote({result <- try(coef(data)) 
                   if(is(result, "Error")) rep(NA, 4) else result}) 
bootstrap(lm(y~x+abc, data=abc), try.expr, B=100, seed = 0) 

Another solution is to use singular.ok = T and coef.default rather than coef.
bootstrap(lm(y~x+abc, data=abc, singular.ok = T), coef.default, B=100, 
          seed = 0) 

This works because coef (which here dispatches to the lm method) only returns coefficients corresponding to the non-singular part of the model, while coef.default returns all coefficients, including NA for the singular parts.

As an aside, note that coef.default simply extracts the coefficients component from the object. Therefore an equivalent workaround is

bootstrap(lm(y~x+abc, data=abc, singular.ok = T), data$coefficients, B=100, 
          seed = 0) 

MASKED STATISTIC:

If you create a numeric object named mean, then this fails:

bootstrap(x, mean) 

The solution is to remove the object that is masking mean. The same applies to other masked functions. You can check for masked objects using
masked() 
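For example (a minimal illustration, assuming x is a numeric vector):
mean <- 3.14        # accidentally create an object masking the function
bootstrap(x, mean)  # fails
rm(mean)            # remove the masking object
bootstrap(x, mean)  # works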

UPDATE FAILS:

The lm method for bootstrap, and some other methods for bootstrap and jackknife, create objects with a modified call component, which prevents update from working properly. For example:

fit <- lm(Fuel ~ Weight, data = fuel.frame) 
boot1 <- bootstrap(fit, coef, B=30, seed=0) 
boot2 <- update(boot1, lmsampler = "residuals") 

That gives a warning about unrecognized arguments. The reason is that the lm method for bootstrap modifies the call:
boot1$call 
# bootstrap(data = lm(formula = Fuel ~ Weight, data = fuel.frame, 
#           method =  "model.list"), statistic = coef.default(lm(data)), 
#           B = 30, seed = 0) 

The first workaround is to give the command by hand:
boot2b <- bootstrap(fit, coef, B=30, seed = 0, lmsampler = "residuals") 

The second is to replace the call component with the actual call, then update:
boot1b <- boot1  # work with a copy in case you need boot1 later 
boot1b$call <- boot1$actual.calls[[1]] 
boot2c <- update(boot1b, lmsampler = "residuals") 

The two methods yield the same results (if using the same seed):
all.equal(boot2b, boot2c) # TRUE