Random samples of indices for bootstrapping

DESCRIPTION:

Generate random indices for use by bootstrap and other high-level resampling functions.

USAGE:

samp.bootstrap(     n, B, size = n - reduceSize, reduceSize = 0, prob = NULL) 
samp.boot.bal(      n, B, size = n - reduceSize, reduceSize = 0, method = "biased") 
samp.bootknife(     n, B, size = n - reduceSize, reduceSize = 0, njack = 1) 
samp.finite(        n, B, size = n - reduceSize, reduceSize = 0, N, bootknife = F) 
samp.permute(       n, B, size = n - reduceSize, reduceSize = 0, prob = NULL, full.partition = "none") 
samp.permute.old(   n, B) 
samp.combinations(  n, B, k, both = T) 
samp.half(          n, B, size = n/2 - reduceSize, reduceSize = 0) 
samp.blockBootstrap(n, B, size = n - reduceSize, reduceSize = 0, blockLength) 
blockBootstrap(blockLength) 
samp.boot.mc    is deprecated; it is the same as samp.bootstrap 
samp.MonteCarlo is deprecated; it is the same as samp.bootstrap 

REQUIRED ARGUMENTS:

n, size
population size and resample size, respectively. Samples of size size are drawn from the sequence 1:n. (size is not required; by default size = n - reduceSize.)
B
number of resamples to draw.

The remaining arguments are specific to individual samplers:

k
size of the first group, for two-sample permutation problems, or for returning all combinations of k elements out of n.
N
superpopulation size, for finite-population sampling.
blockLength
length of blocks, for block bootstrapping. Must be less than or equal to n.

OPTIONAL ARGUMENTS:

reduceSize
non-negative integer; by default size = n - reduceSize. Setting reduceSize = 1 is useful for avoiding bias; see below.
prob
vector of probabilities. Index i is chosen from 1:n with probability prob[i]. The vector is normalized internally to sum to one; it is an error if length(prob) is not equal to n. A value of NULL implies equal probabilities for each index. A sampler that has this argument may be used for importance sampling. (A quick frequency check appears after this argument list.)
method
character, one of "biased", "unbiased", or "semi"; see below.
njack
integer; create a jackknife sample with njack observations omitted, then draw a bootstrap sample from that.
full.partition
character, one of "first", "last", or "none". If "first" or "last", return, for each sample, the initial or final size elements, respectively, of a full sample of size n. If "none", do not generate full samples. Valid only if size < n; ignored otherwise. See below.
bootknife
logical, if TRUE then a variation of bootknife sampling is used; one observation is omitted from the sample before forming the superpopulation. This is useful for avoiding bias, see below.
both
logical, if TRUE (the default), then return a matrix with n rows, in which the first k rows are all combinations of k elements out of n. If FALSE then return only the first k rows.
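
For example, a quick check (an illustrative sketch; the weights 1:6 and B = 1000 are arbitrary) that prob weights the selection frequencies as described:

set.seed(0)
counts <- tabulate(samp.bootstrap(6, 1000, prob = 1:6))   # raw selection counts
round(counts/sum(counts), 2)     # roughly proportional to (1:6)/sum(1:6)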

VALUE:

matrix with size or n rows and B columns in which each column is one resample, containing indices from 1:n for subscripting the original data.
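
For example (an illustrative sketch with hypothetical data), the index matrix can be used directly to subscript the data and compute bootstrap replicates by hand:

set.seed(1)
x <- rnorm(9)                            # hypothetical data
inds <- samp.bootstrap(length(x), 5)     # 9 x 5 matrix of indices into x
apply(matrix(x[inds], nrow = nrow(inds)), 2, mean)   # 5 bootstrap replicates of mean(x)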

SIDE EFFECTS:

These functions cause creation of the dataset .Random.seed if it does not already exist, otherwise its value is updated.

DETAILS:

These samplers are typically called multiple times by bootstrap, to generate indices for a block of, say, B=100 replications at a time (the value of B here corresponds to the block.size argument to bootstrap).

You may write your own sampler. A sampler must have arguments n and B. If a sampler has a prob argument then it may be used for importance sampling.
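
For example, a sampler of your own might look like the following minimal sketch (my.sampler is hypothetical, not part of the package); it simply draws with replacement, like samp.bootstrap:

my.sampler <- function(n, B, size = n)
{
  # return a size x B matrix of indices into 1:n
  matrix(sample(n, size * B, replace = T), nrow = size, ncol = B)
}
# bootstrap(x, mean, sampler = my.sampler)   # hypothetical usage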

Additional arguments may be passed in three ways: (1) using the sampler.args argument to bootstrap; (2) by passing an expression such as sampler = samp.bootstrap(size = 100) (arguments set in this way override those set by sampler.args); or (3) using a "constructor" function such as blockBootstrap to create a copy of a sampler function (here samp.blockBootstrap) which has default values for the additional arguments.
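
For example (a sketch; the data 1:25 are a placeholder), the three ways of supplying blockLength = 5 are equivalent:

x <- 1:25
bootstrap(x, mean, seed = 0, sampler = samp.blockBootstrap,
          sampler.args = list(blockLength = 5))                               # way (1)
bootstrap(x, mean, seed = 0, sampler = samp.blockBootstrap(blockLength = 5))  # way (2)
bootstrap(x, mean, seed = 0, sampler = blockBootstrap(5))                     # way (3), constructor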

If importance sampling is not used, the prob argument may still be supplied as an additional argument, resulting in sampling from a weighted empirical distribution; however, the observed statistic will not be consistent with that weighted empirical distribution. Instead, consider using importance sampling.

Some functions that operate on a bootstrap object assume that simple random sampling with equal probabilities and size=n (or approximately n, see below) was used, and may give incorrect results if that is not the case. In other words, they expect samp.bootstrap or the similar samp.boot.bal and samp.bootknife.

AVOIDING DOWNWARD BIAS IN STANDARD ERRORS:

Bootstrapping typically gives standard error estimates which are biased downward; e.g. the ordinary bootstrap standard error for a mean is sqrt((n-1)/n) * s/sqrt(n) (plus random error when B < infinity), where s = stdev(x) is the usual sample standard deviation. This is too small by a factor of sqrt((n-1)/n). When stratified sampling is used, the corresponding downward bias depends on stratum sizes and may be substantial. There are two easy remedies for this: use samp.bootknife, or samp.bootstrap(reduceSize = 1). The latter sets the sampling size for each stratum to 1 less than the stratum size.
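
For example (an illustrative sketch; individual runs vary randomly), both remedies target the usual formula standard error s/sqrt(n):

set.seed(0)
x <- rnorm(10)                  # hypothetical data
sqrt(var(x)/length(x))          # formula standard error, s/sqrt(n)
bootstrap(x, mean, seed = 0)$estimate$SE                                            # biased downward
bootstrap(x, mean, seed = 0, sampler = samp.bootknife)$estimate$SE                  # remedy 1
bootstrap(x, mean, seed = 0, sampler = samp.bootstrap(reduceSize = 1))$estimate$SE  # remedy 2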

STRATIFIED SAMPLING:

For stratified sampling (the group argument to bootstrap), the sampler is called separately for each group (stratum). If size or reduceSize is used, then you must set group.order.matters = FALSE when calling bootstrap (otherwise size mismatches will occur, as the code attempts to place resampled strata in the same positions as the original data).
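
For example (a hedged sketch; the data and strata are hypothetical):

x <- rnorm(12)
g <- rep(c("a", "b"), c(5, 7))    # two strata of unequal size
bootstrap(x, mean, group = g, seed = 0,
          sampler = samp.bootstrap(reduceSize = 1),
          group.order.matters = F)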

OVERVIEW OF THE SAMPLERS:

samp.bootstrap
provides simple bootstrap resamples, with replacement.
samp.permute
returns random permutations in the simplest case. More generally, it returns random samples drawn with "minimal replacement": if, after normalizing, max(prob) <= 1/size, the indices in each sample are drawn without replacement; thus the default values size=n, prob=NULL generate simple permutations of 1:n. Otherwise there are floor(size*prob[i]) or ceiling(size*prob[i]) copies of index i in each sample. The algorithm ensures that the selection probabilities prob apply to the rows of the returned matrix; that is, the relative frequency of index i in each row approaches prob[i] as B increases.

Calling samp.permute with full.partition = "first", size = m and then (after re-setting the seed) with full.partition = "last", size = n-m produces complementary index samples which, when rbind-ed together, give an equivalent set of indices with size = n. For example, if prob is not provided, the rbind-ed results form permutations of 1:n. Note, however, that this will not give the same results as calling samp.permute with size=n, because the algorithm for size equal to n is different from that for size less than n.
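
For example (illustrative; note the seed is re-set before the second call):

set.seed(0)
first <- samp.permute(6, 4, size = 2, full.partition = "first")
set.seed(0)
last  <- samp.permute(6, 4, size = 4, full.partition = "last")
apply(rbind(first, last), 2, sort)   # each column should be the permutation 1:6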

samp.permute.old
is provided for backward compatibility with the version of samp.permute in S-PLUS 6.0 and earlier. It is slower and less flexible, and may be removed in future versions of Spotfire S+.
samp.combinations
is useful for complete enumeration in two-sample permutation testing applications, returning all ways to divide a sample into two groups, or optionally returning only the indices for the first group. It requires that B == choose(n, k).
samp.permutations
is useful for complete enumeration in one-sample permutation testing applications, returning all permutations of a sample. It requires that B == factorial(n).
samp.bootknife
provides samples of size size drawn with replacement from jackknife samples (obtained by omitting one of the values 1:n). This produces bootstrap estimates of squared standard error which are unbiased for a sample mean, with expected value s^2/n, where s^2 is the sample variance calculated with the usual denominator of (n-1). In a block of B (block.size) resamples, each observation is omitted B/n times (rounded up or down if n does not divide B).
samp.finite
does finite-population sampling. If N is a multiple of n (or of n-1, if bootknife=TRUE), then a superpopulation is created by repeating each observation N/m times (where m=n or m=n-1), and samples of size size are drawn without replacement.

If N is not a multiple of m, then the superpopulation size varies between resamples, with r copies of each original observation, where r=ceiling(N/m) or trunc(N/m), with probabilities chosen to give approximately the correct bootstrap variance for linear statistics.
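
For example (an illustrative sketch; the values of N are arbitrary):

set.seed(0)
samp.finite(6, 8, N = 18)                  # N a multiple of n
samp.finite(6, 8, N = 20)                  # N not a multiple of n
samp.finite(6, 8, N = 20, bootknife = T)   # omit one observation before forming the superpopulation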

samp.half
does half-sampling -- samples of size n/2 by default. size may be a half-integer; if so, alternate samples contain a zero (i.e. a smaller sample). This is a quick alternative to the ordinary bootstrap, with approximately the same standard error.
samp.blockBootstrap
does ordinary block bootstrapping (useful for time series), with overlapping blocks (no wrap-around).
blockBootstrap
simplifies using samp.blockBootstrap; see example at bottom.
samp.boot.bal
does (partially-) balanced resampling, separately within each group of resamples. This is useful for estimating the bias of a statistic, but has little effect (or worse, bias) on estimating standard errors or confidence limits.

The default "biased" method is balanced -- each observation appears exactly B times in the result. In this case size*B must be a multiple of n. It is biased because rows in its result are not independent. The bias is of order O(1/B), (where B is the block.size used in calling ) and tends to underestimate bootstrap standard errors and produce confidence intervals which are too narrow. Variances are too small by a factor of about $(1-1/B)$.

For the "unbiased" method, each row is generated independently. If n divides B then there are exactly B/n copies of 1:n in each row, and the result is balanced. Otherwise there are either floor(B/n) or ceiling(B/n) copies in each row, and the result is not exactly balanced.

For the "semi" method, if n divides B then results are exactly as for the "unbiased" method. If n divides size*B results are balanced, but there is bias, with variances biased downward by a factor of approximately (1-(B%%n)/B^2).

NOTE:

Arguments n and B should be in that order. The number and order of other arguments may change; e.g. a prob argument may be added to additional samplers to support importance sampling.

REFERENCES:

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, San Francisco: Chapman & Hall.

Hesterberg, T.C. (1999), "Smoothed bootstrap and jackboot sampling," Technical Report No. 87, http://www.insightful.com/Hesterberg. Note: the name "jackboot" has since been changed to "bootknife".

Hesterberg, T.C. (2004), "Unbiasing the Bootstrap - Bootknife Sampling vs. Smoothing", Proceedings of the Section on Statistics and the Environment, American Statistical Association, pp. 2924-2930.

SEE ALSO:

bootstrap.

For an annotated list of functions in the S+Resample package, see the package overview.

EXAMPLES:

samp.bootstrap(6, 8) 
samp.bootstrap(6, 8, size=12) 
samp.boot.bal(6, 8) # method = "biased" 
samp.boot.bal(6, 8, method = "unbiased") 
samp.boot.bal(6, 8, method = "semi") 
samp.permute(6, 8) 
samp.permute(6, 8, prob=(1:6)) 
samp.permute(6, 8, size=12, prob=(1:6)) 
samp.combinations(6, choose(6,4), 4) 
samp.combinations(6, choose(6,4), 4, both=F) 
samp.permutations(4, factorial(4)) 
samp.bootknife(6, 8) 
samp.bootknife(6, 8, size=12) 
samp.half(6, 8) 
samp.half(5, 8) 
 
# Block bootstrapping 
bootstrap(1:25, mean) 
bootstrap(1:25, mean, sampler = blockBootstrap(5), seed=0) 
# Previous line is equivalent to next two: 
bootstrap(1:25, mean, sampler = samp.blockBootstrap, 
                      sampler.args = list(blockLength = 5), seed=0) 
# The data are positively correlated, so block versions give 
# larger standard errors. 
 
# Compare versions of balanced bootstrapping 
set.seed(0) 
tabulate(samp.boot.bal(6, 8)) # balanced 
tabulate(samp.boot.bal(6, 8, method = "unbiased")) # not balanced 
tabulate(samp.boot.bal(6, 8, method = "semi")) # balanced 
temp <- bootstrap(1:5, mean, block.size=9, seed=0) 
temp$estimate$SE 
update(temp, sampler = samp.boot.bal)$estimate$SE # smaller 
update(temp, sampler = samp.boot.bal, 
       sampler.args = list(method = "semi"))$estimate$SE