Generate Random Samples or Permutations of Data

DESCRIPTION:

Generate a random sample of size observations from the population, or a sample from the integers 1 to n. This is a generic function; methods exist for data.frame, bdVector, bdFrame, and seriesVirtual; the default method is for vectors.

USAGE:

sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
rsample(n, size = n, replace = F, prob = NULL,
        bigdata = F, minimal = NULL, ..., order = T)

REQUIRED ARGUMENTS:

x
vector, data frame, bdVector, bdFrame, or other object giving a population to sample from, or a positive integer giving the size of the population n (then the population is 1:n). Missing values (NAs) are allowed.

OPTIONAL ARGUMENTS:

size
sample size. The default is the same as the population size, and thus (with replace=FALSE) generates a random permutation.
replace
if TRUE, sampling is done with replacement; otherwise sampling is without replacement or with minimal replacement (see DETAILS).
prob
vector or bdVector of probabilities of length n, giving probabilities of selection for each of the elements of x. The elements of prob will be normalized to sum to one. The default NULL gives equal probabilities for each element of the population.
n
Positive integer giving the size of the population; a sample is drawn from 1:n. This may be given in place of x, in order to avoid ambiguity. Only one of x and n should be supplied.
...
additional arguments, which are passed to rsample.
bigdata
logical, if TRUE then a bdVector is returned; default is FALSE. The specific sample produced may differ based on this setting.
minimal
logical, if TRUE sampling is to be done with minimal replacement rather than no replacement (this argument is ignored if replace=TRUE), and a different algorithm is used for sampling with unequal probabilities. Default is TRUE when sampling with unequal probabilities. See details below.
order
logical, if TRUE (the default) then the output should be randomly ordered. If FALSE then some methods will skip a final step to randomly reorder output; particularly for big data this is faster.

VALUE:

If n is supplied (or if x is a positive integer), the result is a sample from 1:n. This is a bdVector if bigdata=T, otherwise an ordinary vector.

If x is a data frame or a bdFrame, then the result is a sample of the rows.

If x is a vector or bdVector, then the result is a sample of the observations.

SIDE EFFECTS:

The function sample causes creation of the dataset .Random.seed if it does not already exist, otherwise its value is updated.
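
Because the sample depends on the state held in .Random.seed, resetting the seed before a call reproduces the same sample. The lines below are a minimal sketch of that behavior; the seed value 42 and the variable names are arbitrary choices for illustration.

```r
set.seed(42)              # fix the generator state (42 is an arbitrary seed)
first <- sample(10)       # random permutation of 1:10; updates .Random.seed
set.seed(42)              # restore the same starting state
second <- sample(10)      # repeats the same permutation
all(first == second)      # TRUE
```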

DETAILS:

Most of the methods for sample call rsample, then use those values to subscript x.
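
The "draw indices, then subscript" pattern can be sketched as follows. This is an illustration rather than the code of any actual method: sample(length(x), 3) stands in for the call to rsample so that the lines do not depend on rsample being available, and the objects x and df are made up for the example.

```r
# Vector case: draw positions, then subscript the population.
x <- c("a", "b", "c", "d", "e")
idx <- sample(length(x), 3)      # 3 distinct positions from 1:5
picked <- x[idx]                 # the sampled observations

# Data frame case: draw row numbers, then subscript the rows.
df <- data.frame(id = 1:6, score = c(2.5, 3.1, 4.0, 1.8, 3.3, 2.9))
rows <- sample(nrow(df), 4)      # 4 distinct row numbers from 1:6
dfSample <- df[rows, ]           # a sample of the rows, all columns kept
```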

To generate a sample from 1:n, we recommend either using rsample, or calling sample with the argument n rather than x, because an integer passed as x is ambiguous: it could be a population size or a population with one element. However, for backward compatibility, you can still let the argument x be an integer giving the population size.

n cannot be larger than the largest positive integer on the machine (.Machine$integer.max, 2147483647 on a 32-bit computer).

If x represents a population, it can be any object with a length and for which subscripting works. For example, it can be a vector of character strings. Missing values in x are treated like any other value.

If replace=TRUE, sampling is done with replacement.

If replace=FALSE and prob is not supplied, then minimal=FALSE gives sampling without replacement; an error occurs if size>n. If minimal=TRUE, then sampling is with minimal replacement: if size>n, each observation is included size %/% n times and the remaining size %% n draws are taken without replacement.
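
The equal-probability rule for minimal replacement can be sketched in a few lines. The function minimalSample is a hypothetical helper written for this illustration, not part of S-PLUS, and it does not claim to reproduce the random values that sample(n, size, minimal=T) returns; it assumes size >= 1.

```r
# Hypothetical helper: equal-probability sampling from 1:n
# with minimal replacement.
minimalSample <- function(n, size) {
  full <- rep(1:n, size %/% n)     # every index appears size %/% n times
  extra <- sample(n, size %% n)    # remaining draws, without replacement
  out <- c(full, extra)
  out[sample(length(out))]         # final random reordering
}

counts <- tabulate(minimalSample(10, 25), nbins = 10)
counts   # five of the indices appear 3 times, the other five twice
```

With n=10 and size=25, size %/% n is 2 and size %% n is 5, so no index appears fewer than twice or more than three times.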

If prob is supplied and replace=FALSE, then the algorithm depends on the value of minimal. The default is minimal=TRUE. Specify minimal=FALSE to get the same random values as in S-PLUS versions 6.2 and earlier.

If minimal=FALSE, then values are drawn sequentially with probabilities proportional to prob, excluding elements already drawn. If size>1, this does not give overall selection probabilities proportional to prob; the actual selection probabilities are between those implied by prob and equal probabilities. Different permutations of the same set of outcomes also have different probabilities of being chosen (the order argument is ignored). If size>n an error occurs.
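
The sequential scheme for minimal=FALSE can be sketched as below. sequentialSample is a hypothetical helper for illustration only; it assumes all elements of prob are positive, and it is not guaranteed to reproduce the random values of the built-in function.

```r
# Hypothetical helper: draw `size` indices one at a time, each with
# probability proportional to prob among the indices not yet drawn.
sequentialSample <- function(prob, size) {
  n <- length(prob)
  if (size > n) stop("size > n")
  chosen <- integer(0)
  remaining <- 1:n
  for (i in seq(length = size)) {
    p <- prob[remaining] / sum(prob[remaining])   # renormalize what is left
    k <- sum(runif(1) > cumsum(p)) + 1            # invert the cumulative sums
    k <- min(k, length(remaining))                # guard against rounding
    chosen <- c(chosen, remaining[k])
    remaining <- remaining[-k]                    # exclude the drawn element
  }
  chosen
}

drawn <- sequentialSample(c(.3, .4, .1, .1, .1), 3)
```

Because each draw renormalizes the remaining probabilities, elements with large prob tend to be drawn early, which is why the order of the output is informative and not uniformly random.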

If minimal=TRUE, then the probability of selecting observation j equals size*prob[j] (with prob normalized to sum to one), and every permutation of a set of outcomes has the same probability of being chosen (assuming order=T). The algorithm randomly permutes the values and corresponding probabilities, divides the unit interval into blocks of length proportional to prob, does a systematic sample of size random numbers uniformly on the unit interval, selects the observations for which the uniform numbers fall into the corresponding blocks, then does a final random permutation (if order=TRUE). "Minimal replacement" is not taken literally in the unequal probability case: if size*max(prob)>1 then duplicates may occur, and if size*max(prob)>2 then duplicates are guaranteed.
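
The steps for minimal=TRUE with unequal probabilities can be sketched as follows. systematicSample is a hypothetical helper, and the sketch assumes that "systematic sample" means one uniform random start followed by points spaced 1/size apart; the built-in function may differ in detail and will not return the same random values. It assumes size >= 1.

```r
# Hypothetical helper: systematic sampling with unequal probabilities.
systematicSample <- function(prob, size, order = TRUE) {
  n <- length(prob)
  perm <- sample(n)                        # randomly permute the population
  upper <- cumsum(prob[perm] / sum(prob))  # right endpoints of the blocks
  u <- (runif(1) + 0:(size - 1)) / size    # one random start, spacing 1/size
  idx <- integer(size)
  for (i in 1:size) {
    # the block containing u[i] is one past the number of endpoints below it
    idx[i] <- min(sum(u[i] > upper) + 1, n)
  }
  out <- perm[idx]
  if (order) out <- out[sample(size)]      # final random reordering
  out
}

set.seed(1)
blockCounts <- tabulate(systematicSample(c(.3, .4, .1, .1, .1), 20), nbins = 5)
blockCounts   # 6 8 2 2 2
```

Here each size*prob[j] is a whole number, so the block for index j always contains exactly that many of the equally spaced points, and the counts do not vary from run to run. When size*prob[j] is not a whole number, the count for index j is one of the two neighboring integers.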

If x is a seriesVirtual object (timeSeries, signalSeries, bdTimeSeries, or bdSignalSeries), the result is the sampled (big data) time series or signal series, sorted in ascending order of position unless order=T is explicitly supplied.

SEE ALSO:

to generate uniformly distributed real numbers, , .

EXAMPLES:

sample(state.name, 10)   # pick 10 unique states at random 
sample(1e6, 75)  # pick 75 numbers between 1 and one million 
sample(50)    # random permutation of numbers 1:50 
sample(0:1, 100, T, c(.3,.7))   # Bernoulli(.7) sample of size 100: P(0)=.3, P(1)=.7 
# 20 uniformly distributed numbers on the integers 1:10 with replacement 
sample(10, 20, replace=T) 
sample(10, 20, minimal=T)  # each observation twice
sample(5, 20, prob=c(.3,.4,.1,.1,.1), replace=T) 
sample(5, 20, prob=c(.3,.4,.1,.1,.1)) # minimal=T by default