size observations from the population, or a sample from the integers 1 to n.
This is a generic function; methods exist for data.frame, bdVector, bdFrame, and seriesVirtual; the default method is for vectors.
sample(x, size = n, replace = F, prob = NULL, n = NULL, ...)
rsample(n, size = n, replace = F, prob = NULL, bigdata = F, minimal = NULL, ..., order = T)
x is a vector, data frame, bdVector, bdFrame, or other object giving a population to sample from, or a positive integer giving the size of the population n (then the population is 1:n). Missing values (NAs) are allowed.
size is the sample size. The default, n (with replace=FALSE), generates a random permutation.
If replace is TRUE, sampling is done with replacement; otherwise sampling is without replacement or with minimal replacement, see below.
prob is a vector or bdVector of probabilities of length n, giving probabilities of selection for each of the elements of x. The elements of prob will be normalized to sum to one. The default NULL gives equal probabilities for each element of the population.
n is the population size; the population is then 1:n. This may be given in place of x, in order to avoid ambiguity. Only one of x and n should be supplied.
Additional arguments (...) are passed to rsample.
If bigdata is TRUE then a bdVector is returned; the default is FALSE. The specific sample produced may differ based on this setting.
If minimal is TRUE, sampling is done with minimal replacement rather than no replacement (this argument is ignored if replace=TRUE), and a different algorithm is used for sampling with unequal probabilities. The default is TRUE when sampling with unequal probabilities. See details below.
If order is TRUE (the default) then the output is randomly ordered. If FALSE then some methods skip the final step that randomly reorders the output; this is faster, particularly for big data.
If n is supplied (or if x is a positive integer) then the result is a sample from 1:n. This is a bdVector if bigdata=T, otherwise an ordinary vector. If x is a data frame or a bdFrame, then the result is a sample of the rows. If x is a vector or bdVector, then the result is a sample of the observations.
sample creates the dataset .Random.seed if it does not already exist; otherwise its value is updated.
Most of the methods for sample call rsample, then use the returned values to subscript x.
To generate a sample from 1:n, we recommend either using rsample, or calling sample using the argument n rather than x, as the latter is ambiguous. However, for backward compatibility, you can still let the argument x be an integer giving the population size.
n cannot be larger than the largest positive integer on the machine (.Machine$integer.max, 2147483647 on a 32-bit computer).
If x represents a population, it can be any object with a length and for which subscripting works. For example, it can be a vector of character strings. Missing values in x are treated like any other value.
If replace=TRUE, sampling is done with replacement. If replace=FALSE and prob is not supplied, then minimal=FALSE gives sampling without replacement; an error occurs if size>n. If minimal=TRUE, then sampling is with minimal replacement: if size>n then each observation is included size %/% n times, and the remaining draws are taken without replacement.
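The minimal-replacement rule for equal probabilities can be sketched as follows. This is an illustration in Python rather than S-PLUS, and the function name sample_minimal is hypothetical; it only mirrors the description above (size %/% n full copies, then the remainder without replacement) and is not the S-PLUS implementation.

```python
import random

def sample_minimal(population, size, rng=random):
    """Sample `size` items with minimal replacement, equal probabilities."""
    items = list(population)
    n = len(items)
    # Integer quotient and remainder, i.e. size %/% n and size %% n in S.
    full_copies, remainder = divmod(size, n)
    # Every observation is included `full_copies` times ...
    draws = items * full_copies
    # ... and the remaining draws are taken without replacement.
    draws += rng.sample(items, remainder)
    # Put the result in random order.
    rng.shuffle(draws)
    return draws

# Population 1:10, size 20: every value appears exactly twice.
print(sample_minimal(range(1, 11), 20))
```

With size 23 and n = 10, the same sketch returns every value twice and three distinct values a third time.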
If prob is supplied and replace=FALSE, then the algorithm depends on the value of minimal. The default is minimal=TRUE. Specify minimal=FALSE to get the same random values as in S-PLUS versions 6.2 and earlier.
If minimal=FALSE, then values are drawn sequentially with probabilities proportional to prob, excluding elements already drawn. If size>1, this does not give overall selection probabilities proportional to prob; the actual selection probabilities are between those implied by prob and equal probabilities. Different permutations of the same set of outcomes also have different probabilities of being chosen (the order argument is ignored). If size>n an error occurs.
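A small exact calculation shows why sequential drawing does not give selection probabilities proportional to prob. The sketch below is in Python rather than S-PLUS; the function name inclusion_probs_sequential and the three-element example are assumptions chosen for illustration. It enumerates every ordered outcome of the sequential scheme described above and sums the probabilities.

```python
from fractions import Fraction
from itertools import permutations

def inclusion_probs_sequential(prob, size):
    """Exact probability that each element is selected when `size` elements
    are drawn one at a time, proportional to `prob`, excluding earlier draws."""
    prob = [Fraction(p) for p in prob]
    incl = [Fraction(0)] * len(prob)
    for ordered in permutations(range(len(prob)), size):
        remaining = sum(prob)          # total weight still available
        p_seq = Fraction(1)            # probability of this ordered outcome
        for j in ordered:
            p_seq *= prob[j] / remaining
            remaining -= prob[j]
        for j in ordered:
            incl[j] += p_seq
    return incl

prob = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print([str(p) for p in inclusion_probs_sequential(prob, 2)])
# ['5/6', '7/12', '7/12']
```

Proportional selection would require 1, 1/2, 1/2 and equal probabilities would give 2/3 each; the values 5/6 and 7/12 lie between the two. The ordered outcome (first, second) has probability 1/4 while (second, first) has probability 1/6, which illustrates the remark about permutations.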
If minimal=TRUE, then the probability of selecting observation j equals size*prob[j] (with prob normalized to sum to one), and every permutation of a set of outcomes has the same probability of being chosen (assuming order=T).
The algorithm randomly permutes the values and corresponding probabilities, divides the unit interval into blocks of length proportional to prob, does a systematic sample of size random numbers uniformly on the unit interval, selects the observations for which the uniform numbers fall into the corresponding blocks, then does a final random permutation (if order=TRUE).
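The steps above can be sketched as follows. This is a Python illustration, not the S-PLUS code: the function name sample_systematic is hypothetical, indices are 0-based, and the "systematic sample" is assumed to mean one uniform random start followed by equally spaced points 1/size apart.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_systematic(prob, size, order=True, rng=random):
    """Return `size` indices chosen with probability proportional to `prob`
    by systematic sampling on the unit interval."""
    n = len(prob)
    total = float(sum(prob))
    # Step 1: randomly permute the elements (and so their probabilities).
    perm = list(range(n))
    rng.shuffle(perm)
    # Step 2: right-hand edges of blocks with lengths prob[j] / total.
    edges = list(accumulate(prob[j] / total for j in perm))
    # Step 3: systematic sample, one random start then steps of 1/size
    # (an assumption about what "systematic" means here).
    start = rng.random() / size
    points = [start + k / size for k in range(size)]
    # Step 4: each point selects the block it falls into; min() guards
    # against rounding that leaves the last edge slightly below 1.
    picks = [perm[min(bisect_right(edges, u), n - 1)] for u in points]
    # Step 5: final random permutation of the output, if requested.
    if order:
        rng.shuffle(picks)
    return picks

print(sample_systematic([0.3, 0.4, 0.1, 0.1, 0.1], size=7))
```

Under this sketch, element j is returned either floor(size*prob[j]) or ceiling(size*prob[j]) times, so its expected count is size*prob[j].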
"Minimal replacement" is not taken literally in the unequal probability
case: if size*max(prob)>1 then duplicates may occur, and if size*max(prob)>2 then duplicates are guaranteed.
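The two thresholds can be checked with simple arithmetic on the expected number of copies of the most likely element, size*max(prob). The Python sketch below uses a hypothetical helper, duplicate_outlook, and the probabilities from the examples in this page; it only restates the rule quoted above and does not call S-PLUS.

```python
def duplicate_outlook(prob, size):
    """Classify whether duplicates can occur, using the thresholds
    size*max(prob) > 1 and size*max(prob) > 2 quoted in the text."""
    total = sum(prob)                    # prob is normalized to sum to one
    largest = size * max(prob) / total   # expected copies of the likeliest element
    if largest > 2:
        return "guaranteed"
    if largest > 1:
        return "possible"
    return "not expected"

probs = [0.3, 0.4, 0.1, 0.1, 0.1]
print(duplicate_outlook(probs, 2))    # 2 * 0.4 = 0.8  -> not expected
print(duplicate_outlook(probs, 3))    # 3 * 0.4 = 1.2  -> possible
print(duplicate_outlook(probs, 20))   # 20 * 0.4 = 8   -> guaranteed
```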
If x is a seriesVirtual object (timeSeries, signalSeries, bdTimeSeries, or bdSignalSeries), the result is the sampled (big data) time series or signal series, sorted in ascending order of its positions unless order=T is explicitly given.
sample(state.name, 10)   # pick 10 unique states at random
sample(1e6, 75)   # pick 75 numbers between 1 and one million
sample(50)   # random permutation of numbers 1:50
sample(0:1, 100, T, c(.3,.7))   # Bernoulli(.7) sample of size 100
# 20 uniformly distributed numbers on the integers 1:10 with replacement
sample(10, 20, replace=T)
sample(10, 20, minimal=T)   # each observation twice
sample(5, 20, prob=c(.3,.4,.1,.1,.1), replace=T)
sample(5, 20, prob=c(.3,.4,.1,.1,.1))   # minimal=T by default