Partition Data

DESCRIPTION:

Randomly sample the rows of your data set to partition it into up to three subsets for training, testing, and validating your models.

This function requires the bigdata library section to be loaded.

USAGE:

bd.partition(data, train=0.7, test=0.3, seed=NULL)

REQUIRED ARGUMENTS:

data
a bdFrame or data.frame.

OPTIONAL ARGUMENTS:

train
fraction of observations to be allocated to the training data set. Must be a numeric value from 0 to 1.
test
fraction of observations to be allocated to the test data set. Must be a numeric value from 0 to 1.
seed
if NULL (the default), the seed is set from the current S-PLUS random seed, so a new seed is used for sampling on every call. If an integer, that value is used as the seed.

VALUE:

a list of bdFrame(s) or data.frame(s), of the same type as data.

SIDE EFFECTS:

The function creates the dataset .Random.seed if it does not already exist; otherwise its value is updated.

DETAILS:

This function splits the input into up to three outputs according to the train and test fractions. The length of the returned list depends on these fractions. For example, if train is 1.0 or greater, only one output is generated. If train=.25 and test=.75, a two-element list is returned. If train=.23 and test=.75, the returned list contains three objects; the validation object holds the remaining 2 percent of the observations.

EXAMPLES:

# Partition fuel.frame into three data sets containing 65%, 20%
#   and 15% of the observations:
bd.partition(fuel.frame, 0.65, 0.20)
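
# A hedged sketch of the cases described under DETAILS, assuming the
#   bigdata library section is loaded with library(bigdata). The
#   object names and the seed value 7 are illustrative only. The names
#   and order of the list elements are not documented here, so
#   inspect the result rather than assuming them.
library(bigdata)
parts3 <- bd.partition(fuel.frame, train=0.23, test=0.75, seed=7)
length(parts3)   # DETAILS indicates three elements for these fractions
parts2 <- bd.partition(fuel.frame, train=0.25, test=0.75, seed=7)
length(parts2)   # DETAILS indicates two elements for these fractions
sapply(parts3, nrow)   # number of rows in each element of the list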