Handle Missing Values

DESCRIPTION:

Specify a method for dealing with missing values in your data set.

This function requires the bigdata library section to be loaded.

USAGE:

bd.remove.missing(data, columns, methods="drop",
                   replacement.values=0,
                   key.columns=character(0))

REQUIRED ARGUMENTS:

data
The input data set: a bdFrame or a data.frame.

OPTIONAL ARGUMENTS:

columns
The names or numbers of the columns to examine for missing values. If you do not specify a value, all columns are examined.
methods
The vector of methods for processing columns that contain missing values. Options are: none (no change); dropRows or drop (drop rows where this column contains a missing value); generateFromDistribution or distribution (replace NA with a value generated from the column's marginal distribution); replaceWithMean or mean (replace NA with the column mean); replaceWithConstant or constant (replace NA with a value from replacement.values); lastObservation or last (replace NA with the last non-missing value from a row with the same value in the column given by key.columns).
replacement.values
The vector of replacement values used with the replaceWithConstant method.
key.columns
The vector of key column names used with the lastObservation method. These should be factor columns.

VALUE:

A bdFrame or data.frame, of the same type as data.

DETAILS:

The Missing Values component supports five different methods for dealing with missing values in your data set:

Drop Rows
This option drops from your data set every row that contains a missing value in the examined columns.
Generate from Distribution
This option generates sensible values from the marginal distributions of the columns that contain missing values. For a categorical variable, values are generated based on the proportion of observations corresponding to each level. For a continuous variable, a histogram of the data is computed, and then values based on the heights of the histogram bars are generated.
Replace with Mean
This option replaces each missing value with the average of the values in the corresponding column. For a categorical variable, missing values are replaced with the level that appears most often. In the event of ties, the first level that appears in the data set is chosen.
Replace with Constant
This option replaces each missing value with a constant you specify.
Last Observation Carried Forward
This option replaces a missing value with the most recent non-missing value from a preceding row that has the same key column value. The key column is specified by the key.columns argument. If the key column is not given or is an empty string, then this option replaces a missing value with the last non-missing value in the same column.
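
The Replace with Mean and Last Observation Carried Forward rules described above can be illustrated with a small base-R sketch. This is not the bigdata implementation and does not call bd.remove.missing. The helper locf and the column names id and y are hypothetical, and the sketch covers only a continuous column.

```r
# Illustrative sketch only; locf() is a hypothetical helper,
# not part of the bigdata library.
locf <- function(x) {
  # Carry the previous value forward into each missing position.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

df <- data.frame(id = factor(c("a", "a", "b", "b", "a")),
                 y  = c(1, NA, 5, NA, NA))

# Replace with Mean: fill each NA with the mean of the non-missing values.
y.mean <- ifelse(is.na(df$y), mean(df$y, na.rm = TRUE), df$y)
# y.mean is 1 3 5 3 3

# Last Observation Carried Forward, applied within each level of the key column.
y.last <- ave(df$y, df$id, FUN = locf)
# y.last is 1 1 5 5 1
```

In this sketch a missing value at the start of a key group stays NA, because there is no earlier observation to carry forward. The text above does not say how bd.remove.missing handles that case.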

EXAMPLES:

## Drop Rows
bd.remove.missing(data.frame(c(1:10, NA)), methods="dropRows")
bd.remove.missing(data.frame(c(1:10, NA)), methods="drop")

## Replace with constant
bd.remove.missing(data.frame(c(1:10, NA)), methods="replaceWithConstant", replacement.values="2")
bd.remove.missing(data.frame(c("A","B", NA)), methods="constant", replacement.values="MissingData")

## Replace with generated value
bd.remove.missing(data.frame(c(1:10, NA)), methods="generateFromDistribution")
bd.remove.missing(data.frame(c("A","B", NA)), methods="distribution")
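
## Replace with mean
## (untested sketch based on the argument descriptions above)
bd.remove.missing(data.frame(c(1:10, NA)), methods="replaceWithMean")
bd.remove.missing(data.frame(c("A","B","B", NA)), methods="mean")

## Last observation carried forward
## (untested sketch; the data and the column names id and y are hypothetical)
bd.remove.missing(data.frame(id=factor(c("a","a","b","b")), y=c(1, NA, 5, NA)),
                  columns="y", methods="last", key.columns="id")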