Big Data Linear Models

DESCRIPTION:

Fit a linear regression model on a big data object using the same syntax as the lm function. This function is typically not called directly by users; instead, it is invoked through a call to lm when the data argument is of class "bdFrame".

This function requires the bigdata library section to be loaded.

USAGE:

bdLm(formula, data, weights, subset, na.action, contrasts=NULL, 
  correlation=T)

REQUIRED ARGUMENTS:

formula
a formula object, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. The response must be a single numeric variable.
data
a bdFrame in which to interpret the variables named in the formula, subset, and weights arguments.

OPTIONAL ARGUMENTS:

weights
the observation weights. This must be one of the columns in data. If supplied, the fitting algorithm minimizes the sum of the squared residuals, each multiplied by its weight. The weights must be nonnegative, and it is recommended that they be strictly positive, since zero weights are ambiguous. To exclude particular observations from the model, use the subset argument instead of zero weights.
subset
an expression specifying which subset of observations should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations) or a numeric vector indicating the observation numbers to be included. All observations are included by default.
na.action
a function to filter missing data. This is applied to the model.frame after any subset argument has been applied. The default is na.fail, which returns an error if any missing values are found. An alternative is na.exclude, which deletes observations that contain one or more missing values.
contrasts
a list of contrasts to be used for some or all of the factors appearing as variables in the model formula. The names of the list should be the names of the corresponding variables. The elements of the list should be either contrast-type matrices (matrices with as many rows as levels of the factor, and with columns linearly independent of each other and of a column of ones), or else they should be functions that compute such contrast matrices. See the help file for contr.helmert for examples.
correlation
a logical value indicating whether to return the computed correlation matrix for the coefficients in the model. The matrix of correlations is of size p by p, where p is the number of coefficients. This can get large when there are factors with many levels. To avoid extracting this matrix, specify correlation=F.

VALUE:

an object of class bdLm. The methods for this object provide big data equivalents of the corresponding methods for an lm object. Although the methods behave the same, the internal structures of bdLm and lm objects differ.

DETAILS:

The big data library provides two functions for fitting linear models: bdLm and bd.internal.fit.linear.regression. The bd.internal.fit.linear.regression function fits the regression coefficients efficiently in a single pass through the data. The coefficients are all of the information needed to predict on new values with the bd.internal.predict function. The function does not perform the data preprocessing that lm does to handle the formula, subset, and na.action arguments, and it does not make an additional pass through the data after fitting the model to compute residuals.

The bdLm function performs the additional steps needed to accept the same arguments as lm and provide the same information regarding the fitted model.

If you are primarily interested in prediction, it is more efficient to use bd.internal.fit.linear.regression. If you will be using methods such as plot and summary to examine the model, use bdLm.
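
As a rough illustration of why one pass can be enough for the coefficients, the sketch below accumulates the cross-products X'X and X'y block by block and solves the normal equations at the end. This is ordinary in-memory S code, not the bigdata implementation. It is not stated here which algorithm bd.internal.fit.linear.regression uses, and the names X, y, XtX, Xty, Xb, and the block size of 20 are introduced only for illustration.

    X <- cbind(1, fuel.frame$Weight, fuel.frame$Disp.)  # intercept plus two predictors
    y <- fuel.frame$Mileage
    XtX <- matrix(0, 3, 3)
    Xty <- matrix(0, 3, 1)
    for(first in seq(1, nrow(X), by=20)) {
      rows <- first:min(first + 19, nrow(X))   # one block of observations
      Xb <- X[rows, , drop=F]
      XtX <- XtX + t(Xb) %*% Xb                # running total of X'X
      Xty <- Xty + t(Xb) %*% y[rows]           # running total of X'y
    }
    solve(XtX, Xty)  # compare with coef(lm(Mileage ~ Weight + Disp., data=fuel.frame))

Residuals depend on the final coefficients, so computing them requires a further pass over the data once the coefficients are known.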

Limitations of bdLm relative to lm:
- the response variable cannot be a matrix
- the predictors cannot contain terms using: offset, poly, bs, ns
- there are no ordered factors for bigdata objects, so the contrasts for ordered factors are never used
- the predict method, predict.bdLm, does not support the se.fit, ci.fit and pi.fit arguments
- the predict method does not support type="terms"
- the plot method, plot.bdLm, does not provide the normal QQ residuals plot, the rfplot, or the Cook's distance plot. The plots that are produced are hexbin scatterplots.

REFERENCES:

Chambers, J. M. and Hastie, T. J. (1993). Statistical Models in S. London: Chapman and Hall.

EXAMPLES:

fuel.bdFrame <- as.bdFrame(fuel.frame)
bdLm(Mileage ~ Weight + Disp., data=fuel.bdFrame, subset=(Type != "Van"),
     na.action=na.exclude)
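
# A hedged sketch of the workflow described in DETAILS, not a verified session.
# fuel.fit is a name introduced here for illustration only.
# Save the fit without the coefficient correlation matrix, then examine it.
fuel.fit <- bdLm(Mileage ~ Weight + Disp., data=fuel.bdFrame, correlation=F)
summary(fuel.fit)   # summary method for the bdLm object
plot(fuel.fit)      # produces hexbin scatterplots

# Because data is of class "bdFrame", a call to lm is expected to invoke bdLm.
lm(Mileage ~ Weight + Disp., data=fuel.bdFrame)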