lm function.
This function is typically not called directly by users; it is invoked through a call to lm when the data argument is of class "bdFrame".
This function requires the bigdata library section to be loaded.
bdLm(formula, data, weights, subset, na.action, contrasts=NULL, correlation=T)
bdFrame in which to interpret the variables named in the formula, subset, and weights arguments.
data.
If supplied, the fitting algorithm minimizes the sum of the weights multiplied by the squared residuals. The weights must be nonnegative, and it is recommended that they be strictly positive, since zero weights are ambiguous. To exclude particular observations from the model, use the subset argument instead of zero weights.
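The weighted criterion described above can be written out as follows. The symbols are introduced here for illustration and do not appear in the original help text: y_i is the response for observation i, x_i its vector of predictors, w_i its weight, and beta the coefficient vector.

```latex
S(\beta) = \sum_{i=1}^{n} w_i \left( y_i - x_i^{\top} \beta \right)^2,
\qquad w_i \ge 0
```

An observation with w_i = 0 contributes nothing to this sum.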
na.fail, which returns an error if any missing values are found. An alternative is na.exclude, which deletes observations that contain one or more missing values.
contr.helmert for examples.
p by p, where p is the number of predictors. This can get large when there are factors with many levels. To avoid extracting this matrix, specify correlation=F.
bdLm.
The methods for this object provide big data equivalents of the corresponding methods for an lm object. While the methods behave the same, the actual structures of the bdLm and lm objects differ.
The big data library provides two functions for fitting linear models: bdLm and bd.internal.fit.linear.regression.
The bd.internal.fit.linear.regression function is an efficient function that performs a single pass through the data to fit regression coefficients. These coefficients are all of the information needed to predict on new values with the bd.internal.predict function. The function does not provide the data preprocessing done by lm to handle the formula, subset, and na.action arguments, and it does not make an additional pass through the data after fitting the model to compute residuals.
The bdLm function performs the additional steps needed to accept the same arguments as lm and provide the same information regarding the fitted model.
If you are primarily interested in prediction, it is more efficient to use bd.internal.fit.linear.regression. If you will be using methods such as plot and summary to examine the model, use bdLm.
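One way to see why a single pass through the data can be enough to obtain the coefficients, while residuals need a further pass, is the normal-equations form of weighted least squares. This is a sketch only: the help text does not say which algorithm bd.internal.fit.linear.regression uses, and the symbols (x_i, y_i, w_i, beta) are introduced here for illustration.

```latex
\left( \sum_{i=1}^{n} w_i \, x_i x_i^{\top} \right) \hat{\beta}
  = \sum_{i=1}^{n} w_i \, x_i \, y_i
```

Both sums have fixed size (p by p and p by 1, for p predictors) and can be accumulated block by block as the data are read. The residuals y_i - x_i^T beta-hat depend on the fitted coefficients, so they can only be computed by reading the data again after the fit.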
Limitations of bdLm relative to lm:
- the response variable cannot be a matrix
- the predictors cannot contain terms using: offset, poly, bs, ns
- there are no ordered factors for bigdata objects, so the ordered factor contrasts will never be used
- the predict method, predict.bdLm, does not support the se.fit, ci.fit, and pi.fit arguments
- the predict method does not support type="terms"
- the plot method, plot.bdLm, does not provide the normal QQ residuals plot, the rfplot, or the Cook's distance plot. The plots that are produced are hexbin scatterplots.
Chambers, J. M. and Hastie, T. J. (1993). Statistical Models in S. London: Chapman and Hall.
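# Convert the fuel.frame data frame to a bdFrame
fuel.bdFrame <- as.bdFrame(fuel.frame)
# Fit the model on all vehicle types except vans, dropping rows with missing values
bdLm(Mileage ~ Weight + Disp., data=fuel.bdFrame, subset=(Type != "Van"), na.action=na.exclude)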