"lm"
,
"mlm"
,
or
"bdLm"
that represents a linear model fit.
USAGE:
lm(formula, data=<<see below>>, weights=<<see below>>, subset=<<see below>>,
   na.action=na.fail, method="qr", model=F, x=F, y=F, contrasts=NULL, ...)
ARGUMENTS:
formula: a formula object, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. The response may be a single numeric variable or a matrix.
data: a data frame in which to interpret the variables named in the formula, subset, and weights arguments. If data is a bdFrame, then the function bdLm will be called; see the DETAILS section below for additional information and restrictions when using lm with a bdFrame. The data argument may also be a single number to handle some special cases -- see below for details. If data is missing, the variables in the model formula should be in the search path.
weights: a vector of observation weights. The length of weights must be the same as the number of observations. The weights must be nonnegative, and it is recommended that they be strictly positive, since zero weights are ambiguous. To exclude particular observations from the model, use the subset argument instead of zero weights, as in the sketch following these argument descriptions.
na.action: a function to filter missing data, applied to the model.frame after any subset argument has been applied. The default is na.fail, which returns an error if any missing values are found. An alternative is na.exclude, which deletes observations that contain one or more missing values.
"qr"
,
"svd"
, and
"chol"
,
and the default is
"qr"
.
The method
"model.frame"
simply
returns the model frame.
model: if TRUE, the model frame is returned in the model component of the fitted object.
x: if TRUE, the model matrix is returned in the x component of the fitted object.
y: if TRUE, the response is returned in the y component of the fitted object.
qr: if TRUE, the QR decomposition of the model matrix is returned in the qr component of the fitted object.
...: additional arguments passed to lm.fit and the functions it calls. Two possibilities are singular.ok=T, which instructs the fitting algorithm to continue in the presence of over-determined models, and tolerance, which specifies the tolerance level for over-determined models. The default tolerance is 1e-07.
"lm"
or
"mlm"
representing the fit.
See
lm.object
for details.
If the response is a matrix,
then the returned object is of class
"mlm"
.
In this case, the coefficients, residuals, and effects are also matrices,
with columns corresponding to the individual response variables.
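For instance, the following sketch (again assuming the fuel.frame data set from the examples below, with numeric columns Fuel and Mileage) fits both responses at once and returns an "mlm" object whose coefficient matrix has one column per response.

mfit <- lm(cbind(Fuel, Mileage) ~ Weight, data=fuel.frame)
class(mfit)           # "mlm"
coefficients(mfit)    # one column of coefficients per response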
DETAILS:
If the data argument is a bdFrame, then the function bdLm is immediately called by lm. The bdLm function does not support all of the arguments that lm does; see its help file for more information.
The formula argument is passed around unevaluated; that is, the variables in the formula are defined when the model frame is computed, not when lm is initially called. In particular, if data is given, the variables in formula should generally be defined as variables in data. Because they are passed unevaluated from one function to another, variables in a model formula are evaluated differently than arguments to S-PLUS functions.
Functions such as lm that are able to evaluate the formula variables try to establish a context based on the data argument. More precisely, the function model.frame.default does the actual evaluation, assuming that its caller behaves in the way described here.
If the data argument to lm is missing or is an object (typically, a data frame), then the local context for variable names is the frame of the function that called lm. If the user called lm directly, the local context for variable names is the top-level expression frame. Names in the model formula can refer to variables in the local context, as well as to global variables or variables in the data object.
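The sketch below illustrates the local context; the function fit.tons and the variable wt.tons are hypothetical, and fuel.frame is the data set from the examples below.

fit.tons <- function()
{
        # wt.tons lives in this function's frame, which is the local
        # context for the formula; fuel.frame supplies Mileage
        wt.tons <- fuel.frame$Weight / 2000
        lm(Mileage ~ wt.tons, data=fuel.frame)
}
fit.tons()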
The data argument can also be a number, in which case it defines the local context. This can arise, for example, if a function is written to call lm but the local context is definitely not that function's frame. In this case, the function can set data to sys.parent(), and the local context will be the next function up in the calling stack. See the last example below for an illustration of this. A numeric value for data can also be supplied if a local context is explicitly created by a call to new.frame.
Note that supplying data as a number implies that it is the only local context; local variables in any other function will not be available when the model frame is evaluated. This is potentially subtle. Fortunately, it is not something the ordinary user of lm needs to worry about. It is relevant, however, for those writing functions that call lm (or other similar model-fitting functions).
The subset argument, like the terms in the model formula, is evaluated in the context of the data argument, if present. The specific action of subset is as follows: the model frame, including weights and subset, is computed on all rows of the data set, and then the appropriate subset is extracted. A variety of special cases make such an interpretation desirable. For example, functions such as lag may need more than the data used in the fit to be fully defined. On the other hand, if you use subset to avoid computing undefined values or to escape warning messages, you may be surprised. For example,

    lm(y ~ log(x), data=mydata, subset=x > 0)

still generates warnings from log. To avoid this, do the subsetting on the data frame directly:

    lm(y ~ log(x), data=mydata[mydata$x > 0, ])
Generic functions such as print and summary have methods for showing the results of a fit. See lm.object for a description of the fit components. The functions residuals, coefficients, and effects should be used to extract components, rather than subscripting them directly from the lm.object. The extractor functions take correct account of special circumstances, such as overdetermined models.
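A minimal sketch of the extractor functions (assuming the fuel.frame data set from the examples below):

fit <- lm(Mileage ~ Weight, data=fuel.frame)
coefficients(fit)    # preferred over fit$coefficients
residuals(fit)
effects(fit)
summary(fit)         # standard errors, t-statistics, and so on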
S-PLUS implements observation weights through the weights argument to most regression functions. Observation weights are appropriate when the variances of individual observations are inversely proportional to the weights. For a set of weights wi, one interpretation is that the ith observation is the average of wi other observations, each having the same predictors and (unknown) variance. This is the interpretation of the weights included in the claims example below.
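The following sketch (not part of the claims example; it assumes the fuel.frame data set from the examples below) illustrates this interpretation: averaging the responses within groups that share the same predictor value and weighting by the group sizes reproduces the coefficients of the fit to the full data.

full <- lm(Mileage ~ Weight, data=fuel.frame)
grp  <- split(fuel.frame$Mileage, fuel.frame$Weight)
avg  <- data.frame(Weight  = as.numeric(names(grp)),
                   Mileage = sapply(grp, mean),
                   n       = sapply(grp, length))
wtd  <- lm(Mileage ~ Weight, data=avg, weights=n)
coefficients(full)
coefficients(wtd)    # same coefficients as the fit to the full data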
Another situation in which these types of weights arise is when the relative precision of the observations is known in advance. It is important to note that an observation weight is not the same as a frequency, or case weight, which represents the number of times a particular observation is repeated. It is possible to supply frequencies as the weights argument to an S-PLUS regression function; although this produces the correct coefficients for the model, inference tools such as standard errors, p-values, and confidence intervals are incorrect. In addition, S-PLUS does not currently support weighted regression when the absolute precision of the observations is known. This situation arises often in physics and engineering, when the uncertainty associated with a particular measurement is known in advance due to properties of the measuring procedure or device. If you know the absolute precision of your observations, it is possible to supply these precisions to the weights argument. This computes the correct coefficients for your model, but the standard errors and other inference tools will be incorrect.
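The sketch below shows the frequency-weight caution in practice; the repeat counts freq are hypothetical, and fuel.frame is the data set from the examples below. Using frequencies as weights reproduces the coefficients of the expanded data set, but not its standard errors.

freq     <- rep(c(1, 2), length=nrow(fuel.frame))
expanded <- fuel.frame[rep(1:nrow(fuel.frame), freq), ]
fit.wt   <- lm(Mileage ~ Weight, data=fuel.frame, weights=freq)
fit.ex   <- lm(Mileage ~ Weight, data=expanded)
coefficients(fit.wt)    # coefficients agree with fit.ex
coefficients(fit.ex)
summary(fit.wt)         # standard errors differ from those of fit.ex
summary(fit.ex)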
REFERENCES:
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.
Draper, N. R. and Smith, H. (1981). Applied Regression Analysis (second edition). New York: Wiley.
Myers, R. H. (1986). Classical and Modern Regression with Applications. Boston: Duxbury.
Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. New York: Wiley.
Seber, G. A. F. (1977). Linear Regression Analysis. New York: Wiley.
Weisberg, S. (1985). Applied Linear Regression (second edition). New York: Wiley.
There is a vast literature available on regression; the references above are just a small sample. The book by Myers is an introductory text that includes a discussion of many of the recent advances in regression technology. The Seber book is at a higher mathematical level and covers much of the classical theory of least squares.
EXAMPLES:
lm(freeny.y ~ freeny.x)
lm(Fuel ~ . , data=fuel.frame)

# formulas have intercepts by default, so include
# a -1 for regression without an intercept.
lm(Mileage ~ Weight - 1, data=fuel.frame)

# example of weighted regression
lm(cost ~ age + type + car.age, data=claims,
   weights=number, na.action=na.exclude)

# myfit calls lm, using the caller to myfit
# as the local context for variables in the formula
# (see aov for an actual example)
myfit <- function(formula, data=sys.parent(), ...)
{
        ..
        ..
        fit <- lm(formula, data, ...)
        ..
        ..
}