Fit Linear Regression Model

DESCRIPTION:

Returns an object of class "lm", "mlm", or "bdLm" that represents a linear model fit.

USAGE:

lm(formula, data=<<see below>>, weights=<<see below>>,  
   subset=<<see below>>, na.action=na.fail, method="qr", model=F,  
   x=F, y=F, contrasts=NULL, ...) 

REQUIRED ARGUMENTS:

formula
a formula object, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. The response may be a single numeric variable or a matrix.

OPTIONAL ARGUMENTS:

data
a data frame or bdFrame in which to interpret the variables named in the formula, subset, and weights arguments. If data is a bdFrame, then the function bdLm is called. See the DETAILS section below for additional information and restrictions when using lm with a bdFrame. data may also be a single number to handle some special cases; see the DETAILS section below. If data is missing, the variables in the model formula should be in the search path.
weights
vector of observation weights. If supplied, the fitting algorithm minimizes the sum of the weights multiplied by the squared residuals (see below for additional technical details). The length of weights must be the same as the number of observations. The weights must be nonnegative and it is recommended that they be strictly positive, since zero weights are ambiguous. To exclude particular observations from the model, use the subset argument instead of zero weights.
subset
expression specifying which subset of observations should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations), a numeric vector indicating the observation numbers to be included, or a character vector of the observation names that should be included. All observations are included by default.
na.action
a function to filter missing data. This is applied to the model frame after any subset argument has been applied. The default is na.fail, which signals an error if any missing values are found. An alternative is na.exclude, which deletes observations that contain one or more missing values.
method
the least squares fitting method to be used; the options are "qr", "svd", and "chol", and the default is "qr". The method "model.frame" simply returns the model frame.
model
logical flag: if TRUE, then the model frame is returned in the model component of the fitted object.
x
logical flag: if TRUE, then the model matrix is returned in the x component of the fitted object.
y
logical flag: if TRUE, then the response is returned in the y component of the fitted object.
qr
logical flag: if TRUE, then the QR decomposition of the model matrix is returned in the qr component of the fitted object.
contrasts
a list giving contrasts for some or all of the factors appearing in the model formula. An element in the list should have the same name as the factor variable it encodes, and it should be either a contrast matrix (any full-rank matrix with as many rows as there are levels in the factor), or a function that computes such a matrix given the number of levels.
...
additional arguments for the fitting routines; see lm.fit and the functions it calls. Two possibilities are singular.ok=T, which instructs the fitting algorithm to continue in the presence of over-determined models, and tolerance, which specifies the tolerance level for over-determined models. The default tolerance is 1e-07.
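
The criterion described under the weights argument (the sum of the weights multiplied by the squared residuals) can be checked numerically. The sketch below is only an illustration: the vectors x, y, and w are made-up data, not one of the datasets used in this help file. It relies on the standard result that a weighted fit is equivalent to an unweighted fit in which every row, including the intercept column, is scaled by the square root of its weight.

```r
# Hypothetical data, invented for this illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1, 6.3)
w <- c(1, 2, 1, 3, 2, 1)

# Weighted fit: minimizes sum(w * residual^2).
fit.w <- lm(y ~ x, weights = w)

# Equivalent unweighted fit on rows scaled by sqrt(w).
# The intercept column is scaled as well, so it is supplied
# explicitly (sw) and the default intercept is removed with -1.
sw <- sqrt(w)
ys <- sw * y
xs <- sw * x
fit.s <- lm(ys ~ sw + xs - 1)

# The two sets of coefficients agree up to rounding error.
coef(fit.w)
coef(fit.s)

# The minimized criterion is the same in both fits.
sum(w * residuals(fit.w)^2)
sum(residuals(fit.s)^2)
```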

VALUE:

an object of class "lm" or "mlm" representing the fit. See lm.object for details. If the response is a matrix, then the returned object is of class "mlm". In this case, the coefficients, residuals, and effects are also matrices, with columns corresponding to the individual response variables.
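
As a rough illustration of the matrix-response case, the sketch below fits two responses at once. The data and the column names y1 and y2 are invented for the example.

```r
# Hypothetical data, invented for this illustration only.
x <- c(1, 2, 3, 4, 5, 6)
Y <- cbind(y1 = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.3),
           y2 = c(2.0, 1.7, 1.1, 0.9, 0.4, 0.2))

# A matrix on the left of ~ gives a multiple-response fit.
fit.m <- lm(Y ~ x)

inherits(fit.m, "mlm")   # the fit carries class "mlm"
dim(coef(fit.m))         # one column of coefficients per response
dim(residuals(fit.m))    # one column of residuals per response
```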

DETAILS:

If the data argument is a bdFrame, then lm immediately calls the function bdLm. The bdLm function does not support all of the arguments that lm does. See the bdLm help file for more information.

The formula argument is passed around unevaluated; that is, the variables in the formula are defined when the model frame is computed, and not when lm is initially called. In particular, if data is given, the variables in formula should generally be defined as variables in data.

Because they are passed unevaluated from one function to another, variables in a model formula are evaluated differently than arguments to S-PLUS functions. Functions such as lm that are able to evaluate the formula variables try to establish a context based on the data argument. More precisely, the function model.frame.default does the actual evaluation, assuming that its caller behaves in the way described here. If the data argument to lm is missing or is an object (typically, a data frame), then the local context for variable names is the frame of the function that called lm. If the user called lm directly, the local context for variable names is the top-level expression frame. Names in the model formula can refer to variables in the local context, as well as to global variables or variables in the data object.

The data argument can also be a number, in which case it defines the local context. This can arise, for example, if a function is written to call lm but the local context is definitely not that function's frame. In this case, the function can set data to sys.parent(), and the local context will be the next function up in the calling stack. See the last example below for an illustration of this. A numeric value for data can also be supplied if a local context is explicitly created by a call to new.frame. Note that supplying data as a number implies that it is the only local context; local variables in any other function will not be available when the model frame is evaluated. This is potentially subtle. Fortunately, it is not something the ordinary user of lm needs to worry about. It is relevant, however, for those writing functions that call lm (or other similar model-fitting functions).

The subset argument, like the terms in the model formula, is evaluated in the context of the data argument, if present. The specific action of subset is as follows: the model frame, including weights and subset, is computed on all rows of the data set and then the appropriate subset is extracted. A variety of special cases make such an interpretation desirable. For example, functions such as lag may need more than the data used in the fit to be fully defined. On the other hand, if you use subset to avoid computing undefined values or to escape warning messages, you may be surprised. For example,

lm(y ~ log(x), data=mydata, subset=x > 0)

still generates warnings from log. To avoid this, do the subsetting on the data frame directly:

lm(y ~ log(x), data=mydata[mydata$x > 0, ])

NOTES:

Generic functions such as print and summary have methods for showing the results of a fit. See lm.object for a description of the fit components. The functions residuals, coefficients, and effects should be used to extract components, rather than subscripting them directly from the lm.object. The extractor functions take correct account of special circumstances, such as over-determined models.
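
A minimal sketch of the extractor functions named above, using made-up data (the vectors x and y are hypothetical):

```r
# Hypothetical data, invented for this illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1, 6.3)
fit <- lm(y ~ x)

# Preferred: use the extractor functions.
b <- coefficients(fit)   # estimated intercept and slope
r <- residuals(fit)      # one residual per observation
e <- effects(fit)        # orthogonal effects from the fit

# Discouraged: subscripting the object directly, e.g. fit$residuals,
# which bypasses any special handling done by the extractors.
```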

S-PLUS implements observation weights through the weights argument to most regression functions. Observation weights are appropriate when the variances of individual observations are inversely proportional to the weights. For a set of weights w_i, one interpretation is that the ith observation is the average of w_i other observations, each having the same predictors and (unknown) variance. This is the interpretation of the weights included in the claims example below. Another situation in which these types of weights arise is when the relative precision of the observations is known in advance.

It is important to note that an observation weight is not the same as a frequency, or case weight, which represents the number of times a particular observation is repeated. It is possible to include frequencies as a weights argument to an S-PLUS regression function; although this produces the correct coefficients for the model, inference tools such as standard errors, p-values, and confidence intervals are incorrect. In addition, S-PLUS does not currently support weighted regression when the absolute precision of the observations is known. This situation arises often in physics and engineering, when the uncertainty associated with a particular measurement is known in advance due to properties of the measuring procedure or device. If you know the absolute precision of your observations, it is possible to supply it to the weights argument. This computes the correct coefficients for your model, but the standard errors and other inference tools will be incorrect.
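
The difference between frequencies and observation weights can be seen in a small sketch. The data and the frequency vector freq are made up for the illustration. One fit passes the frequencies as weights; the other physically repeats each observation.

```r
# Hypothetical data, invented for this illustration only.
x <- c(1, 2, 3, 4)
y <- c(1.2, 1.9, 3.4, 3.9)
freq <- c(3, 1, 2, 4)   # number of times each observation occurs

# Fit 1: frequencies supplied as observation weights.
fit.f <- lm(y ~ x, weights = freq)

# Fit 2: each observation repeated freq times (10 rows in all).
idx <- rep(1:4, freq)
xr <- x[idx]
yr <- y[idx]
fit.r <- lm(yr ~ xr)

# The coefficients agree.
coef(fit.f)
coef(fit.r)

# The standard errors do not: the weighted fit has 4 - 2 = 2
# residual degrees of freedom, the replicated fit has 10 - 2 = 8.
se.f <- summary(fit.f)$coefficients[, 2]
se.r <- summary(fit.r)$coefficients[, 2]
se.f / se.r
```

If the rows really are repeated observations, the standard errors from the replicated fit are the appropriate ones; the weighted fit reports larger values because it counts only four observations.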

REFERENCES:

Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. New York: Wiley.

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis (second edition). New York: Wiley.

Myers, R. H. (1986). Classical and Modern Regression with Applications. Boston: Duxbury.

Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. New York: Wiley.

Seber, G. A. F. (1977). Linear Regression Analysis. New York: Wiley.

Weisberg, S. (1985). Applied Linear Regression (second edition). New York: Wiley.

There is a vast literature available on regression; the references above are just a small sample. The book by Myers is an introductory text that includes a discussion of many of the recent advances in regression technology. The Seber book is at a higher mathematical level and covers much of the classical theory of least squares.

SEE ALSO:

See for a description of the syntax of formulas.

EXAMPLES:

lm(freeny.y ~ freeny.x) 
lm(Fuel ~ . , data=fuel.frame)

# formulas have intercepts by default, so include 
# a -1 for regression without an intercept.
lm(Mileage ~ Weight - 1, data=fuel.frame) 

# example of weighted regression 
lm(cost ~ age + type + car.age, data=claims, 
    weights=number, na.action=na.exclude) 

# myfit calls lm, using the caller of myfit
# as the local context for variables in the formula
# (see aov for an actual example)
myfit <- function(formula, data=sys.parent(), ...) {
    # (computations before the fit are omitted here)
    fit <- lm(formula, data, ...)
    # (computations after the fit are omitted here)
}