lsfit(x, y, wt=<<see below>>, intercept=T, tolerance=1.e-07, yname=NULL)
x: vector or matrix of explanatory variables. Each column represents a variable and each row an observation. It should not include a column of 1s unless the argument intercept is FALSE. The number of rows of x should equal the number of observations in y, and there should be fewer columns than rows. NAs and Infs are allowed but will be removed. This can also be a bdNumeric or numeric bdFrame, in which case y must be a bdNumeric of the same length.
y: vector or matrix of response variables. NAs and Infs are allowed but will be removed.
wt: vector of weights for weighted least squares. wt should be inversely proportional to the variance. By default, an unweighted regression is carried out. NAs and Infs are allowed but will be removed. This argument is not supported when x is a big data object. A short sketch of a weighted fit appears after the argument descriptions below.
intercept: if TRUE, a constant (intercept) term is included in each regression.
tolerance: numerical tolerance used in the fitting computations (the default is 1.e-07). This argument is not supported when x is a big data object.
yname: names for the y variates in the regression output. However, if y is a matrix with a dimnames attribute containing column names, then these will be used. This argument is not supported when x is a big data object.
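As a sketch of how wt is typically supplied, suppose the variance of each response is proportional to a known factor v (both the data and v are made up for illustration); weights inversely proportional to that variance might then be used as follows:
# made-up data: the variance of y[i] is assumed proportional to v[i]
x <- matrix(rnorm(200), ncol = 2)
v <- runif(100, 0.5, 2)
y <- x %*% c(1, -2) + rnorm(100, sd = sqrt(v))
wfit <- lsfit(x, y, wt = 1/v)    # weighted fit, wt inversely proportional to the variance
ufit <- lsfit(x, y)              # unweighted fit, for comparison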
The result of lsfit is a list with the following components.
coef: the least-squares coefficients. This is a vector unless y has more than one column, in which case coef contains one column for each regression with optional constant terms in the first row. Its dimnames are taken from x, y and yname if applicable.
residuals: an object of the same shape as y containing the residuals. This component is not present when x is a big data object.
wt: if wt was given as an argument, it is also returned as part of the result.
qr: the QR decomposition of the x matrix (plus a column of 1s, if an intercept was included). If wt was specified, the qr object will represent the decomposition of the weighted x matrix. See function qr for the details of this object. It is used primarily with functions like qr.qty that compute auxiliary results for the regression from the decomposition. This component is not present when x is a big data object, but some equivalent information is available in the details component. A sketch of reusing this decomposition appears after the component descriptions below.
details: when x is a big data object, the results include a details component. This is a list of summary information available from the big data linear regression routine that is not part of the standard lsfit results.
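As mentioned under qr above, the stored decomposition can be reused with functions like qr.qty. One possible sketch, using the freeny data from the example below, recovers the residual sum of squares from the rotated response:
fit <- lsfit(freeny.x, freeny.y)
qty <- qr.qty(fit$qr, freeny.y)   # t(Q) %*% y, computed from the stored decomposition
p <- ncol(freeny.x) + 1           # columns of x plus the intercept column
sum(qty[-(1:p)]^2)                # essentially equals sum(fit$residuals^2)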
An observation is considered unusable if there is an NA or Inf in any response variable, any explanatory variable or in the weight (if present) for the observation. If your data have several missing values, there may be much better ways of analyzing your data than throwing out the observations like this; see, for instance, chapter 10 of Weisberg (1985).
The lsfit function does least squares regression, that is, it finds a set of parameters such that the (weighted) sum of squared residuals is minimized. The (implicit) assumption of least squares is that the errors have a Gaussian distribution; if there are outliers, the results of the regression may be misleading.
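Equivalently, the coefficients solve the normal equations for the design matrix (here including a column of 1s for the intercept). The following sketch, using the freeny data from the example below, checks this against lsfit; the names X and b are introduced only for illustration:
X <- cbind(1, freeny.x)                     # design matrix with an intercept column
b <- solve(t(X) %*% X, t(X) %*% freeny.y)   # normal-equations solution
fit <- lsfit(freeny.x, freeny.y)
max(abs(b - fit$coef))                      # agrees with lsfit up to numerical error
# (lsfit itself uses a QR decomposition, which is numerically more stable)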
The assumptions of regression are that the observations are statistically independent, the response y is linear in the covariates represented by x, and that there is no error in x. A time series model is one alternative if the observations are not independent. The linearity assumption is loosened in ace, avas and ppreg.
A robust regression can help if there are gross errors in x (e.g., typographical errors), since these will likely make the corresponding responses appear to be gross outliers; such points are likely to have high leverage (see hat).
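For example, leverages can be screened with hat along these lines; the cutoff 2*p/n used here is only a common rule of thumb, not part of lsfit itself:
lev <- hat(freeny.x)                    # diagonal of the hat matrix (intercept included by default)
p <- ncol(freeny.x) + 1                 # parameters, counting the intercept
which(lev > 2 * p / length(freeny.y))   # observations with unusually high leverage, if any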
If the x matrix is not known with certainty (an "errors-in-variables" model), the regression coefficients will typically be biased downward.
The classical use of a weighted regression is to handle the case when the variability of the response is not the same for all observations. Another approach to this same problem is to transform y and/or the variables in x so that the variance is constant and linearity holds. In practice, a transformation that helps linearity often also reduces problems with the variance. If a choice must be made, linearity is the more important of the two, since a weighted regression can then be used to handle the unequal variances.
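For instance, when the spread of the response grows with its level, a log transformation of y is a common choice; a sketch with made-up data:
x <- runif(60, 1, 10)                           # made-up predictor
y <- exp(0.5 + 0.3 * x + rnorm(60, sd = 0.2))   # spread of y grows with its level
tfit <- lsfit(x, log(y))                        # the log transform stabilises the variance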
It is good data analysis practice to view plots to check the suitability of a solution. Appropriate plots include the residuals versus the fit, the residuals versus the x variables, and a qqplot of the residuals.
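With the freeny data from the example below, one way to produce these plots is:
fit <- lsfit(freeny.x, freeny.y)
yhat <- as.vector(freeny.y - fit$residuals)   # fitted values
plot(yhat, fit$residuals)                     # residuals versus the fit
plot(freeny.x[, 1], fit$residuals)            # residuals versus one of the x variables
qqnorm(fit$residuals)                         # normal qqplot of the residuals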
Polynomial regression can be performed with lsfit by using a command similar to cbind(x, x^2). It is better numerical practice to create orthogonal polynomials, especially as the order of the polynomial increases. When orthogonal polynomials are not used, the columns of the x matrix can be quite collinear (one column is close to being a linear combination of other columns). Collinearity outside of the polynomial regression case can cloud interpretation of the results as well as being a numerical concern.
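A sketch of both approaches with a made-up predictor; the two parameterizations span the same column space, so they give the same fitted values, but the orthogonal basis is better conditioned:
x <- seq(0, 1, length = 50)                       # made-up predictor
y <- 1 + 2 * x - 3 * x^2 + rnorm(50, sd = 0.1)
fit.raw  <- lsfit(cbind(x, x^2), y)               # raw polynomial terms; the columns are highly correlated
fit.orth <- lsfit(poly(x, 2), y)                  # orthogonal polynomials of degree 2
max(abs(fit.raw$residuals - fit.orth$residuals))  # essentially zero: same fitted values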
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. Wiley, New York.
Draper, N. R. and Smith, H. (1981). Applied Regression Analysis (second edition). Wiley, New York.
Myers, R. H. (1986). Classical and Modern Regression with Applications. Duxbury, Boston.
Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley, New York.
Seber, G. A. F. (1977). Linear Regression Analysis. Wiley, New York.
Weisberg, S. (1985). Applied Linear Regression (second edition). Wiley, New York.
There is a vast literature on regression; the references above are just a small sample of what is available. The book by Myers is an introductory text that includes a discussion of many recent advances in regression technology.
The Seber book is at a higher mathematical level
and covers much of the classical theory of least squares.
regfreeny <- lsfit(freeny.x, freeny.y)
ls.print(regfreeny)