lmsreg on a Vector, Matrix, or Data Frame

Least median of squares regression of y on x.
x and y are allowed to be either a vector, a matrix, or a data frame.
This is the default method for the function lmsreg.
lmsreg.default(x, y, nsamp="standard", intercept=T, wt=T, diagnostic=F, yname=NULL, quan=<<see below>>, mve=T)
intercept=TRUE
.
Missing values (NAs) and infinite values (Infs) are allowed.
Observations (rows) with missing or infinite values in either x or y
are excluded from the computations.
"all"
or
"standard"
.
If numeric, this specifies the number of non-singular random
subsamples of the observations that are to be used.
If nsamp="all", then all subsamples are used.
Note that this can be a very large number even for quite small datasets:
there are n-choose-p subsamples, where n is the number of observations and p
is the number of explanatory variables.
The default (nsamp="standard") is to take all of the subsamples if there are
fewer than 3000, and to find 3000 non-singular random samples otherwise.
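To get a feel for how quickly the subsample count grows, the following sketch counts the n-choose-p subsamples and applies the 3000 threshold described above. It is plain R arithmetic, not code taken from lmsreg; the helper name n.subsamples.used is hypothetical, and p is assumed here to count every fitted coefficient, intercept included.

```r
# Hypothetical helper: how many subsamples the "standard" rule would use.
# Assumes p counts every fitted coefficient (intercept included).
n.subsamples.used <- function(n, p, threshold = 3000) {
  total <- choose(n, p)          # number of distinct p-subsets of n rows
  if (total < threshold) total else threshold
}

choose(21, 4)             # 21 rows, 3 predictors plus intercept: 5985 subsets
n.subsamples.used(21, 4)  # capped at the threshold: 3000
n.subsamples.used(21, 2)  # small enough to enumerate completely: 210
```

With 21 observations and 4 coefficients, complete enumeration is already about twice the default budget, which is why nsamp="all" should be requested deliberately.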
lmsreg be returned?
These weights can be used in lsfit or lm to
obtain a weighted least squares solution.
TRUE and if p>1, resistant diagnostics are returned.
p is the number of explanatory variables.
y
.
floor(n/2) + floor((p+1)/2), where n is the number
of observations and p is the number of explanatory variables.
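As a quick check of the formula above, the default can be evaluated directly. The function name default.quan is hypothetical and only wraps the expression from the text.

```r
# Default order-statistic index, using the formula from the text.
default.quan <- function(n, p) floor(n / 2) + floor((p + 1) / 2)

default.quan(21, 4)   # 10 + 2 = 12
default.quan(20, 2)   # 10 + 1 = 11
```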
TRUE, cov.mve will be called on x.
The results are needed in plot.lms for the diagnostic plot.
"lms"
giving the solution.
See the lms.object help file for details.
Let p be the number of explanatory variables
(including the intercept term, if present).
For p>1, a large number of subsamples of p observations is taken.
Each of these subsamples is used to get a trial set of coefficients.
If intercept=TRUE, then the least median of squares location estimate is
computed on the residuals based on these coefficients.
The best set of coefficients for each regression is retained.
"Best" in this context means the coefficients for which the quan-th order
statistic of the absolute value of the residuals is smallest;
by default quan is floor(n/2) + floor((p+1)/2),
where n is the number of observations.
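The subsampling idea described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the lmsreg.default implementation: lms.sketch and its arguments are hypothetical names, an intercept is always fitted, the intercept adjustment step is omitted, and the random number generator is R's own with a fixed seed.

```r
# Illustrative sketch only; lms.sketch is a hypothetical name.
lms.sketch <- function(x, y, nsamp = 500, seed = 1, max.tries = 20 * nsamp) {
  X <- cbind(1, as.matrix(x))              # add an intercept column
  n <- nrow(X)
  p <- ncol(X)
  quan <- floor(n / 2) + floor((p + 1) / 2)
  set.seed(seed)                           # fixed seed: repeated calls agree
  best.crit <- Inf
  best.coef <- rep(NA, p)
  found <- 0
  tries <- 0
  while (found < nsamp && tries < max.tries) {
    tries <- tries + 1
    idx <- sample(n, p)                    # p observations, without replacement
    Xs <- X[idx, , drop = FALSE]
    if (qr(Xs)$rank < p) next              # skip singular subsamples
    found <- found + 1
    b <- solve(Xs, y[idx])                 # exact fit through the subsample
    crit <- sort(abs(y - drop(X %*% b)))[quan]  # quan-th smallest |residual|
    if (crit < best.crit) {
      best.crit <- crit
      best.coef <- b
    }
  }
  list(coef = as.vector(best.coef), crit = best.crit, quan = quan)
}

# 14 points on the line y = 2 + 3x plus 6 gross outliers
x <- 1:20
y <- 2 + 3 * x
y[15:20] <- 100
fit <- lms.sketch(x, y)
fit$coef   # close to c(2, 3) despite the outliers
```

Any subsample made up only of uncontaminated points reproduces the true line, so more than quan residuals are essentially zero and that trial fit wins.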
Since specifying nsamp="all" can easily be a request for millions of samples,
the maximum number of non-singular samples is limited to 30,000.
This limit can be changed by editing the function.
The lmsreg.default function has a built-in random number generator that
starts with the same seed on each call to lmsreg.
Thus the same subsamples and hence the same answer will be found by
repeated identical calls.
The default value of 3000 random samples will give greater than 99% probability
of a 50% breakdown point for problems with nine or fewer explanatory variables.
The probability of a high breakdown drops as the number of explanatory
variables grows beyond ten.
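A standard way to reason about such probabilities (see Rousseeuw and Leroy, 1987) is to ask how likely it is that at least one of m random subsamples of size p contains no contaminated observations when a fraction eps of the data is contaminated. The sketch below evaluates that expression; it is assumed, not confirmed, to be the calculation behind the figures quoted above, and the function name prob.clean.subsample is hypothetical.

```r
# Chance that at least one of m random p-subsets contains no contaminated
# rows when a fraction eps of the data is contaminated (large-n approximation).
prob.clean.subsample <- function(p, m = 3000, eps = 0.5) {
  1 - (1 - (1 - eps)^p)^m
}

# Above 0.99 for p = 9, below 0.99 for p = 10, and falling quickly after that
round(prob.clean.subsample(p = c(5, 9, 10, 12)), 4)
```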
For p=1, the exact algorithm in location.lms is used.
See the location.lms help file for more information.
Rather than minimizing the sum of the squared residuals as least squares
regression does,
least median of squares (Rousseeuw, 1984) minimizes the median of the squared
residuals.
Actually, it is not precisely the median that is minimized
but rather a certain order statistic of the squared (or absolute) residuals.
Least median of squares regression has a very high breakdown point of almost
50%.
That is, almost half of the data can be corrupted in an arbitrary fashion
and the least median of squares estimates continue to follow the
majority of the data.
At the present time this property is virtually unique among the robust
regression methods that are publicly available, including the methods in rreg.
However, least median of squares is statistically very inefficient;
one remedy is to use lm with the weights returned from lmsreg
in a weighted least squares regression.
This procedure will give high breakdown estimates
that are also quite efficient.
The test statistics derived from the least squares regression
will not be strictly correct, but can be used informally.
The least trimmed squares method (Rousseeuw, 1984)
is statistically more efficient than the least median of squares method,
and the corresponding function ltsreg minimizes the
objective much more efficiently.
Therefore, ltsreg is often recommended.
However, ltsreg does not allow multiple responses.
If the model contains categorical or binary regressors,
lmsreg will encounter many singular subsamples.
Rousseeuw, P. J. (1984).
Least median of squares regression.
Journal of the American Statistical Association, 79, 871-88.
Rousseeuw, P. J. and Leroy, A. M. (1987).
Robust Regression and Outlier Detection.
New York: Wiley.
stacklms <- lmsreg(stack.x, stack.loss, nsamp="all")
# reweighted least squares
stackrls <- lm(stack.loss~stack.x, weights=as.logical(stacklms$lms.wt))