Use lmsreg on a Vector, Matrix, or Data Frame

DESCRIPTION:

Performs least median of squares regression of y on x. Each of x and y may be a vector, a matrix, or a data frame. This is the default method for the function lmsreg.

USAGE:

lmsreg.default(x, y, nsamp="standard", intercept=T, wt=T,
               diagnostic=F, yname=NULL, quan=<<see below>>, mve=T)

REQUIRED ARGUMENTS:

x
vector, matrix, or data frame of explanatory variables. Rows of the matrix represent observations, columns represent variables. A constant term should not be included; a better result is usually achieved by removing such a column and setting intercept=TRUE. Missing values (NAs) and infinite values (Infs) are allowed.
y
vector, matrix, or data frame whose columns represent response variables. Missing values (NAs) and infinite values (Infs) are allowed. Observations (rows) with missing or infinite values in either x or y are excluded from the computations.

OPTIONAL ARGUMENTS:

nsamp
either a positive integer or one of the character strings "all" or "standard". If numeric, this specifies the number of non-singular random subsamples of the observations that are to be used. If nsamp="all", then all subsamples are used. Note that this can be a very large number even for quite small datasets: there are n-choose-p subsamples, where n is the number of observations and p is the number of explanatory variables. The default (nsamp="standard") is to take all of the subsamples if there are fewer than 3000, and to find 3000 non-singular random samples otherwise.
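The n-choose-p subsample count can be checked directly with the built-in choose function; here we use the dimensions of the stack loss data from the EXAMPLES section (n = 21 observations, p = 3 explanatory variables):

```r
# Number of p-out-of-n subsamples that nsamp="all" would enumerate.
n <- 21
p <- 3
choose(n, p)   # 1330: below the 3000 cutoff, so nsamp="standard" takes them all
```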
intercept
logical flag: should a constant (intercept) term be included?
wt
logical flag: should weights computed by lmsreg be returned? These weights can be used in lsfit or lm to obtain a weighted least squares solution.
diagnostic
logical flag: if TRUE and if p>1, resistant diagnostics are returned. p is the number of explanatory variables.
yname
vector of character strings of the names of variables in y.
quan
the number of observations that are to be treated as a "half". The default value is floor(n/2) + floor((p+1)/2), where n is the number of observations and p is the number of explanatory variables.
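For example, the default quan for the stack loss data (n = 21, and p = 3 explanatory variables plus an intercept, so p = 4 in the formula's sense if the constant is counted; here we take p = 3 as a plain illustration of the formula):

```r
# Default "half" used as the order statistic of the absolute residuals.
n <- 21
p <- 3
quan <- floor(n/2) + floor((p + 1)/2)
quan   # 12
```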
mve
logical flag: if TRUE, cov.mve will be called on x. The results are needed in plot.lms for the diagnostic plot.

VALUE:

an object of class "lms" giving the solution. See the lms.object help file for details.

DETAILS:

Let p be the number of explanatory variables (including the intercept term, if present). For p>1, a large number of subsamples of p observations is taken. Each of these subsamples is used to get a trial set of coefficients. If intercept=TRUE, then the least median of squares location estimate is performed on the residuals based on these coefficients. The best set of coefficients is retained, where "best" means the coefficients for which the quan-th order statistic of the absolute values of the residuals is smallest.
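The resampling step can be sketched in S as follows. This is an illustrative re-implementation, not the built-in algorithm: the function name lms.sketch is invented for the example, intercept handling is simplified to an appended column of ones, and the location-estimate refinement mentioned above is omitted.

```r
# Sketch of the p > 1 algorithm: draw random p-point subsamples, solve each
# exactly for trial coefficients, and keep the coefficients whose quan-th
# smallest absolute residual is lowest.
lms.sketch <- function(x, y, nsamp = 3000) {
  x <- cbind(1, as.matrix(x))              # prepend an intercept column
  n <- nrow(x); p <- ncol(x)
  quan <- floor(n/2) + floor((p + 1)/2)    # default "half", as defined above
  best.crit <- Inf; best.coef <- NULL
  for (i in 1:nsamp) {
    idx <- sample(n, p)                    # a random subsample of p observations
    xs <- x[idx, , drop = FALSE]
    if (abs(det(xs)) < 1e-8) next          # skip singular subsamples
    b <- solve(xs, y[idx])                 # exact fit through the p points
    crit <- sort(abs(y - x %*% b))[quan]   # quan-th order statistic of |residuals|
    if (crit < best.crit) { best.crit <- crit; best.coef <- b }
  }
  list(coefficients = best.coef, criterion = best.crit)
}
```

For p=1 the document notes that the exact algorithm location.lms is used instead of resampling.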

Since specifying nsamp="all" can easily be a request for millions of samples, the maximum number of non-singular samples is limited to 30,000. This limit can be changed by editing the function.

The lmsreg.default function has a built-in random number generator that starts with the same seed on each call to lmsreg. Thus the same subsamples and hence the same answer will be found by similar calls. The default value of 3000 random samples will give greater than 99% probability of a 50% breakdown point for problems with nine or fewer explanatory variables. The probability of a high breakdown drops as the number of explanatory variables grows beyond ten.
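The 99% figure admits a back-of-the-envelope check. Assuming 50% contamination and subsamples of p observations, the chance that at least one of nsamp subsamples is entirely clean is 1 - (1 - 0.5^p)^nsamp; this is a standard calculation, not taken from the source:

```r
# Probability that at least one of 3000 random size-p subsamples is free of
# contamination when half of the data are corrupted.
p <- 9
1 - (1 - 0.5^p)^3000   # roughly 0.997, i.e. greater than 99%
```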

For p=1, an exact algorithm location.lms is used. See the location.lms help file for more information.

BACKGROUND:

Rather than minimizing the sum of the squared residuals as least squares regression does, least median of squares (Rousseeuw, 1984) minimizes the median of the squared residuals. Actually, it is not precisely the median that is minimized but rather a certain order statistic of the squared (or absolute) residuals.

Least median of squares regression has a very high breakdown point of almost 50%. That is, almost half of the data can be corrupted in an arbitrary fashion and the least median of squares estimates continue to follow the majority of the data. At the present time this property is virtually unique among the robust regression methods that are publicly available, including the methods in rreg.

However, least median of squares is statistically very inefficient; one remedy to this is to use lm with the weights returned from lmsreg in a weighted least squares regression. This procedure will give high breakdown estimates that are also quite efficient. The test statistics derived from the least squares regression will not be strictly correct, but can be used informally.

NOTE:

The least trimmed squares method (Rousseeuw, 1984) is statistically more efficient than the least median of squares method, and the corresponding function ltsreg minimizes its objective much more efficiently; ltsreg is therefore often recommended instead. However, ltsreg does not allow multiple responses.

If the model also contains categorical or binary regressors, lmsreg will encounter many singular subsamples.

REFERENCES:

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York: Wiley.

SEE ALSO:

cov.mve, lm, lms.object, location.lms, lsfit, ltsreg, plot.lms, rreg.

EXAMPLES:

stacklms <- lmsreg(stack.x, stack.loss, nsamp="all")
# reweighted least squares
stackrls <- lm(stack.loss~stack.x, weights=as.logical(stacklms$lms.wt))