cov.mve
Minimum Volume Ellipsoid Covariance Estimation on a Vector, Matrix, or Data Frame

DESCRIPTION
Returns an object of class "mve" containing robust estimates of the
covariance matrix, the location of the data, and optionally the robust
correlation matrix. Specifically, the cov.mve.default function returns
weighted estimates, with weights based on the minimum volume ellipsoid
estimator proposed by Rousseeuw (1985). This is the default method for
the function cov.mve.
USAGE
cov.mve.default(data, cor=F, print=T, popsize=<<see below>>,
        mutate.prob=c(0.15,0.2,0.2,0.2), random.n=<<see below>>,
        births.n=<<see below>>, stock=list(), maxslen=<<see below>>,
        stockprob=<<see below>>, nkeep=1, nsamp=<<see below>>)
ARGUMENTS

data: a vector, matrix, or data frame of observations. Missing values
(NAs) and infinite values (Infs) are allowed. Observations (rows) with
missing or infinite values are excluded from the computations.

cor: logical flag: if TRUE, then the estimated correlation matrix is
returned as well.

print: logical flag: if TRUE, a message about the number of samples
taken and the number of those samples that were singular is printed.

The default is (100*p)+(20*p^2), where p is the number of variables.

stock: an initial stock of individuals, such as the stock component of
the output of a previous call to the function.

maxslen: the maximum length (number of observations) of an individual.
The default is p+1 if (n-p)/2 is less than p+1, where n is the number of
observations, and the minimum of trunc((n-p)/2) and 5*p otherwise.

stockprob: the cumulative probabilities of selecting each member of the
stock; the ith element corresponds to the individual with the ith lowest
objective. The default is cumsum((2 * (popsize:1))/popsize/(popsize + 1)).

nsamp: the number of samples taken; nsamp is always
popsize + births.n + random.n - length(stock). The default value is the
result of the right-hand side of this equation.
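The defaults above are simple functions of n, p, and popsize. A sketch in Python for illustration (maxslen_default and stockprob_default are hypothetical names, not part of the S function):

```python
import numpy as np

def maxslen_default(n, p):
    # p+1 if (n-p)/2 < p+1, otherwise min(trunc((n-p)/2), 5*p)
    if (n - p) / 2 < p + 1:
        return p + 1
    return min(int((n - p) // 2), 5 * p)

def stockprob_default(popsize):
    # cumsum((2 * (popsize:1))/popsize/(popsize + 1)):
    # cumulative selection probabilities, heaviest on the best individual
    ranks = np.arange(popsize, 0, -1)   # popsize, popsize-1, ..., 1
    return np.cumsum(2 * ranks / (popsize * (popsize + 1)))

print(maxslen_default(n=100, p=5))      # min(trunc(95/2), 25) -> 25
print(stockprob_default(5))
```

Note that the stockprob default sums to one, so it is a proper cumulative selection distribution over the stock.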
VALUE
an object of class "mve" representing the minimum volume ellipsoid
covariance estimate. See the mve.object help file for details.
SIDE EFFECTS
The function creates the dataset .Random.seed if it does not already
exist; otherwise its value is updated. If print is TRUE, a message is
printed.
DETAILS
Let n be the number of observations and p be the number of variables.
The minimum volume ellipsoid covariance estimate is the covariance matrix
that is defined by the ellipsoid with minimum volume of those ellipsoids
that contain
floor((n+p+1)/2)
of the datapoints.
For multivariate data sets, finding the exact estimate takes too much
time, so an approximation is computed.
A genetic algorithm, described in Burns (1992), is used.
Individual solutions are defined by a set of observation numbers,
each set of observations yielding a classical covariance matrix.
A stock of
popsize
individuals is produced by random sampling, then
a number of random samples are taken and the best solutions are saved in
the stock.
During the genetic phase, two parents are picked which produce an offspring
that contains a sample of the observations from the parents.
The best two out of the three are retained in the stock.
The best of all of the solutions found is used to compute the
final covariance matrix.
The standard random sampling algorithm can be used by setting
popsize
to
one,
maxslen
to
p+1
, and
births.n
to zero.
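That random sampling baseline can be sketched outside of S. The Python function below (hypothetical names, numpy only; an illustration of the idea rather than the cov.mve implementation) draws random (p+1)-subsets, inflates each candidate ellipsoid until it covers floor((n+p+1)/2) points, and keeps the smallest:

```python
import numpy as np

def mve_by_random_sampling(x, nsamp=500, seed=0):
    """Approximate the minimum volume ellipsoid by random (p+1)-subsets."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    h = (n + p + 1) // 2                 # the ellipsoid must cover h points
    best = None
    singular = 0
    for _ in range(nsamp):
        rows = rng.choice(n, size=p + 1, replace=False)
        m = x[rows].mean(axis=0)
        s = np.cov(x[rows], rowvar=False)
        det = np.linalg.det(s)
        if det <= 1e-12:                 # singular subsamples are only counted
            singular += 1
            continue
        # Squared Mahalanobis distance of every observation to this candidate
        d2 = np.einsum('ij,jk,ik->i', x - m, np.linalg.inv(s), x - m)
        scale = np.sort(d2)[h - 1]       # inflate the ellipsoid to cover h points
        volume = np.sqrt(det) * scale ** (p / 2)   # up to a constant factor
        if best is None or volume < best[0]:
            best = (volume, m, s * scale)
    return best[1], best[2], singular
```

With popsize set to one, maxslen to p+1, and births.n to zero, the genetic search degenerates to exactly this kind of subsampling.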
The
mutate.prob
argument controls the mutation of the offspring.
The length of the offspring is initially set to be the length of the first
parent.
This length is reduced by one, increased by one, or given a
length uniformly distributed between
p+1
and
maxslen
, according to the
last three probabilities in
mutate.prob
.
The other type of mutation that can occur is for one of the observations
of the offspring to be changed to an observation picked at random from
among all of the observations; the probability of this mutation is specified
by the first element of
mutate.prob
.
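Those mutation rules can be sketched as follows (Python for illustration; mutate_offspring and its arguments are hypothetical names, not part of cov.mve):

```python
import random

def mutate_offspring(offspring, n_obs, p, maxslen,
                     mutate_prob=(0.15, 0.2, 0.2, 0.2)):
    """One mutation pass over an offspring (a list of distinct row indices)."""
    p_obs, p_shrink, p_grow, p_unif = mutate_prob
    out = list(offspring)

    # Length mutation: shrink by one, grow by one, or redraw the length
    # uniformly between p+1 and maxslen, per the last three probabilities.
    u = random.random()
    if u < p_shrink and len(out) > p + 1:
        out.pop(random.randrange(len(out)))
    elif u < p_shrink + p_grow and len(out) < maxslen:
        out.append(random.choice([i for i in range(n_obs) if i not in out]))
    elif u < p_shrink + p_grow + p_unif:
        target = random.randint(p + 1, maxslen)
        while len(out) > target:
            out.pop(random.randrange(len(out)))
        pool = [i for i in range(n_obs) if i not in out]
        while len(out) < target:
            out.append(pool.pop(random.randrange(len(pool))))

    # Observation mutation: with the first probability, swap one index
    # for an observation drawn at random from all observations.
    if random.random() < p_obs:
        pool = [i for i in range(n_obs) if i not in out]
        if pool:
            out[random.randrange(len(out))] = random.choice(pool)
    return out
```

The invariants are that the offspring always keeps between p+1 and maxslen distinct observation numbers.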
It is suggested that the number of observations be at least five times the
number of variables.
When there are fewer observations than this, there is not enough information
to accurately determine if outliers exist.
The minimum volume ellipsoid is not allowed to have zero volume, hence
singular covariance matrices from subsamples are ignored (except for being
counted).
If your data has a covariance matrix that is singular,
cov.mve
will fail because all of the covariance matrices
of the subsamples will be singular.
In this case, you will need to modify your data before applying
cov.mve
,
perhaps by using
princomp
and deleting columns with zero variance.
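The suggested repair amounts to rotating the data to principal components and dropping the zero-variance directions. A sketch in Python with numpy (illustrative only; the SVD stands in for the princomp call, and the function name is hypothetical):

```python
import numpy as np

def drop_degenerate_columns(x, tol=1e-10):
    """Rotate to principal components and keep only those with variance
    above tol, so the reduced data have a nonsingular covariance matrix."""
    xc = x - x.mean(axis=0)
    # SVD of the centered data plays the role of princomp here.
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    keep = s > tol * max(s[0], 1.0)
    return xc @ vt[keep].T            # scores on the non-degenerate components

x = np.random.default_rng(0).normal(size=(30, 3))
x = np.column_stack([x, x[:, 0] + x[:, 1]])   # fourth column is redundant
z = drop_degenerate_columns(x)                # back to three full-rank columns
```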
For univariate data sets, an exact algorithm, location.lms, is used.
See the location.lms help file for more information.
Although the minimum volume ellipsoid covariance estimate has a
very high breakdown point, it is inefficient.
More efficiency can be attained while retaining the high breakdown point by
performing a weighted covariance estimate with weights based on the minimum
volume ellipsoid estimate.
Such an estimate is what
cov.mve
returns.
The Mahalanobis distance (computed using a scaling of the minimum
volume ellipsoid covariance estimate) of each observation is compared
to the Chisquare .975 quantile; those observations with smaller distances
than this are given weight
1
, and the others are given weight
0
.
The
cov.wt
function is then used with these weights.
This was proposed in Rousseeuw and van Zomeren (1990).
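The reweighting step can be sketched as follows (Python for illustration; the coordinatewise median/MAD fit below merely stands in for the minimum volume ellipsoid estimate, and all names are hypothetical):

```python
import numpy as np

def reweighted_cov(x, loc0, cov0, cutoff):
    """One-step reweighting: weight 1 where the squared Mahalanobis distance
    under the initial robust fit is below cutoff, weight 0 otherwise, then
    recompute a classical weighted mean and covariance (as cov.wt would)."""
    d2 = np.einsum('ij,jk,ik->i', x - loc0, np.linalg.inv(cov0), x - loc0)
    w = (d2 <= cutoff).astype(float)
    loc = (w[:, None] * x).sum(axis=0) / w.sum()
    xc = x - loc
    cov = (w[:, None] * xc).T @ xc / (w.sum() - 1.0)
    return loc, cov, w

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(size=(95, 2)),            # clean bulk
               rng.normal(8.0, 0.5, size=(5, 2))])  # gross outliers
# For p = 2 the chisquare .975 quantile has the closed form -2*log(0.025).
cutoff = -2.0 * np.log(0.025)
# A median/MAD fit stands in for the minimum volume ellipsoid estimate:
loc0 = np.median(x, axis=0)
mad = 1.4826 * np.median(np.abs(x - loc0), axis=0)
loc, cov, w = reweighted_cov(x, loc0, np.diag(mad**2), cutoff)
```

Points far from the robust fit get weight 0, so the final classical estimate is computed from the clean bulk only.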
The minimum volume ellipsoid covariance estimator (Rousseeuw, 1985) has a breakdown point that is almost 50%. That is, the estimate cannot be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if either a component goes to infinity (just as in the breakdown of a location or regression estimate), or if the matrix becomes deficient in rank. This is analogous to a scale estimate breaking down if the estimate is going either to infinity or to zero.
REFERENCES
Burns, P. J. (1992). A genetic algorithm for robust regression estimation.
StatSci Technical Note.
Lopuhaa, H. P. and Rousseeuw, P. J. (1991).
Breakdown points of affine equivariant estimators of multivariate location and
covariance matrices.
Annals of Statistics,
19, 229-248.
Rousseeuw, P. J. (1985).
Multivariate estimation with high breakdown point.
In
Mathematical Statistics and Applications.
W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds.
Reidel: Dordrecht, 283-297.
Rousseeuw, P. J. (1991).
A diagnostic plot for regression outliers and leverage points.
Computational Statistics and Data Analysis,
11, 127-129.
Rousseeuw, P. J. and van Zomeren, B. C. (1990).
Unmasking multivariate outliers and leverage points (with discussion).
Journal of the American Statistical Association,
85, 633-651.
Woodruff, D. L. and Rocke, D. M. (1993).
Heuristic search algorithms for the minimum volume ellipsoid estimator.
Journal of Computational and Graphical Statistics,
2, 69-95.
EXAMPLES
fr.cov <- cov.mve(freeny.x)
cov.mve(freeny.x, stock=fr.cov$stock, births.n=1000)