Use cov.mve on a Vector, Matrix, or Data Frame

DESCRIPTION:

Returns a list of class mve containing robust estimates of the covariance matrix, the location of the data, and optionally the robust correlation matrix. Specifically, the cov.mve.default function returns weighted estimates, with weights based on the minimum volume ellipsoid estimator proposed by Rousseeuw (1985). This is the default method for the function cov.mve.

USAGE:

cov.mve.default(data, cor=F, print=T, popsize=<<see below>>, 
                mutate.prob=c(0.15,0.2,0.2,0.2), random.n=<<see below>>, 
                births.n=<<see below>>, stock=list(), maxslen=<<see below>>, 
                stockprob=<<see below>>, nkeep=1, nsamp==<<see below>>) 

REQUIRED ARGUMENTS:

data
a vector, matrix, or data frame. Columns represent variables, rows represent observations. Missing values ( NAs) and Infinite values ( Infs) are allowed. Observations (rows) with missing or infinite values are excluded from the computations method.

OPTIONAL ARGUMENTS:

cor
logical flag: if TRUE, then the estimated correlation matrix will be returned as well.
print
logical flag: if TRUE, a message about the number of samples taken and the number of those samples that were singular will be printed.
popsize
the population size of the genetic stock. The default is 10 times the number of variables.
mutate.prob
length 4 vector of mutation probabilities for offspring. The first element is the probability of a mutation to one observation in the offspring. The second through fourth elements give the probability that the length of the offspring will be one shorter than the mother, one longer than the mother, or a random length, respectively.
random.n
the number of random samples taken after the stock is filled. The default is 50 times the number of variables.
births.n
the number of genetic births. The default is (100*p)+(20*p^2), where p is the number of variables.
stock
a list of vectors of observation numbers to be included in the stock. This is typically the stock component of the output of a previous call to the function.
maxslen
the maximum number of observations (including duplicates) in a member of the stock. The default is p+1 if (n-p)/2 is less than p+1, where n is the number of observations, and it is the minimum of trunc((n-p)/2) and 5*p otherwise.
stockprob
vector of cumulative probabilities that a member of the stock will be chosen as a parent. The ith element corresponds to the individual with the ith lowest objective. The default is cumsum((2 * (popsize:1))/popsize/(popsize + 1)).
nkeep
the number of individuals in the stock to keep in the output.
nsamp
the total number of samples taken after the stock is filled. nsamp is always popsize + births.n + random.n - length(stock). The default value is the result of the right hand side of the above equation.

VALUE:

an object of class "mve" representing the minimum volume ellipsoid covariance estimation. See the mve.object help file for details.

SIDE EFFECTS:

For multivariate data sets: creates the dataset .Random.seed if it does not already exist, otherwise its value is updated.

If print is TRUE, then a message is printed.

DETAILS:

Let n be the number of observations and p be the number of variables. The minimum volume ellipsoid covariance estimate is the covariance matrix that is defined by the ellipsoid with minimum volume of those ellipsoids that contain floor((n+p+1)/2) of the datapoints. For multivariate data sets, it takes too much time to find the exact estimate, so an approximation is computed.

A genetic algorithm, described in Burns (1992), is used. Individual solutions are defined by a set of observation numbers, each set of observations yielding a classical covariance matrix. A stock of popsize individuals is produced by random sampling, then a number of random samples are taken and the best solutions are saved in the stock. During the genetic phase, two parents are picked which produce an offspring that contains a sample of the observations from the parents. The best two out of the three are retained in the stock. The best of all of the solutions found is used to compute the final covariance matrix. The standard random sampling algorithm can be used by setting popsize to one, maxslen to p+1, and births.n to zero.

The mutate.prob argument controls the mutation of the offspring. The length of the offspring is initially set to be the length of the first parent. This length is reduced by one, increased by one, or given a length uniformly distributed between p+1 and maxslen, according to the last three probabilities in mutate.prob. The other type of mutation that can occur is for one of the observations of the offspring to be changed to an observation picked at random from among all of the observations; the probability of this mutation is specified by the first element of mutate.prob.

It is suggested that the number of observations be at least five times the number of variables. When there are fewer observations than this, there is not enough information to accurately determine if outliers exist.

The minimum volume ellipsoid is not allowed to have zero volume, hence singular covariance matrices from subsamples are ignored (except for being counted). If your data has a covariance matrix that is singular, cov.mve will fail because all of the covariance matrices of the subsamples will be singular. In this case, you will need to modify your data before applying cov.mve, perhaps by using princomp and deleting columns with zero variance.

For univariate data sets, an exact algorithm location.lms is used. See the location.lms help file for more information.

Although the minimum volume ellipsoid covariance estimate has a very high breakdown point, it is inefficient. More efficiency can be attained while retaining the high breakdown point by performing a weighted covariance estimate with weights based on the minimum volume ellipsoid estimate. Such an estimate is what cov.mve returns. The Mahalanobis distance (computed using a scaling of the minimum volume ellipsoid covariance estimate) of each observation is compared to the Chisquare .975 quantile; those observations with smaller distances than this are given weight 1, and the others are given weight 0. The cov.wt function is then used with these weights. This was proposed in Rousseeuw and van Zomeren (1990).

BACKGROUND:

The minimum volume ellipsoid covariance estimator (Rousseeuw, 1985) has a breakdown point that is almost 50%. That is, the estimate cannot be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if either a component goes to infinity (just as in the breakdown of a location or regression estimate), or if the matrix becomes deficient in rank. This is analogous to a scale estimate breaking down if the estimate is going either to infinity or to zero.

REFERENCES:

Burns, P. J. (1992). A genetic algorithm for robust regression estimation. (StatSci Technical Note).

Lopuhaa, H. P. and Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19, 229-248.

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications. W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds. Reidel: Dordrecht, 283-297.

Rousseeuw, P. J. (1991). A diagnostic plot for regression outliers and leverage points. Computational Statistics and Data Analysis, 11, 127-129.

Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association, 85, 633-651.

Woodruff, D. L. and Rocke, D. M. (1993). Heuristic search algorithms for the minimum volume ellipsoid estimator. Journal of Computational and Graphical Statistics, 2, 69-95.

SEE ALSO:

, , , , , .

EXAMPLES:

fr.cov <- cov.mve(freeny.x) 
cov.mve(freeny.x, stock=fr.cov$stock, births=1000)