Use cov.mve with a formula Object

DESCRIPTION:

Returns a list of class mve containing robust estimates of the covariance matrix, the location of the data, and optionally the robust correlation matrix. Specifically, the cov.mve.formula function returns weighted estimates, with weights based on the minimum volume ellipsoid estimator proposed by Rousseeuw (1985). This is a method for the function cov.mve for formula objects.

USAGE:

cov.mve.formula(formula, data=<<see below>>, weights, subset=<<see below>>, 
                na.action=na.fail, model=F, x=F, cor=F, print=T, 
                popsize=<<see below>>, mutate.prob=c(0.15,0.2,0.2,0.2), 
                random.n=<<see below>>, births.n=<<see below>>, 
                stock=list(), maxslen=<<see below>>, 
                stockprob=<<see below>>, nkeep=1, nsamp=<<see below>>) 

REQUIRED ARGUMENTS:

formula
a formula object, with only terms, separated by + operators, on the right of a ~ operator.

OPTIONAL ARGUMENTS:

data
a data frame in which to interpret the variables named in the formula, or in the subset argument. If this is missing, then the variables in the formula should be on the search list. This may also be a single number to handle some special cases -- see below for details.
subset
expression saying which subset of the rows of the data should be used. This can be a logical vector which is replicated to have length equal to the number of observations, a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default.
weights
the current version of cov.mve.formula does not allow input weights.
na.action
a function to filter missing data. This is applied to the model frame after any subset argument has been used. The default, na.fail, signals an error if any missing values are found. A possible alternative is na.exclude, which deletes observations that contain one or more missing values.
model
logical flag: if TRUE, the model frame is returned in component model.
x
logical flag: if TRUE, the model matrix is returned in component x.
cor
logical flag: if TRUE, then the estimated correlation matrix will be returned as well.
print
logical flag: if TRUE, a message about the number of samples taken and the number of those samples that were singular will be printed.
popsize
the population size of the genetic stock. The default is 10 times the number of variables.
mutate.prob
length 4 vector of mutation probabilities for offspring. The first element is the probability of a mutation to one observation in the offspring. The second through fourth elements give the probability that the length of the offspring will be one shorter than the mother, one longer than the mother, or a random length, respectively.
random.n
the number of random samples taken after the stock is filled. The default is 50 times the number of variables.
births.n
the number of genetic births. The default is (100*p)+(20*p^2), where p is the number of variables.
stock
a list of vectors of observation numbers to be included in the stock. This is typically the stock component of the output of a previous call to the function.
maxslen
the maximum number of observations (including duplicates) in a member of the stock. The default is p+1 if (n-p)/2 is less than p+1, where n is the number of observations, and it is the minimum of trunc((n-p)/2) and 5*p otherwise.
stockprob
vector of cumulative probabilities that a member of the stock will be chosen as a parent. The ith element corresponds to the individual with the ith lowest objective. The default is cumsum((2 * (popsize:1))/popsize/(popsize + 1)).
nkeep
the number of individuals in the stock to keep in the output.
nsamp
the total number of samples taken after the stock is filled. nsamp is always equal to popsize + births.n + random.n - length(stock), and the default value is computed from that expression.
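
The default values described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the code used inside cov.mve: the helper name mve.defaults is hypothetical, and n = 111 and p = 3 are illustrative values for the number of observations and variables.

```r
# Hypothetical helper (not part of S-PLUS): evaluates the documented
# default expressions for n observations and p variables.
mve.defaults <- function(n, p, stock = list()) {
    popsize  <- 10 * p                      # population size of the stock
    random.n <- 50 * p                      # number of random samples
    births.n <- (100 * p) + (20 * p^2)      # number of genetic births
    # p+1 if (n-p)/2 is less than p+1, else min(trunc((n-p)/2), 5*p)
    maxslen <- if ((n - p) / 2 < p + 1) p + 1 else min(trunc((n - p) / 2), 5 * p)
    # cumulative probabilities of being chosen as a parent
    stockprob <- cumsum((2 * (popsize:1)) / popsize / (popsize + 1))
    nsamp <- popsize + births.n + random.n - length(stock)
    list(popsize = popsize, random.n = random.n, births.n = births.n,
         maxslen = maxslen, stockprob = stockprob, nsamp = nsamp)
}

d <- mve.defaults(n = 111, p = 3)
d$popsize    # 30
d$births.n   # 480
d$maxslen    # 15
d$nsamp      # 660
```

The last element of stockprob is 1, since the probabilities are cumulative, and the first element (the individual with the lowest objective) gets the largest single share, 2/(popsize + 1).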

VALUE:

an object of class "mve" representing the minimum volume ellipsoid covariance estimation. See the mve.object help file for details.

SIDE EFFECTS:

For multivariate data sets: creates the dataset .Random.seed if it does not already exist; otherwise its value is updated.

If print is TRUE, then a message is printed.

DETAILS:

The formula argument is passed around unevaluated; that is, the variables mentioned in the formula will be defined when the model frame is computed, not when cov.mve.formula is initially called. In particular, if data is given, all these names should generally be defined as variables in that data frame.

The subset argument, like the terms in formula, is evaluated in the context of the data frame, if present. Specifically, the model frame, including subset, is computed on all the rows, and then the appropriate subset is extracted. A variety of special cases make this interpretation desirable (e.g., the use of lag or other functions that may need more than the data used in the computation to be fully defined). On the other hand, if you intended the subset to avoid computing undefined values or to suppress warning messages, you may be surprised. For example, cov.mve(~ log(x), mydata, subset = x > 0) will still generate warnings from log, because log(x) is evaluated on every row before the subset is taken. If this is a problem, subset the rows of the data frame directly: cov.mve(~ log(x), mydata[mydata$x > 0, ]).
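
The row-subsetting workaround can be sketched as follows. This is an illustration only: mydata and its columns x and y are made-up values, and the final cov.mve call is shown as a comment because it requires the S-PLUS function.

```r
# Made-up data: x contains a non-positive value, so log(x) is undefined there.
mydata <- data.frame(x = c(-1, 0.5, 2, 4), y = c(10, 20, 30, 40))

# Select ROWS with positive x. The logical index goes before the comma;
# an index after the comma would select columns instead.
positive <- mydata[mydata$x > 0, ]

# log is now applied only to positive values, so no warning is produced.
logx <- log(positive$x)

# The subsetted data frame can then be passed on, for example:
# cov.mve(~ log(x) + y, positive)
```

Subsetting before the call means that the offending rows never reach the model frame, whereas the subset argument only removes them after every term has been evaluated.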

cov.mve.default is called when the model frame has been computed. See the cov.mve.default help file for details on the computational algorithm.

NAMES. Variables occurring in a formula are evaluated differently from arguments to S-PLUS functions, because the formula is an object that is passed around unevaluated from one function to another. The functions such as cov.mve.formula that finally arrange to evaluate the variables in the formula try to establish a context based on the data argument. More precisely, the function model.frame.default does the actual evaluation, assuming that its caller behaves in the way described here. If the data argument to cov.mve.formula is missing or is an object (typically, a data frame), then the local context for variable names is the frame of the function that called cov.mve.formula, or the top-level expression frame if the user called cov.mve.formula directly. Names in the formula can refer to variables in the local context as well as global variables or variables in the data object.

The data argument can also be a number, in which case that number defines the local context. This can arise, for example, if a function is written to call cov.mve.formula, perhaps in a loop, but the local context is definitely not that function. In this case, the function can set data to sys.parent(), and the local context will be the next function up the calling stack. See the second example below. A numeric value for data can also be supplied if a local context is being explicitly created by a call to new.frame. Notice that supplying data as a number implies that this is the only local context; local variables in any other function will not be available when the model frame is evaluated. This is potentially subtle. Fortunately, it is not something the ordinary user of cov.mve.formula needs to worry about; it is relevant only to those writing functions that call cov.mve.formula.

REFERENCES:

Burns, P. J. (1992). A genetic algorithm for robust regression estimation. (StatSci Technical Note).

Lopuhaa, H. P. and Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19, 229-248.

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications. W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds. Reidel: Dordrecht, 283-297.

Rousseeuw, P. J. (1991). A diagnostic plot for regression outliers and leverage points. Computational Statistics and Data Analysis, 11, 127-129.

Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association, 85, 633-651.

Woodruff, D. L. and Rocke, D. M. (1993). Heuristic search algorithms for the minimum volume ellipsoid estimator. Journal of Computational and Graphical Statistics, 2, 69-95.

EXAMPLES:

cov.mve(~wind+radiation+temperature, data=air) 
# mymve calls cov.mve, using the caller of mymve 
# as the local context for variables in the formula 
# (see aov for an actual example) 
mymve <- function(formula, data = sys.parent(), ...) { 
    # ... other computations ...
    mve <- cov.mve(formula, data, ...) 
    # ... other computations ...
}