Variance, Covariance, and Correlation

DESCRIPTION:

var returns the variance of a vector, the variance-covariance matrix of a data matrix, or the covariances between matrices or vectors. cor returns the corresponding correlations; a trimming fraction may be specified. cov2cor converts a variance-covariance matrix to a correlation matrix.

USAGE:

var(x, y, na.method="fail", unbiased=T, SumSquares=F,  
    weights=NULL, freq=NULL, ...) 
cor(x, y, trim=0, na.method="fail", unbiased=T, 
    weights=NULL, freq=NULL, ...) 
cov2cor(V)

REQUIRED ARGUMENTS:

x
a numeric (or complex) matrix, vector, data frame, or bdFrame. If a matrix, columns represent variables and rows represent observations. If a data frame or a bdFrame, non-numeric variables result in missing values in the result.
V
for the cov2cor function, a covariance matrix. Must be square. Missing values are allowed but result in missing values in the result.

OPTIONAL ARGUMENTS:

y
a numeric (or complex) matrix, vector, data frame, or bdFrame. If a matrix, columns represent variables and rows represent observations. If a data frame or bdFrame, non-numeric variables result in missing values in the result. This argument must have the same number of observations as x.
trim
a number from 0 up to, but not including, 0.5 that specifies the proportion trimmed from each end in the internal calculations for cor. The default, 0, applies no trimming. This should be larger than the suspected fraction of outliers.
na.method
a character string specifying how missing values are to be handled. Options are:
"fail" (stop if any missing data is found),
"omit" (omit rows with any missing data),
"include" (missing values in the input result in missing values in the output), and
"available" (use available observations, see DETAILS below).
Only enough of the string to determine a unique match is required.
unbiased
logical value. If TRUE, variances are the sample variances sum((x-mean(x))^2)/(N-1) for a vector of length N. These variances are unbiased if the values in x are obtained by simple random sampling. If unbiased=FALSE, the definition sum((x-mean(x))^2)/N is used instead. By default, unbiased=TRUE.
SumSquares
logical value. If TRUE, the unnormalized sums of squares are returned, with no division by either N or N-1. In this case, the argument unbiased is ignored. By default, SumSquares=FALSE.
weights
a vector with the same number of observations as x. If weights is supplied, the arguments unbiased and trim cannot be used, and the definition used for variance is:
sum(weights * (x-mean(x, weights=weights))^2)/sum(weights).
If SumSquares=TRUE then division by sum(weights) is omitted.
freq
a vector with the same number of observations as x, giving frequencies; should be integer values. Calling cor(x, freq=f) is equivalent to replicating observations in x using freq, e.g. cor(x[rep(1:nrow(x), freq),,drop=FALSE]).
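
The variance definitions selected by unbiased, SumSquares, weights, and freq can be written out directly. The following is a minimal sketch that uses only elementary functions, in the portable subset of S that also runs in R. The vectors x, w, and f are made-up illustrative values, and the code restates the formulas above rather than the internal implementation of var. The freq case assumes the replication equivalence described for cor also fixes the divisor for var.

```r
# Illustrative data (hypothetical values, not from any dataset)
x <- c(2, 4, 4, 5, 7)
N <- length(x)

# Sum of squared deviations: the quantity described for SumSquares=TRUE
ss <- sum((x - mean(x))^2)

# unbiased=TRUE divides by N-1; unbiased=FALSE divides by N
varUnbiased <- ss / (N - 1)
varBiased <- ss / N

# Weighted definition: weighted mean, then division by sum(weights)
w <- c(1, 2, 1, 1, 3)
wmean <- sum(w * x) / sum(w)
varWeighted <- sum(w * (x - wmean)^2) / sum(w)

# Frequencies: treated here as replicating observation i f[i] times
f <- c(1, 2, 1, 1, 3)
xRep <- rep(x, f)
varFreq <- sum((xRep - mean(xRep))^2) / (length(xRep) - 1)
```

Because w and f hold the same values here, the weighted and frequency versions share the same sum of squares and differ only in the divisor: sum(w) for weights, and the replicated count minus one for frequencies.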

VALUE:

cor returns correlations, and var returns variances and covariances (or sums of squares if SumSquares=TRUE). cov2cor returns a correlation matrix with the same dimensions as V.

If x is a matrix, the result is a matrix such that the [i,j] element is the covariance (correlation) of x[,i] and either y[,j] or x[,j]. If x is a vector, the result is a vector with length equal to the number of columns in y (or length 1 if y is not supplied).

DETAILS:

Covariances in the complex case are defined as sum(Conj(x-mean(x)) * (y-mean(y)))/(N-1) if unbiased=TRUE, where N is the number of rows in the matrix.
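
As a rough illustration of the complex definition, the sketch below evaluates the formula by hand for two short complex vectors. The data are hypothetical, and the code only restates the formula above; it is not the internal implementation of var.

```r
# Hypothetical complex vectors of equal length
x <- complex(real = c(1, 2, 4, 3), imaginary = c(0, 1, -1, 2))
y <- complex(real = c(2, 0, 1, 5), imaginary = c(1, 1, 0, -2))
N <- length(x)

# Conjugate the centered x, multiply by the centered y, divide by N-1
covXY <- sum(Conj(x - mean(x)) * (y - mean(y))) / (N - 1)

# Swapping the arguments gives the complex conjugate
covYX <- sum(Conj(y - mean(y)) * (x - mean(x))) / (N - 1)

# With y = x the products are squared moduli, so the variance is real
varX <- sum(Conj(x - mean(x)) * (x - mean(x))) / (N - 1)
```

Under this definition the covariance of y with x is the conjugate of the covariance of x with y, so a complex covariance matrix is Hermitian rather than symmetric.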

Trimmed correlations are computed by the standardized sums and differences method. Each variable is divided by its trimmed standard deviation. For each pair of variables, v(s) is the trimmed variance of the sum of the standardized variables and v(d) is the trimmed variance of their difference. The correlation is then (v(s) - v(d))/(v(s) + v(d)). Trimmed variances (and standard deviations) are calculated by omitting the N*trim smallest and the N*trim largest points. If N*trim is not an integer, it is not rounded; instead, weighted sums are used. See Gnanadesikan and Kettenring (1972), Huber (1981, pp. 202-203), or Gnanadesikan (1977, p. 132) for more details. Trimmed correlation matrices need not be positive definite; see the trimmed correlation example in EXAMPLES for an illustration. Trimmed correlations can handle missing data only through na.method="omit".
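
The sums and differences method can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm used by cor: the helpers trimVar and trimCor are hypothetical names, trimming drops floor(N*trim) points from each end (the fractional weighting for non-integer N*trim is not reproduced), and the trimmed variance is taken about the mean of the retained points.

```r
# Hypothetical helper: trimmed variance, dropping k = floor(N*trim)
# sorted points from each end. ASSUMPTION: integer trimming only.
trimVar <- function(v, trim = 0) {
    N <- length(v)
    k <- floor(N * trim)
    kept <- sort(v)[(k + 1):(N - k)]
    sum((kept - mean(kept))^2) / (length(kept) - 1)
}

# Hypothetical helper: correlation by standardized sums and differences
trimCor <- function(x, y, trim = 0) {
    zx <- x / sqrt(trimVar(x, trim))   # standardize by trimmed sd
    zy <- y / sqrt(trimVar(y, trim))
    vs <- trimVar(zx + zy, trim)       # v(s)
    vd <- trimVar(zx - zy, trim)       # v(d)
    (vs - vd) / (vs + vd)
}

# Made-up data; the last pair is a planted outlier
x <- c(1.2, 2.3, 2.9, 4.1, 5.0, 6.2, 7.1, 8.4, 9.0, 30.0)
y <- c(1.0, 2.1, 3.2, 3.9, 5.3, 5.8, 7.4, 8.1, 9.3, -20.0)
r0 <- trimCor(x, y, trim = 0)     # no trimming
r2 <- trimCor(x, y, trim = 0.2)   # drop 2 points from each end
```

With trim=0 the ratio reduces to the ordinary correlation, since the variance of the standardized sum is 2 + 2r and that of the difference is 2 - 2r. Because v(s) and v(d) are both nonnegative, each pairwise value lies between -1 and 1 even though the full matrix need not be positive definite.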

There is much discussion in the statistical literature concerning methods for missing values. See, for example, Little and Rubin (1987) or Schafer (1997). The "omit" and "available" options for the na.method argument are consistent if missing values are "missing completely at random." Informally, this means that whether a value is missing does not depend on the values (observed or missing) of any of the variables.

If na.method="available", means and variances are computed for each variable using all nonmissing values, and covariances for each pair of variables are computed using observations with no missing data for that pair. If unbiased=TRUE, the divisor for the covariance of x[,i] and x[,j] (or y[,j]) is N[i,j] - 1 + (1 - N[i,j]/N[i]) * (1 - N[i,j]/N[j]). Here, N[i,j] is the number of observations with both x[,i] and x[,j] present, and N[i] and N[j] are the numbers of observations with x[,i] and x[,j] present, respectively. This is unbiased if the data are missing completely at random, but it may give correlations outside the range -1 to 1. There are various ad hoc methods to correct this and to force matrices to be positive definite, but these are not used by cor and var. For a more rigorous alternative, use the S+Missing Data library.
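
To make the divisor concrete, the sketch below counts N[i], N[j], and N[i,j] for a small two-column matrix and evaluates the expression. The data are hypothetical and the variable names are illustrative; only the divisor is computed, not the covariance itself.

```r
# Hypothetical two-column data with scattered missing values
x <- cbind(c(1, 2, NA, 4, 5, 6, NA, 8),
           c(2, NA, 3, 5, 4, 7, 8, 9))

present <- !is.na(x)
Ni <- sum(present[, 1])                    # N[i]: rows with column 1 present
Nj <- sum(present[, 2])                    # N[j]: rows with column 2 present
Nij <- sum(present[, 1] & present[, 2])    # N[i,j]: rows with both present

# Divisor quoted above for unbiased=TRUE
divisor <- Nij - 1 + (1 - Nij/Ni) * (1 - Nij/Nj)
```

Here N[i] = 6, N[j] = 7, and N[i,j] = 5, so the divisor is 4 + (1/6)(2/7), slightly more than N[i,j] - 1. When nothing is missing, all three counts equal N and the divisor reduces to the usual N - 1.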

An alternative to these methods is to use maximum likelihood estimation of variances or covariances (under the assumption of joint normality), using the emGauss function in the S+Missing Data library.

If na.method="include", a missing value in the jth column of x causes row j of the output to contain all NAs. A missing value in the kth column of y (or x if y is not supplied) causes column k of the output to contain all NAs.

Variances are computed by the numerically accurate corrected two-pass method described in Chan, Golub, and LeVeque (1983). Summations are done by adding results for groups of size 256, then adding the group sums. This is motivated by the numerically-accurate pairwise summation method described in the same article.
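
The two ideas in this paragraph can be sketched as follows. This is an illustrative reading of the description, not the compiled code behind var: groupedSum and twoPassVar are hypothetical helper names, and the correction term follows the corrected two-pass formula from Chan, Golub, and LeVeque (1983).

```r
# Hypothetical helper: sum in groups of a fixed size, then add group sums
groupedSum <- function(v, size = 256) {
    starts <- seq(1, length(v), by = size)
    partial <- numeric(length(starts))
    for (g in seq(along = starts)) {
        last <- min(starts[g] + size - 1, length(v))
        partial[g] <- sum(v[starts[g]:last])
    }
    sum(partial)
}

# Hypothetical helper: corrected two-pass sample variance
twoPassVar <- function(v) {
    N <- length(v)
    m <- groupedSum(v) / N          # pass 1: the mean
    d <- v - m                      # pass 2: deviations from the mean
    # The second term is zero in exact arithmetic; it corrects for
    # rounding error in the computed mean.
    (groupedSum(d^2) - groupedSum(d)^2 / N) / (N - 1)
}

# Data with a large offset, where the one-pass formula
# sum(v^2) - sum(v)^2/N can lose accuracy
v <- 1e9 + c(4, 7, 13, 16)
twoPassVar(v)
```

Summing in fixed-size groups keeps each partial sum comparable in magnitude to its terms, which limits the growth of rounding error relative to one long running sum.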

The correlation matrix returned by cov2cor is the result of dividing each row and each column by the square root of its diagonal element.
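
The scaling described here can be written with outer. The following is a minimal sketch with a made-up covariance matrix; it mirrors the description above and is not necessarily how cov2cor is implemented.

```r
# Hypothetical 3 by 3 covariance matrix
V <- matrix(c(4,   2,   0.6,
              2,   9,   1.5,
              0.6, 1.5, 1), nrow = 3, byrow = TRUE)

# Divide element [i,j] by sqrt(V[i,i]) * sqrt(V[j,j])
s <- sqrt(diag(V))
R <- V / outer(s, s)
```

The result has ones on the diagonal, and each off-diagonal element is the covariance divided by the product of the two standard deviations.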

REFERENCES:

Chan, T., Golub, G., and LeVeque, R. (1983). Algorithms for computing the sample variance: analysis and recommendations. The American Statistician 37: 242-247.

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.

Gnanadesikan, R. and Kettenring, J.R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28: 81-124.

Huber, P.J. (1981). Robust Statistics. New York: Wiley.

Little, R.J.A., and Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: Wiley.

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

EXAMPLES:

# 7 by 7 correlation matrix for the longley data
cor(cbind(longley.x, longley.y))
# The same thing
cov2cor(var(cbind(longley.x, longley.y)))

# 6 by 1 matrix of covariances
var(longley.x, longley.y)

# Column variances
diag(var(longley.x))
# A faster method for column variances
colVars(longley.x)

# Construct random missing data
x <- longley.x
x[runif(96) > .9] <- NA
# This fails, since the default na.method is "fail";
# try() reports the error and lets the remaining examples run
try(var(x))
# This handles missing data
var(x, na.method="available")

# A trimmed correlation matrix that is not positive definite
eigen(cor(testscores, trim=0.42))$values

# EM estimates for multivariate normal
library("missing")
gaussFit <- emGauss(cholesterol)

# covariance matrix
covtest <- paramIter(gaussFit, expand=T)$sigma

# standard deviations (from the covariance matrix stored in covtest)
std <- sqrt(diag(covtest))

# correlation matrix
covtest / outer(std, std)