var(x, y, na.method="fail", unbiased=T, SumSquares=F, weights=NULL, freq=NULL, ...)
cor(x, y, trim=0, na.method="fail", unbiased=T, weights=NULL, freq=NULL, ...)
cov2cor(V)
x: a vector, matrix, data frame, or bdFrame. If a matrix, columns represent variables and rows represent observations. If a data frame or a bdFrame, non-numeric variables result in missing values in the result.
y: a vector, matrix, data frame, or bdFrame. If a matrix, columns represent variables and rows represent observations. If a data frame or a bdFrame, non-numeric variables result in missing values in the result. This argument must have the same number of observations as x.
trim: a number between 0 and 0.5 that specifies the proportion trimmed in the internal calculations for cor. This should be larger than the suspected fraction of outliers.
na.method: a character string specifying how missing values are handled. The options are "fail" (stop if any missing data are found), "omit" (omit rows with any missing data), "include" (missing values in the input result in missing values in the output), and "available" (use available observations; see DETAILS below).
unbiased: if TRUE, variances are the sample variances sum((x-mean(x))^2)/(N-1) for a vector of length N. These variances are unbiased if the values in x are obtained by simple random sampling. If unbiased=FALSE, the definition sum((x-mean(x))^2)/N is used instead. By default, unbiased=TRUE.
SumSquares: if TRUE, the unnormalized sums of squares are returned, with no division by either N or N-1. In this case, the argument unbiased is ignored. By default, SumSquares=FALSE.
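The three definitions above can be checked by hand on a small vector. This is a sketch in plain S using only basic arithmetic; the variable names are illustrative and it does not call var itself.

```r
# Hand computation of the three quantities described for var.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(x)

# Unnormalized sum of squares (what SumSquares=TRUE is described as returning)
ss <- sum((x - mean(x))^2)

# Divisor N-1: the definition given for unbiased=TRUE
varUnbiased <- ss / (N - 1)

# Divisor N: the definition given for unbiased=FALSE
varBiased <- ss / N
```

Here the mean is 5, so ss is 32, varBiased is 4, and varUnbiased is 32/7.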
weights: a vector of weights with one element per observation in x. If present, the arguments unbiased and trim cannot be used, and the definition used for variance is sum(weights * (x-mean(x, weights=weights))^2)/sum(weights). If SumSquares=TRUE, division by sum(weights) is omitted.
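A small worked instance of the weighted definition, written with basic arithmetic only. It assumes that the weighted mean in the formula is sum(weights * x)/sum(weights); the names wMean, wSS, and wVar are illustrative.

```r
# Weighted variance computed directly from the stated formula.
x <- c(1, 2, 3, 4)
w <- c(1, 1, 2, 4)

# Assumption: the weighted mean is sum(w * x) / sum(w)
wMean <- sum(w * x) / sum(w)

# Weighted sum of squares (no division, as described for SumSquares=TRUE)
wSS <- sum(w * (x - wMean)^2)

# Weighted variance: divide by the total weight
wVar <- wSS / sum(w)
```

For these values the weighted mean is 3.125, the weighted sum of squares is 8.875, and the weighted variance is 1.109375.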
freq: a vector with one element per observation in x, giving frequencies; the values should be integers. Calling cor(x, freq=f) is equivalent to replicating the observations in x according to f, e.g. cor(x[rep(1:nrow(x), f),,drop=FALSE]).
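The replication idea can be illustrated without the freq argument itself. The sketch below expands a small matrix by row frequencies and confirms that a frequency-weighted covariance with divisor N-1 matches the covariance of the expanded rows; all names are illustrative and the covariance is computed by hand.

```r
# Three distinct observations with frequencies 2, 1, 3
x <- cbind(a = c(1, 2, 4), b = c(3, 1, 2))
f <- c(2, 1, 3)

# Row index with each row repeated f times: 1 1 2 3 3 3
idx <- rep(1:nrow(x), f)
xRep <- x[idx, , drop = FALSE]

# Covariance of a and b using the frequencies directly
N <- sum(f)
ma <- sum(f * x[, "a"]) / N
mb <- sum(f * x[, "b"]) / N
covFreq <- sum(f * (x[, "a"] - ma) * (x[, "b"] - mb)) / (N - 1)

# The same covariance from the expanded data
da <- xRep[, "a"] - mean(xRep[, "a"])
db <- xRep[, "b"] - mean(xRep[, "b"])
covRep <- sum(da * db) / (nrow(xRep) - 1)
```

The two results agree, which is the sense in which frequencies stand in for repeated rows.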
cor returns correlations and var returns variances and covariances (or sums of squares). cov2cor returns a correlation matrix like V.
If x is a matrix, the result is a matrix such that the [i,j] element is the covariance (correlation) of x[,i] and either y[,j] or x[,j]. If x is a vector, the result is a vector with length equal to the number of columns in y (or length 1 if y is not supplied).
Covariances in the complex case are defined as sum(Conj(x-mean(x)) * (y-mean(y)))/(N-1) if unbiased=TRUE, where N is the number of rows in the matrix.
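The complex definition can be evaluated directly. The sketch below uses made-up data and illustrative names; it shows two consequences of placing Conj on the first factor: swapping the arguments conjugates the result, and the covariance of a vector with itself is real.

```r
x <- c(1+2i, 2-1i, 0+1i, 3+0i)
y <- c(2+0i, 1+1i, -1+2i, 0-1i)
N <- length(x)

# Covariance as defined above, with divisor N-1
covXY <- sum(Conj(x - mean(x)) * (y - mean(y))) / (N - 1)

# Swapping x and y gives the complex conjugate
covYX <- sum(Conj(y - mean(y)) * (x - mean(x))) / (N - 1)

# The variance of x has zero imaginary part
varX <- sum(Conj(x - mean(x)) * (x - mean(x))) / (N - 1)
```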
Trimmed correlations are computed by the standardized sums and differences method. Each variable is divided by a trimmed standard deviation. For each pair of variables, v(s) is the trimmed variance of the sum of the standardized variables and v(d) is the trimmed variance of the difference of the standardized variables. The correlation is then (v(s) - v(d))/(v(s) + v(d)).
Trimmed variances (and standard deviations) are calculated by omitting the N*trim smallest and largest points. If N*trim is not an integer, it is not rounded; instead weighted sums are used. See Gnanadesikan and Kettenring (1972), Huber (1981, pp 202-203), or Gnanadesikan (1977, p 132) for more details. Trimmed correlation matrices need not be positive definite; see the trimmed correlation example below for an illustration.
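The sums and differences method can be sketched for a single pair of variables. This is an illustration under stated assumptions, not the code used by cor: it handles only the case where N*trim is a whole number (the weighted treatment of fractional N*trim is not reproduced), it measures deviations from the mean of the retained points, and the names trimVar and trimCor are hypothetical.

```r
# Trimmed variance: drop the k smallest and k largest values, k = N*trim.
# Assumptions: N*trim is a whole number, and deviations are taken from
# the mean of the values that remain.
trimVar <- function(z, trim) {
  N <- length(z)
  k <- round(N * trim)
  kept <- sort(z)[(k + 1):(N - k)]
  sum((kept - mean(kept))^2) / length(kept)
}

# Correlation from trimmed variances of standardized sums and differences.
trimCor <- function(x, y, trim) {
  xs <- x / sqrt(trimVar(x, trim))   # divide by trimmed standard deviation
  ys <- y / sqrt(trimVar(y, trim))
  vs <- trimVar(xs + ys, trim)       # v(s)
  vd <- trimVar(xs - ys, trim)       # v(d)
  (vs - vd) / (vs + vd)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
r0 <- trimCor(x, y, trim = 0)     # no trimming
r1 <- trimCor(x, y, trim = 0.1)   # drop one point from each end
```

With trim = 0 the ratio reduces algebraically to the ordinary correlation, since v(s) = 2 + 2r and v(d) = 2 - 2r for standardized variables. Because each pair is handled separately, nothing forces the assembled matrix to be positive definite.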
Trimmed correlations handle missing data using only na.method="omit".
There is much discussion in the statistical literature concerning methods for missing values. See, for example, Little and Rubin (1987) or Schafer (1997). The "omit" and "available" options for the na.method argument are consistent if missing values are "missing completely at random." Informally, this means that whether a value is missing does not depend on the values (observed or missing) of any of the variables.
If na.method="available", means and variances are computed for each variable using all nonmissing values, and covariances for each pair of variables are computed using observations with no missing data for that pair. If unbiased=TRUE, the divisor for the covariance of x[,i] and x[,j] (or y[,j]) is (N[i,j]-1+(1-N[i,j]/N[i])(1-N[i,j]/N[j])). Here, N[i,j] is the number of observations with both x[,i] and x[,j] present, and N[i] is the number with x[,i] present. This is unbiased if the data are missing completely at random, but it may give correlations outside the range -1 to 1. There are various ad hoc methods to correct this and to force matrices to be positive definite, but these are not used by cor and var. For a more rigorous alternative, use the S+Missing Data library.
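The divisor for available-case covariances can be evaluated on its own. The sketch below computes only the divisor from the counts N[i], N[j], and N[i,j]; the function name availDivisor is hypothetical and the covariance itself is not reproduced.

```r
# Divisor N[i,j] - 1 + (1 - N[i,j]/N[i]) * (1 - N[i,j]/N[j])
availDivisor <- function(xi, xj) {
  Ni <- sum(!is.na(xi))                  # observations with xi present
  Nj <- sum(!is.na(xj))                  # observations with xj present
  Nij <- sum(!is.na(xi) & !is.na(xj))    # observations with both present
  Nij - 1 + (1 - Nij / Ni) * (1 - Nij / Nj)
}

xi <- c(1, NA, 3, 4, 5, 6)
xj <- c(2, 3, NA, NA, 6, 7)
d <- availDivisor(xi, xj)
```

Here N[i] = 5, N[j] = 4, and N[i,j] = 3, so the divisor is 2 + 0.4 * 0.25 = 2.1. With no missing values all three counts equal N and the divisor reduces to N-1.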
An alternative to these methods is to use maximum likelihood estimation of variances or covariances (under the assumption of joint normality), using the emGauss function in the S+Missing Data library.
If na.method="include", a missing value in the jth column of x causes row j of the output to contain all NAs. A missing value in the kth column of y (or x if y is not supplied) causes column k of the output to contain all NAs.
Variances are computed by the numerically accurate corrected two-pass method described in Chan, Golub, and LeVeque (1983). Summations are done by adding results for groups of size 256, then adding the group sums. This is motivated by the numerically-accurate pairwise summation method described in the same article.
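A rough sketch of the two ideas in this paragraph: the corrected two-pass formula, which subtracts sum(x-mean)^2/N from the sum of squared deviations to absorb rounding error in the mean, and summation in groups of 256. This is an illustration, not the internal code; the names groupedSum and twoPassVar are hypothetical.

```r
# Sum in blocks of `size`, then add the block sums.
groupedSum <- function(z, size = 256) {
  n <- length(z)
  if (n == 0) return(0)
  starts <- seq(1, n, by = size)
  partial <- numeric(length(starts))
  for (g in 1:length(starts)) {
    last <- min(starts[g] + size - 1, n)
    partial[g] <- sum(z[starts[g]:last])
  }
  sum(partial)
}

# Corrected two-pass variance with divisor N-1.
twoPassVar <- function(x) {
  N <- length(x)
  m <- groupedSum(x) / N      # first pass: the mean
  d <- x - m                  # second pass: deviations from the mean
  (groupedSum(d^2) - groupedSum(d)^2 / N) / (N - 1)
}

# A large offset does not disturb the result
x <- 1e9 + c(4, 7, 13, 16)
v <- twoPassVar(x)
```

The deviations here are -6, -3, 3, and 6, so the variance is 90/3 = 30 regardless of the 1e9 offset.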
The correlation matrix returned by cov2cor is the result of dividing each row and each column by the square root of its diagonal element.
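The row and column scaling can be written out for a small covariance matrix. This sketch uses only sqrt, diag, and outer; the matrix values and the names s and corMat are illustrative.

```r
# A 2 by 2 covariance matrix with variances 4 and 9
V <- matrix(c(4, 2, 2, 9), nrow = 2)

# Square roots of the diagonal elements: 2 and 3
s <- sqrt(diag(V))

# Dividing element [i,j] by s[i]*s[j] scales row i and column j at once
corMat <- V / outer(s, s)
```

The diagonal of corMat is 1 and the off-diagonal element is 2/(2*3) = 1/3.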
Chan, T., Golub, G., and LeVeque, R. (1983). Algorithms for computing the sample variance: analysis and recommendations. The American Statistician 37: 242-247.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.
Gnanadesikan, R. and Kettenring, J.R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28: 81-124.
Huber, P.J. (1981). Robust Statistics. New York: Wiley.
Little, R.J.A., and Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: Wiley.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
# 7 by 7 correlation matrix for the longley data
cor(cbind(longley.x, longley.y))
# The same thing
cov2cor(var(cbind(longley.x, longley.y)))
# 6 by 1 matrix of covariances
var(longley.x, longley.y)
# Column variances
diag(var(longley.x))
# A faster method for column variances
colVars(longley.x)

# Construct random missing data
x <- longley.x
x[runif(96) > .9] <- NA
# This fails, since the default na.method is "fail"
var(x)
# This handles missing data
var(x, na.method="available")

# A trimmed correlation matrix that is not positive definite
eigen(cor(testscores, trim=0.42))$values

# EM estimates for multivariate normal
library("missing")
gaussFit <- emGauss(cholesterol)
# covariance matrix
covtest <- paramIter(gaussFit, expand=T)$sigma
# standard deviations
std <- sqrt(diag(covtest))
# correlation matrix
covtest / outer(std, std)