Computes a matrix of Hoeffding's (1948) D statistics for all possible pairs of columns of a matrix. D is a measure of the distance between F(x,y) and G(x)H(y), where F(x,y) is the joint CDF of X and Y, and G and H are the marginal CDFs. Missing values are deleted in pairs rather than by deleting all rows of x that have any missing values.

The D statistic is robust against a wide variety of alternatives to independence, such as non-monotonic relationships. The larger the value of D, the more dependent are X and Y (for many types of dependencies). The D used here is 30 times Hoeffding's original D, and ranges from -0.5 to 1.0 if there are no ties in the data.
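For orientation, the distance that D measures can be written out as below. This is a sketch in the notation of the description; the symbols \Delta (the population quantity) and D_n (Hoeffding's original sample statistic) are labels introduced here for illustration, and the bounds on D_n are simply the stated range of D divided by 30.

```latex
\Delta = \int \bigl[ F(x,y) - G(x)\,H(y) \bigr]^{2} \, dF(x,y),
\qquad
D = 30\, D_n ,
\qquad
-\tfrac{1}{60} \le D_n \le \tfrac{1}{30}
\;\Longrightarrow\;
-0.5 \le D \le 1.0 \quad \text{(no ties)}
```

Under independence F(x,y) = G(x)H(y), so \Delta = 0; the sample statistic can still fall slightly below zero, which is why the lower limit is negative.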
print.hoeffd prints the information derived by hoeffd. The higher the value of D, the more dependent are x and y.
hoeffd(x)
hoeffd(x, y)

## S3 method for class 'hoeffd'
print(x, ...)
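x    a numeric matrix with at least 5 rows and at least 2 columns (if y is absent), or an object created by hoeffd
y    a numeric vector or matrix which will be concatenated to x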
Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by linear interpolation on the table in Hollander and Wolfe, which uses the asymptotically equivalent Blum-Kiefer-Rosenblatt statistic. For P<.0001 or >0.5, P values are computed using a well-fitting linear regression function in log P vs. the test statistic. Ranks (but not bivariate ranks) are computed using efficient algorithms (see reference 3).
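The roles of midranks and bivariate ranks can be made concrete with a naive O(n^2) sketch in base R. It is not the package's implementation: hoeffd_naive is a hypothetical helper, it uses the textbook rank formula for D (with the 1/2 and 1/4 tie weights for bivariate ranks), it assumes complete data with at least 5 observations, and it produces no P-values.

```r
# Naive illustration of the D statistic (30 times Hoeffding's original D).
# hoeffd_naive is a hypothetical name, not a function from any package.
hoeffd_naive <- function(x, y) {
  n <- as.numeric(length(x))  # numeric, to avoid integer overflow below
  stopifnot(length(y) == n, n >= 5, !anyNA(x), !anyNA(y))

  # Univariate midranks: tied values share the average of their ranks
  R <- rank(x, ties.method = "average")
  S <- rank(y, ties.method = "average")

  # Bivariate rank: 1 + number of points with both x and y smaller,
  # counting 1/2 for a tie in one coordinate and 1/4 for a tie in both
  Q <- vapply(seq_along(x), function(i) {
    lx <- x < x[i]; ly <- y < y[i]
    ex <- x == x[i]; ey <- y == y[i]
    1 + sum(lx & ly) + 0.5 * sum(ex & ly) + 0.5 * sum(lx & ey) +
      0.25 * (sum(ex & ey) - 1)  # subtract 1 to exclude point i itself
  }, numeric(1))

  D1 <- sum((Q - 1) * (Q - 2))
  D2 <- sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
  D3 <- sum((R - 2) * (S - 2) * (Q - 1))

  30 * ((n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3) /
    (n * (n - 1) * (n - 2) * (n - 3) * (n - 4))
}

# A perfectly monotone relationship without ties reaches the upper limit
hoeffd_naive(1:5, 1:5)   # 1
```

Because every point is compared with every other point, this sketch is only practical for small samples.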
A list with elements D, the matrix of D statistics; n, the matrix of the number of observations used in analyzing each pair of variables; and P, the asymptotic P-values. Pairs with fewer than 5 non-missing values have the D statistic set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.
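The layout of n under pairwise deletion can be reproduced with base R alone. The following is a small sketch of the counting rule described above, not the code hoeffd runs; the names m, present and n_pairs are illustrative.

```r
# Three variables, one of which has a missing value
m <- cbind(x = c(-2, -1, 0, 1, 2),
           z = c(1, 2, 3, 4, NA),
           q = c(1, 2, 3, 4, 5))

# 1 where a value is present, 0 where it is NA
present <- (!is.na(m)) + 0

# Entry [i, j] counts rows where both variables are present;
# the diagonal counts the non-NA values of each single variable
n_pairs <- crossprod(present)
n_pairs

# Off-diagonal pairs with fewer than 5 complete observations are
# the ones for which D would be reported as NA
n_pairs < 5
```

Here every pair involving z has only 4 complete observations, while the pair (x, q) keeps all 5.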
Frank Harrell
Department of Biostatistics
Vanderbilt University
f.harrell@vanderbilt.edu
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.
Hollander M. and Wolfe D.A. (1973): Nonparametric Statistical Methods, pp. 228–235, 423. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
library(Hmisc)

x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
q <- c(1, 2, 3, 4, 5)
# Pairs involving z have only 4 non-missing values, so their D is NA
hoeffd(cbind(x, y, z, q))

# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10, 10, length.out = 200)
y <- x * sign(runif(200, -1, 1))
plot(x, y)
hoeffd(x, y)