hclust(). A small function naclus is also provided which depicts similarities in which observations are missing for variables in a data frame. The similarity measure is the fraction of NAs in common between any two variables. The diagonals of this sim matrix are the fraction of NAs in each variable by itself. naclus also computes na.per.obs, the number of missing variables in each observation, and mean.na, a vector whose ith element is the mean number of missing variables other than variable i, for observations in which variable i is missing. The naplot function makes several plots (see the which argument).
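The quantities described above can be sketched in base R. This is only an illustration of the definitions (fraction of NAs in common, na.per.obs, mean.na) on a small made-up data frame; it is not the code naclus itself runs, and the data frame and intermediate names such as miss are assumptions made for the example.

```r
# Toy data frame with missing values (hypothetical data)
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, NA, 3, 4),
                 c = c(1, 2, 3, NA))

# 0/1 indicator matrix: 1 where a value is NA
miss <- 1 * is.na(df)

# sim[i, j] = fraction of observations in which variables i and j are both NA;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(miss) / nrow(miss)

# Number of missing variables in each observation
na.per.obs <- rowSums(miss)

# For each variable i: mean number of OTHER variables missing,
# taken over the observations in which variable i is missing
mean.na <- sapply(seq_len(ncol(miss)), function(i) {
  rows <- miss[, i] == 1
  if (any(rows)) mean(na.per.obs[rows] - 1) else NA
})
names(mean.na) <- colnames(miss)

print(sim)
print(na.per.obs)
print(mean.na)
```

Here a and b are both missing only in the second observation, so their off-diagonal similarity is 1/4.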
To avoid generating too many dummy variables for multi-valued character or categorical predictors, varclus will automatically combine infrequent cells of such variables using an auxiliary function combine.levels that is defined here.
plotMultSim plots multiple similarity matrices, with the similarity measure being on the y-axis of each subplot.
na.pattern prints a frequency table of all combinations of missingness for multiple variables. If there are 3 variables, a frequency table entry labeled 110 corresponds to the number of observations for which the first and second variables were missing but the third variable was not missing.
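The 0/1 labeling can be reproduced with a few lines of base R. The sketch below illustrates the idea rather than the na.pattern implementation; the data frame and the names codes and pattern.freq are made up for the example.

```r
# Hypothetical data frame with three variables
df <- data.frame(v1 = c(NA, NA, 1, 2, NA),
                 v2 = c(NA, NA, 3, NA, NA),
                 v3 = c(5, 6, 7, 8, NA))

# One string per observation: "1" where the variable is missing, "0" otherwise
codes <- apply(1 * is.na(df), 1, paste, collapse = "")

# Frequency of each missingness pattern
pattern.freq <- table(codes)
print(pattern.freq)
```

In this toy data the entry labeled 110 has frequency 2, because the first two observations are missing v1 and v2 but not v3.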
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method=if(.R.)"complete" else "compact",
        data, subset, na.action, minlev=0.05)

## S3 method for class 'varclus':
print(x, abbrev=FALSE, ...)

## S3 method for class 'varclus':
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

combine.levels(x, minlev=.05)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE, add=FALSE,
            lty=par('lty'), col=par('col'), lwd=par('lwd'),
            vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)
If x is a formula, model.matrix is used to convert it to a design matrix. If the formula excludes an intercept (e.g., ~ a + b -1), the first categorical (factor) variable in the formula will have dummy variables generated for all levels instead of omitting one for the first level. For combine.levels, x is a character, category, or factor vector (or other vector that is converted to factor). For plot and print, x is an object created by varclus. For na.pattern, x is a list, data frame, or numeric matrix. For plotMultSim, x is a numeric vector specifying the ordered unique values on the x-axis, corresponding to the third dimension of s.
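The effect of dropping the intercept can be checked directly with model.matrix in base R. The sketch below uses a made-up data frame (age and country are hypothetical variables) and R's default treatment contrasts; it shows only the design-matrix step, not anything else varclus does.

```r
# Hypothetical data: one numeric and one categorical predictor
d <- data.frame(age = c(30, 41, 52, 63, 47, 58),
                country = factor(c("US", "UK", "FR", "US", "FR", "UK")))

# With an intercept, the first level (FR) is omitted
with.int <- model.matrix(~ age + country, data = d)

# Without an intercept, the first factor gets a dummy for every level
no.int <- model.matrix(~ age + country - 1, data = d)

colnames(with.int)  # "(Intercept)" "age" "countryUK" "countryUS"
colnames(no.int)    # "age" "countryFR" "countryUK" "countryUS"
```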
varclus, for example. A use for this might be to show pairwise similarities of variables across time in a longitudinal study (see the example below). If vname is not given, s must have dimnames.
similarity="bothpos" uses as a similarity measure the proportion of observations for which two variables are both positive. similarity="ccbothpos" uses a chance-corrected measure which is the proportion of observations for which both variables are positive minus the product of the two marginal proportions. This difference is expected to be zero under independence. For diagonals, "ccbothpos" still uses the proportion of positives for the single variable. So "ccbothpos" is not really a similarity measure, and clustering is not done. This measure is useful for plotting with plotMultSim (see the last example).
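The two measures can be written out in base R as follows. This is a sketch of the definitions given above on made-up binary data without missing values; the matrix x and the names pos, p, bothpos and ccbothpos are chosen for the example and do not reproduce varclus internals.

```r
# Hypothetical 0/1 data, 8 observations on 3 variables
x <- cbind(x1 = c(1, 1, 0, 0, 1, 0, 1, 0),
           x2 = c(1, 0, 0, 1, 1, 0, 1, 0),
           x3 = c(0, 0, 1, 1, 0, 1, 0, 0))

pos <- 1 * (x > 0)            # indicator of a positive value
p   <- colMeans(pos)          # marginal proportion positive for each variable

# "bothpos": proportion of observations in which both variables are positive
bothpos <- crossprod(pos) / nrow(pos)

# "ccbothpos": subtract the product of the marginal proportions
# (the value expected under independence) from the off-diagonals;
# the diagonals keep the single-variable proportion of positives
ccbothpos <- bothpos - outer(p, p)
diag(ccbothpos) <- p

print(round(bothpos, 4))
print(round(ccbothpos, 4))
```

For x1 and x2 the joint proportion is 3/8 and the marginals are both 1/2, so the chance-corrected value is 0.375 - 0.25 = 0.125.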
If x is not a formula, it may be a data matrix or a similarity matrix. By default, it is assumed to be a data matrix.
hclust. The default, for both varclus and naclus, is "compact" (for R it is "complete").
x is a formula. The default na.action is na.retain, defined by varclus. This causes all observations to be kept in the model frame, with later pairwise deletion of NAs.
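Pairwise deletion means that each pair of variables uses every observation in which both are present, so the effective sample size differs from pair to pair. The base R sketch below illustrates this with simulated data. The clustering step at the end assumes that squared Spearman correlation serves as the similarity and that one minus the similarity serves as the distance; that choice is an assumption made for the illustration, not something stated in this section.

```r
set.seed(1)
x <- cbind(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
x[1:5,  "x1"] <- NA           # introduce some missing values
x[4:10, "x2"] <- NA

# Spearman correlations computed with pairwise deletion of NAs
r <- cor(x, method = "spearman", use = "pairwise.complete.obs")

# Matrix of pairwise sample sizes: n[i, j] = observations with both i and j present
n <- crossprod(1 * (!is.na(x)))
print(n)

# Assumed similarity-to-distance conversion, then hierarchical clustering
h <- hclust(as.dist(1 - r^2), method = "complete")
```

The sample size is a matrix here because x1 and x2 share only 40 complete observations, while each is paired with x3 on 45 and 43 observations respectively.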
similarity.
TRUE to plot a legend defining the abbreviations
x and y defining coordinates of the upper left corner of the legend. Default is locator(1).
maxlen characters are truncated at maxlen.
plclust (or to dotchart or dotchart2 for naplot).
naclus
"all" meaning to have naplot make 4 separate plots. To make only one of the plots, use which="na per var" (dot chart of fraction of NAs for each variable), "na per obs" (dot chart showing frequency distribution of number of variables having NAs in an observation), "mean na" (dot chart showing mean number of other variables missing when the indicated variable is missing), or "na per var vs mean na", a scatterplot showing on the x-axis the fraction of NAs in the variable and on the y-axis the mean number of other variables that are NA when the indicated variable is NA.
"OTHER". Otherwise, the lowest frequency cell is combined with the next lowest frequency cell, and the level name is the combination of the two old levels.
TRUE to abbreviate variable names for plotting or printing. It is set to TRUE automatically if legend.=TRUE.
s.
Set slimds to TRUE to scale diagonals and off-diagonals separately
TRUE to add similarities to an existing plot (usually specifying lty or col)
plotMultSim
s
FALSE to suppress drawing of labels in the x direction
n where n is the number of variables, to set aside for y-axis labels
options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily by varclus to make sure that ordinary dummy variables are generated for factor variables. If a categorical or character variable has no level containing at least a fraction minlev of the data, that variable is omitted from consideration and a warning is printed.
For varclus or naclus, a list of class varclus with elements call (containing the calling statement), sim (similarity matrix), n (sample size used if x was not already a correlation matrix; n is a matrix), hclust, the object created by hclust, similarity, and method. For plot, returns the object created by plclust. naclus also returns the two vectors listed under description (na.per.obs and mean.na), and naplot returns an invisible vector that is the frequency table of the number of missing variables per observation. plotMultSim invisibly returns the limits of similarities used in constructing the y-axes of each subplot. For similarity="ccbothpos" the hclust object is NULL.
na.pattern creates an integer vector of frequencies.
Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harrell@vanderbilt.edu
Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)

# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in if(.R.)c("ward","complete","median") else
         c("compact","connected","average")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)   # builtin function

x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)

# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=month==mon & sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)