euclidean, maximum, manhattan, and binary.
dist(x, metric = "euclidean")
x. Missing values (NAs) are allowed.
"euclidean", "maximum", "manhattan", and "binary". Euclidean distances are root sum-of-squares of differences, "maximum" is the maximum difference, "manhattan" is the sum of absolute differences, and "binary" is the proportion of non-zeros that two vectors do not have in common (the number of occurrences of a zero and a one, or a one and a zero, divided by the number of times at least one vector has a one).
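The four definitions above can be sketched for a single pair of vectors. This is an illustration only, written in R-compatible S; pair.distance is a hypothetical helper, not part of dist, and it assumes two numeric vectors of equal length with no missing values.

```r
# Hypothetical helper: distance between two numeric vectors a and b
# (equal length, no NAs) under one of the four metrics described above.
pair.distance <- function(a, b, metric = "euclidean") {
    d <- abs(a - b)
    switch(metric,
        euclidean = sqrt(sum(d^2)),    # root sum-of-squares of differences
        maximum = max(d),              # largest absolute difference
        manhattan = sum(d),            # sum of absolute differences
        binary = {
            either <- a != 0 | b != 0            # at least one is non-zero
            mismatch <- (a != 0) != (b != 0)     # exactly one is non-zero
            sum(mismatch) / sum(either)
        },
        stop("unknown metric"))
}

a <- c(1, 0, 1, 0)
b <- c(1, 1, 0, 0)
pair.distance(a, b, "euclidean")   # sqrt(2)
pair.distance(a, b, "maximum")     # 1
pair.distance(a, b, "manhattan")   # 2
pair.distance(a, b, "binary")      # 2/3
```

For the binary case, three positions have a non-zero in at least one vector and two of those disagree, which gives 2/3.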
x.
Since there are many distances and since the result of dist is typically an argument to hclust or cmdscale, a vector is returned, rather than a symmetric matrix. For i less than j, the distance between row i and row j is element nrow(x)*(i-1) - i*(i-1)/2 + j-i of the result.
The returned object has an attribute, Size, giving the number of objects, that is, nrow(x). The length of the vector that is returned is nrow(x)*(nrow(x)-1)/2, that is, it is of order nrow(x) squared.
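As a small check of the indexing rule, the Size attribute, and the length of the result, the following sketch looks up one distance by position. dist.index is a hypothetical helper introduced here for illustration, and the four-row matrix is made-up data; dist is called with its default metric only.

```r
# Hypothetical helper: position of the distance between rows i and j
# (i < j) in the vector returned for a matrix with n rows.
dist.index <- function(i, j, n) n * (i - 1) - i * (i - 1) / 2 + j - i

# Made-up data: 4 rows, 2 columns; the rows are (0,0), (3,4), (0,1), (1,1).
x <- matrix(c(0, 3, 0, 1,
              0, 4, 1, 1), ncol = 2)

d <- dist(x)             # default metric is "euclidean"
n <- attr(d, "Size")     # number of objects, here nrow(x)

length(d) == n * (n - 1) / 2         # 6 distances for 4 rows
d[dist.index(2, 3, n)]               # distance between rows 2 and 3
sqrt(sum((x[2, ] - x[3, ])^2))       # the same distance computed directly
```

The same helper can be used to pull individual distances out of a large result without expanding it to a full matrix.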
Missing values in a row of x are not included in any distances involving that row. If the metric is "euclidean" and ng is the number of columns in which no missing values occur for the given rows, then the distance returned is sqrt(ncol(x)/ng) times the Euclidean distance between the two vectors of length ng shortened to exclude NAs. The rule is similar for the "manhattan" metric, except that the coefficient is ncol(x)/ng. The "binary" metric excludes columns in which either row has an NA. If all values for a particular distance are excluded, the distance is NA.
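The rescaling rule for missing values can be sketched for one pair of rows. na.distance is a hypothetical helper that follows the description above for the "euclidean" and "manhattan" metrics only; it is not the code used inside dist.

```r
# Hypothetical helper: distance between two rows a and b that may contain NAs,
# rescaled as described above. length(a) plays the role of ncol(x).
na.distance <- function(a, b, metric = "euclidean") {
    keep <- !is.na(a) & !is.na(b)    # columns observed in both rows
    ng <- sum(keep)
    if (ng == 0) return(NA)          # every column excluded
    d <- abs(a[keep] - b[keep])
    p <- length(a)
    switch(metric,
        euclidean = sqrt(p / ng) * sqrt(sum(d^2)),
        manhattan = (p / ng) * sum(d),
        stop("metric not covered by this sketch"))
}

a <- c(1, NA, 3, 5)
b <- c(2, 2, NA, 1)
na.distance(a, b, "euclidean")   # sqrt(4/2) * sqrt(1 + 16) = sqrt(34)
na.distance(a, b, "manhattan")   # (4/2) * (1 + 4) = 10
na.distance(c(NA, 1), c(1, NA))  # NA: no column is complete
```

Only columns 1 and 4 are complete for this pair, so ng is 2 and the coefficients are sqrt(4/2) and 4/2.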
If the columns of a matrix are in different units, it is usually advisable to scale the matrix before using dist. A column that is much more variable than the others will dominate the distance measure.
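A minimal sketch of this advice, assuming that centring each column and dividing by its standard deviation (as the scale function does) is an acceptable form of scaling for the data at hand. The matrix and its column names are made up for illustration.

```r
# Made-up data: the second column is on a far larger scale than the first.
x <- cbind(height = c(1.60, 1.75, 1.90),
           income = c(30000, 52000, 31000))

d.raw <- dist(x)              # differences in income swamp those in height
d.scaled <- dist(scale(x))    # each column centred, unit standard deviation

d.raw
d.scaled
```

On the raw data, rows 1 and 3 are the closest pair because their incomes are nearly equal. After scaling, the height difference between them carries comparable weight and they are no longer the closest pair.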
Distance measures are used in cluster analysis and in multidimensional scaling. The choice of metric may have a large impact on the results.
Everitt, B. (1980). Cluster Analysis (second edition). Halsted, New York.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.
# create a sample object
x <- votes.repub
dist(x, "max")   # distances among rows by maximum
dist(t(x))       # distances among cols in Euclidean metric

# Below is a function that converts a distance structure to a matrix
dist2full <- function(dis) {
    n <- attr(dis, "Size")
    full <- matrix(0, n, n)
    full[lower.tri(full)] <- dis
    full + t(full)
}