The available clustering methods are "compact" (also called complete linkage), "average", and "connected" (also called single linkage).
hclust(dist, method = "compact", sim)
Either dist or sim must be specified. The dist argument is typically the output of the dist function, but it can be any data of the form returned by dist, or a full, symmetric matrix. For details on the required structure, see the dist help file. Missing values are not allowed.
"average"
,
"connected"
(single linkage) and
"compact"
(complete linkage).
(The first three characters of the
method are sufficient.)
"Size"
attribute.
For details on the required structure, see the
dist
help file.
Missing values are not allowed.
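As a minimal sketch of the usage above (the data matrix here is made up, not part of this help file), a distance structure produced by dist can be passed directly to hclust:

x <- matrix(rnorm(20), ncol = 2)    # hypothetical data: 10 objects, 2 variables
d <- dist(x)                        # distance structure with a "Size" attribute
h <- hclust(d, method = "compact")  # equivalent to hclust(d, "com")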
The merge component is an (n-1) by 2 matrix, if there were n objects in the original data. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then object -j was merged at this stage. If j is positive, then the merge was with the cluster formed at the (earlier) stage j of the algorithm.
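For example, the following sketch (a made-up three-point example, not part of this help file) shows how merge is read:

x <- c(1, 2, 10)      # three points on a line
h <- hclust(dist(x))
h$merge
# row 1 contains -1 and -2: objects 1 and 2 were merged at step 1
# row 2 contains -3 and 1: object 3 was merged with the cluster formed at step 1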
At each stage the two "nearest" clusters are combined to form one bigger cluster
(initially each cluster contains a single point).
For
method="connected"
the distance between two clusters is the
minimum distance between a point in the first cluster and a point in the
second cluster. For
method="average"
the distance between clusters
is the average of the distances between the points in one cluster and
the points in the other cluster.
The largest distance between a point in one cluster and a point in the other
cluster is used in
method="compact"
.
Long thin clusters are typically created with
"connected"
, and more spherical
clusters are formed with
"compact"
.
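To make the three definitions concrete, here is a minimal sketch (the two clusters are made up for illustration) that computes each between-cluster distance directly:

A <- rbind(c(0, 0), c(0, 1))              # cluster A: two points
B <- rbind(c(4, 0), c(5, 1))              # cluster B: two points
d <- sqrt(outer(A[, 1], B[, 1], "-")^2 +  # all point-to-point distances
          outer(A[, 2], B[, 2], "-")^2)   # between A and B (2 by 2 matrix)
min(d)    # "connected" (single linkage) distance
mean(d)   # "average" linkage distance
max(d)    # "compact" (complete linkage) distance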
In hierarchical cluster displays, a decision is needed
at each merge to specify which subtree should go on the left
and which on the right.
Since, for n individuals, there are n-1 merges, there are 2^(n-1) possible orderings for the leaves in a cluster tree.
The default algorithm in
hclust
is to order the
subtrees so that the tighter cluster is on the left
(the last merge of the left subtree is at a lower value than
the last merge of the right subtree).
Individuals are the tightest clusters possible, and merges involving
two individuals place them in order by their observation number.
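A brief sketch (made-up data, not from this help file): the leaf ordering chosen by this rule is returned in the order component of the result, and is the left-to-right order used by plclust.

x <- matrix(rnorm(16), ncol = 2)   # hypothetical data: 8 objects
h <- hclust(dist(x))
h$order                            # leaf ordering, tighter subtrees to the left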
Cluster analysis divides data points into groups of points that are "close"
to each other. The
hclust
function continues to aggregate groups together
until there is just one big group. If it is necessary to choose the
number of groups, this can be decided subsequently.
Other methods (see
kmeans
) require that the number of groups be
decided from the start.
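For example (a sketch; x here stands for any numeric data matrix), the full tree can be cut afterwards with cutree, listed under See Also below:

h <- hclust(dist(x))
groups <- cutree(h, k = 3)   # choose 3 groups after the clustering is done
table(groups)                # sizes of the resulting groups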
By changing the distance metric and the clustering method, several different
cluster trees can be created from a single dataset.
No one method seems to be useful in all situations. Single linkage
(
"connected"
) can
work poorly if two distinct groups have a few "stragglers" between them.
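The effect can be simulated with a small made-up example: two separated groups joined by a few intermediate points.

set.seed(1)
g1 <- cbind(rnorm(10), rnorm(10))         # first group, centred near (0, 0)
g2 <- cbind(rnorm(10) + 10, rnorm(10))    # second group, centred near (10, 0)
bridge <- cbind(c(3, 5, 7), c(0, 0, 0))   # a few "stragglers" in between
x <- rbind(g1, g2, bridge)
# single linkage tends to chain across the bridge, complete linkage does not
table(cutree(hclust(dist(x), "connected"), k = 2))
table(cutree(hclust(dist(x), "compact"), k = 2))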
Everitt, B. (1980). Cluster Analysis (second edition). Halsted, New York.
Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
Many multivariate statistics books also include a discussion on clustering.
The functions
plclust
and
labclust
are used for plotting
the result of a hierarchical clustering.
dist
computes distance matrices.
Functions
cutree
,
clorder
, and
subtree
can be used
to manipulate the tree data structure.
kmeans
performs a non-hierarchical clustering.
# create a sample object using a built-in dataset
x <- longley.y
h <- hclust(dist(x))
plclust(h)

# use the maximum metric and average linkage
hclust(dist(x, metric="maximum"), "ave")

# cluster the states by Republican vote share and label the leaves
votes.clust <- hclust(dist(votes.repub), "ave")
plclust(votes.clust, label=state.abb)