k
clusters.
clara(x, k, metric="euclidean", stand=F, samples=5, sampsize=40 + 2 * k, save.x=T, save.diss=T)
If stand is true, the measurements in x are standardized before
calculating the dissimilarities. Measurements are standardized for each
variable (column) by subtracting the variable's mean value and dividing by
the variable's mean absolute deviation.
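The standardization described here can be sketched in a few lines of base R. The helper name standardize_mad is hypothetical and is not part of the library; it only illustrates the arithmetic, and it assumes no column is constant (a constant column would have a mean absolute deviation of zero).

```r
# Hypothetical helper: standardize each column by subtracting its mean
# and dividing by its mean absolute deviation.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x))       # subtract column means
  mean_abs_dev <- colMeans(abs(centered))    # mean absolute deviation per column
  sweep(centered, 2, mean_abs_dev, "/")      # divide each column by its deviation
}

x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))
z <- standardize_mad(x)
colMeans(z)        # each column is centered at 0
colMeans(abs(z))   # each column has mean absolute deviation 1
```

Unlike scaling by the standard deviation, the mean absolute deviation does not square the residuals, so a single extreme value inflates the scale less.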
sampsize should be higher than the number of clusters (k) and at most
the number of observations (nrow(x)).
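As a small illustration of this constraint, the check below uses a hypothetical helper, valid_sampsize, which is not part of the library. The default value 40 + 2 * k is taken from the usage line above.

```r
# Hypothetical helper: TRUE when sampsize exceeds k and does not exceed
# the number of observations n.
valid_sampsize <- function(sampsize, k, n) {
  sampsize > k && sampsize <= n
}

k <- 3
default_sampsize <- 40 + 2 * k                  # default from the usage line

valid_sampsize(default_sampsize, k, n = 500)    # TRUE
valid_sampsize(default_sampsize, k, n = 30)     # FALSE: larger than the dataset
```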
"clara"
representing the clustering.
See clara.object for details.
clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990).
Compared to other partitioning methods such as pam, it can deal with
much larger datasets. Internally, this is achieved by considering
sub-datasets of fixed size, so that the time and storage requirements
become linear in nrow(x) rather than quadratic.
Each sub-dataset is partitioned into k clusters using the same
algorithm as in the pam function.
Once k representative objects have been selected from the
sub-dataset, each observation of the entire dataset is assigned
to the nearest medoid.
The sum of the dissimilarities of the observations to their closest medoid is
used as a measure of the quality of the clustering. The sub-dataset
for which the sum is minimal is retained.
A further analysis is carried out on the final partition.
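The assignment step and the quality measure can be sketched in base R as follows. The function assign_to_medoids is a hypothetical illustration, not the library's internal code, and it assumes Euclidean dissimilarities.

```r
# Hypothetical helper: assign every row of x to its nearest medoid and
# return the sum of the distances to those medoids.
assign_to_medoids <- function(x, medoids) {
  x <- as.matrix(x)
  medoids <- as.matrix(medoids)
  n <- nrow(x)
  # One column of Euclidean distances per medoid, collected as an n x k matrix.
  d <- matrix(
    unlist(lapply(seq_len(nrow(medoids)), function(j) {
      sqrt(rowSums(sweep(x, 2, medoids[j, ])^2))
    })),
    nrow = n
  )
  cluster <- max.col(-d, ties.method = "first")   # index of the closest medoid
  cost <- sum(d[cbind(seq_len(n), cluster)])      # sum of distances to it
  list(cluster = cluster, cost = cost)
}

x <- rbind(c(0, 0), c(1, 0), c(10, 0), c(11, 0))
res <- assign_to_medoids(x, medoids = x[c(1, 3), ])
res$cluster   # 1 1 2 2
res$cost      # 2
```

A smaller cost means the observations lie closer to their medoids, which is why the sub-dataset with the minimal sum is the one that is kept.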
Each sub-dataset is forced to contain the medoids obtained from the best
sub-dataset until then.
Randomly drawn observations are added to this set until sampsize
has been reached.
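Taken together, the procedure described above can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the library's own code: it assumes the R cluster package is available (for pam and its id.med component), uses Euclidean distances only, and the name clara_sketch and the fields of its return value are hypothetical.

```r
library(cluster)  # assumed available; provides pam()

# Hypothetical sketch of the sampling scheme: cluster several sub-datasets
# with pam and keep the medoids that give the lowest total distance.
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(sampsize > k, sampsize <= n)
  best <- list(cost = Inf, medoid_ids = integer(0), cluster = integer(0))
  for (s in seq_len(samples)) {
    # Start from the best medoids found so far, then add randomly drawn
    # observations until the sub-dataset has sampsize rows.
    others <- setdiff(seq_len(n), best$medoid_ids)
    n_extra <- sampsize - length(best$medoid_ids)
    idx <- c(best$medoid_ids, others[sample.int(length(others), n_extra)])
    fit <- pam(x[idx, , drop = FALSE], k)
    medoid_ids <- idx[fit$id.med]               # row numbers in the full data
    # Distance from every observation to every candidate medoid (n x k).
    d <- vapply(medoid_ids, function(m) {
      sqrt(rowSums(sweep(x, 2, x[m, ])^2))
    }, numeric(n))
    cost <- sum(apply(d, 1, min))               # quality of this sub-dataset
    if (cost < best$cost) {
      best <- list(cost = cost, medoid_ids = medoid_ids,
                   cluster = max.col(-d, ties.method = "first"))
    }
  }
  best
}

set.seed(42)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
fit <- clara_sketch(x, 2)
x[fit$medoid_ids, ]   # coordinates of the retained medoids
table(fit$cluster)    # cluster sizes over the whole dataset
```

Only sampsize observations are ever passed to pam, so the quadratic dissimilarity matrix stays small; the pass over the full dataset computes just n times k distances per sub-dataset.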
Cluster analysis divides a dataset into groups (clusters) of observations that
are similar to each other.
Partitioning methods like pam, clara, and fanny
require that the number of clusters be given by the user.
Hierarchical methods like agnes, diana, and mona
construct a hierarchy of clusterings,
with the number of clusters ranging from one to the number of observations.
For small datasets (say with fewer than 200 observations),
the function pam can be used directly.
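For a dataset of this size, a direct call to pam might look like the following. The example assumes the R cluster package; the data and the variable names small and fit_pam are made up for illustration.

```r
library(cluster)  # assumed available; provides pam()

set.seed(1)
# 60 observations in two well-separated groups.
small <- rbind(cbind(rnorm(30, 0, 1), rnorm(30, 0, 1)),
               cbind(rnorm(30, 10, 1), rnorm(30, 10, 1)))
fit_pam <- pam(small, 2)
fit_pam$medoids            # the two representative observations
table(fit_pam$clustering)  # cluster sizes
```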
Kaufman, L. and Rousseeuw, P. J. (1990).
Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York.
Struyf, A., Hubert, M. and Rousseeuw, P. J. (1997).
Integrating robust clustering techniques in S-PLUS.
Computational Statistics and Data Analysis,
26, 17-37.
# generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2)
clarax
clarax$clusinfo
plot(clarax)