k clusters.
pam(x, k, diss=FALSE, metric="euclidean", stand=FALSE, save.x=TRUE, save.diss=TRUE)
How x is interpreted depends on the diss argument.
In case of a dissimilarity matrix, x is typically the output of daisy or dist.
A vector of length n*(n-1)/2 is also allowed
(where n is the number of observations),
and will be interpreted in the same way as the output
of the above-mentioned functions.
Missing values (NAs) are not allowed.
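As a small illustration of the n*(n-1)/2 rule, the sketch below builds a dissimilarity object with dist and checks its length. The data and the variable names are made up for the example; only base R is assumed.

```r
# Illustrative data: 10 observations on 2 variables (values are arbitrary).
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
n <- nrow(x)

# dist() stores only the lower triangle of the n-by-n dissimilarity matrix.
d <- dist(x)

# One entry per unordered pair of observations.
n_pairs <- n * (n - 1) / 2
length(d) == n_pairs
```

A plain numeric vector of that length carries the same information, which is why it can be interpreted like the output of daisy or dist.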
If diss is TRUE, then x will be considered as a dissimilarity matrix.
If FALSE, then x will be considered as a matrix of observations by variables.
If x is already a dissimilarity matrix,
then the metric argument will be ignored.
If stand is TRUE, then the measurements in x
are standardized before calculating the dissimilarities.
Measurements are standardized for each variable (column),
by subtracting the variable's mean value
and dividing by the variable's mean absolute deviation.
If x is already a dissimilarity matrix,
then this argument will be ignored.
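The standardization described above can be sketched in a few lines of base R. The helper name standardize_mad is hypothetical and is not part of any package; it only mirrors the description (subtract the column mean, divide by the column's mean absolute deviation) and is not the code pam itself runs.

```r
# Hypothetical helper: standardize each column by its mean and its
# mean absolute deviation, as described in the text.
standardize_mad <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x))        # subtract column means
  mean_abs_dev <- colMeans(abs(centered))     # mean absolute deviation per column
  sweep(centered, 2, mean_abs_dev, "/")       # divide each column by it
}

# Two variables on very different scales (illustrative values).
raw <- cbind(height = c(150, 160, 170, 180),
             income = c(1000, 3000, 2000, 6000))
z <- standardize_mad(raw)
```

Note that base R's mad() computes the median absolute deviation, which is a different quantity from the mean absolute deviation used here.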
An object of class "pam" representing the clustering.
See pam.object for details.
pam is fully described in chapter 2
of Kaufman and Rousseeuw (1990).
Compared to the k-means approach in kmeans,
the function pam has the following features:
(a) it also accepts a dissimilarity matrix;
(b) it is more robust because it minimizes a sum of dissimilarities
instead of a sum of squared Euclidean distances;
(c) it provides a novel graphical display,
the silhouette plot (see plot.partition),
which can also be used to select the number of clusters.
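One way to use silhouette information for choosing the number of clusters is to compare the average silhouette width across several values of k. The sketch below assumes R with the cluster package, and assumes the fitted object exposes the average width as silinfo$avg.width (see pam.object); the range of k and the data are arbitrary choices for the example.

```r
library(cluster)  # assumed: R's cluster package provides pam()

# Illustrative data: two well-separated groups of 10 and 15 objects.
set.seed(42)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))

# Average silhouette width for each candidate number of clusters.
ks <- 2:5
avg_width <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

# Candidate with the largest average silhouette width.
best_k <- ks[which.max(avg_width)]
```

A larger average silhouette width suggests a more clearly separated clustering, so best_k is a reasonable starting point rather than a definitive answer.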
The pam algorithm is based on the search for k representative objects,
or medoids, among the observations of the dataset.
These observations should represent the structure of the data.
After finding a set of k medoids, k clusters are constructed by assigning
each observation to the nearest medoid.
The goal is to find k representative objects
which minimize the sum of the dissimilarities of the observations
to their closest representative object.
The algorithm first looks for a good initial set of medoids (this is called
the BUILD phase). Then it finds a local minimum for the objective function,
that is, a solution such that there is no single switch of an observation with
a medoid that will decrease the objective (this is called the SWAP phase).
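The two phases can be illustrated with a deliberately naive sketch in base R. The functions total_cost and naive_pam are hypothetical names written for this example. They are not the implementation behind pam, and the SWAP loop accepts the first improving exchange it finds instead of searching for the best one.

```r
# Sum of the dissimilarities of all observations to their closest medoid.
total_cost <- function(d, medoids) {
  sum(apply(d[, medoids, drop = FALSE], 1, min))
}

naive_pam <- function(d, k) {
  d <- as.matrix(d)
  n <- nrow(d)

  # BUILD: greedily add the observation that lowers the cost the most.
  medoids <- integer(0)
  for (step in seq_len(k)) {
    candidates <- setdiff(seq_len(n), medoids)
    costs <- sapply(candidates, function(i) total_cost(d, c(medoids, i)))
    medoids <- c(medoids, candidates[which.min(costs)])
  }

  # SWAP: exchange a medoid with a non-medoid as long as the cost decreases.
  best <- total_cost(d, medoids)
  improved <- TRUE
  while (improved) {
    improved <- FALSE
    for (m in seq_along(medoids)) {
      for (i in setdiff(seq_len(n), medoids)) {
        trial <- medoids
        trial[m] <- i
        cost <- total_cost(d, trial)
        if (cost < best) {
          medoids <- trial
          best <- cost
          improved <- TRUE
        }
      }
    }
  }

  # Assign each observation to its nearest medoid.
  clustering <- apply(d[, medoids, drop = FALSE], 1, which.min)
  list(medoids = medoids, clustering = unname(clustering), cost = best)
}

# Six illustrative points forming two tight groups.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
fit <- naive_pam(dist(pts), 2)
fit$medoids   # observations 4 and 1
fit$cost      # 4
```

The loop stops when no single exchange of a medoid with a non-medoid lowers the cost, which is the local minimum described above. Each pass recomputes the full cost, so the sketch is only suitable for very small datasets.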
Cluster analysis divides a dataset into groups (clusters) of observations that
are similar to each other.
Partitioning methods like pam, clara, and fanny
require that the number of clusters be given by the user.
Hierarchical methods like agnes, diana, and mona
construct a hierarchy of clusterings,
with the number of clusters ranging from one to the number of observations.
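The difference between the two families can be seen in a short sketch, assuming R with the cluster package. A single agnes fit yields clusterings for several numbers of clusters, while pam needs k up front. Converting with as.hclust and cutting with cutree is one possible way to read clusterings off the hierarchy; the data are illustrative.

```r
library(cluster)  # assumed: R's cluster package provides pam() and agnes()

# Illustrative data: two groups of 10 and 15 objects.
set.seed(3)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))

# Partitioning: the number of clusters is an input.
part <- pam(x, 2)

# Hierarchical: one fit, then cut the hierarchy at several levels.
hier <- agnes(x)
cuts <- cutree(as.hclust(hier), k = 1:3)  # one column per number of clusters
```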
For datasets larger than (say) 200 observations,
pam will take a lot of computation time.
In that case, the function clara is preferable.
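A minimal sketch of switching to clara for a larger dataset, assuming R with the cluster package and the default sampling settings of clara; the data are simulated for the example.

```r
library(cluster)  # assumed: R's cluster package provides clara()

# Illustrative data: 1000 objects in two groups.
set.seed(7)
big <- rbind(cbind(rnorm(500, 0, 0.5), rnorm(500, 0, 0.5)),
             cbind(rnorm(500, 5, 0.5), rnorm(500, 5, 0.5)))

# clara works on samples of the data instead of all pairwise dissimilarities.
clx <- clara(big, 2)
table(clx$clustering)
```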
Kaufman, L. and Rousseeuw, P. J. (1990).
Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York.

Struyf, A., Hubert, M. and Rousseeuw, P. J. (1997).
Integrating robust clustering techniques in S-PLUS.
Computational Statistics and Data Analysis, 26, 17-37.
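# in R, pam() and daisy() are provided by the cluster package
library(cluster)

# generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))

# cluster the raw observations around 2 medoids
pamx <- pam(x, 2)
pamx
summary(pamx)
plot(pamx)

# cluster a precomputed dissimilarity matrix instead
pam(daisy(x, metric = "manhattan"), 2, diss = TRUE)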