Creates a bivariate plot visualizing a partition (clustering) of the data. All
observations are represented by points in the plot, using principal
components or multidimensional scaling.
An ellipse is drawn around each cluster.
REQUIRED ARGUMENTS:
x
data matrix or data frame, or dissimilarity matrix, depending on the value of
the diss argument.
In case of a data matrix or data frame, each row corresponds to an observation,
and each column corresponds to a variable. All variables must be numeric.
Missing values (NAs) are allowed. They are replaced by the median of the
corresponding variable. When some variables or some observations contain only
missing values, the function stops with a warning message.
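The median replacement described above can be sketched as follows. This is only an illustration in R-compatible S syntax; the helper name impute.median is hypothetical and clusplot's internal code may differ.

```r
# Hypothetical helper (not part of clusplot): replace each NA by the
# median of the non-missing values in the same column.
impute.median <- function(x) {
  x <- as.matrix(x)
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j])
    # A variable with only missing values cannot be imputed.
    if (all(miss)) stop("variable ", j, " contains only missing values")
    x[miss, j] <- median(x[!miss, j])
  }
  x
}

# Small made-up data matrix with one NA per column.
x <- cbind(a = c(1, NA, 3, 10), b = c(2, 4, NA, 6))
x.filled <- impute.median(x)
```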
In case of a dissimilarity matrix, x is the output of daisy or dist,
or a symmetric matrix.
A vector of length n*(n-1)/2 is also allowed
(where n is the number of observations),
and will be interpreted in the same way as the output
of the above-mentioned functions.
Missing values (NAs) are not allowed.
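As an illustration of the vector form, the sketch below takes the output of dist for four made-up points, treats it as a plain vector of length n*(n-1)/2, and rebuilds the symmetric matrix. It assumes the column-by-column lower-triangle ordering that dist uses in R; the variable names are chosen for this example only.

```r
# Four observations give 4*3/2 = 6 pairwise dissimilarities.
pts <- matrix(c(0, 0,  3, 4,  6, 8,  0, 5), ncol = 2, byrow = TRUE)
d <- dist(pts)              # lower triangle, stored column by column
d.vec <- as.vector(d)       # plain vector of length n*(n-1)/2

# Recover n from the vector length by solving n*(n-1)/2 = length(d.vec).
n <- (1 + sqrt(1 + 8 * length(d.vec))) / 2

# Rebuild the full symmetric matrix from the vector.
full <- matrix(0, n, n)
full[lower.tri(full)] <- d.vec
full <- full + t(full)
```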
clus
a vector of length n representing a clustering of x.
For each observation the vector lists the number
or name of the cluster to which it has been assigned.
clus is often the clustering component
of the output of pam, fanny or clara.
OPTIONAL ARGUMENTS:
diss
logical flag: if TRUE, then x will be considered as a dissimilarity matrix.
If FALSE, then x will be considered as a matrix of observations by variables.
cor
logical flag: this is only important when working with a data matrix or
data frame.
If TRUE, then the variables are scaled to have unit variance.
stand
logical flag: if TRUE, then the representations of the n observations in the
2-dimensional plot are standardized.
lines
integer: the currently available options are 0, 1 and 2.
This option is used to obtain an idea of the distances between ellipses.
The distance between two ellipses E1 and E2 is measured
along the line connecting the centers m1 and m2 of the two ellipses.
In case E1 and E2 overlap on the line through m1 and m2, no line is drawn.
Otherwise, the result depends on the value of the lines option.
If lines=0, no distance lines will appear on the plot.
If lines=1, then the line segment between m1 and m2 is drawn.
If lines=2, then a line segment between the boundaries of E1 and E2 is drawn
(along the line connecting m1 and m2).
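The distance between ellipse boundaries used by lines=2 can be sketched as below. This is a geometric illustration only, not clusplot's own code: it assumes each ellipse is parametrized as the set of points x with t(x - m) %*% solve(S) %*% (x - m) <= d2, and the helper name ellipse.gap and its arguments are hypothetical.

```r
# Hypothetical helper: gap between two ellipses, measured along the line
# through their centers m1 and m2.
# Each ellipse is assumed to be { x : t(x - m) %*% solve(S) %*% (x - m) <= d2 }.
ellipse.gap <- function(m1, S1, d2.1, m2, S2, d2.2) {
  v <- m2 - m1
  len <- sqrt(sum(v^2))          # distance between the centers
  u <- v / len                   # unit direction from m1 to m2
  # Distance from a center to its ellipse boundary along direction u.
  reach <- function(S, d2) sqrt(d2 / drop(t(u) %*% solve(S) %*% u))
  gap <- len - reach(S1, d2.1) - reach(S2, d2.2)
  if (gap <= 0) NA else gap      # NA: the ellipses overlap on this line
}

# Two circles of radius 1 and 2 whose centers are 5 apart: the gap is 5 - 1 - 2.
g <- ellipse.gap(c(0, 0), diag(2), 1, c(5, 0), diag(2), 4)

# Two unit circles whose centers are 1 apart overlap, so no gap is reported.
g.overlap <- ellipse.gap(c(0, 0), diag(2), 1, c(1, 0), diag(2), 1)
```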
shade
logical flag: if TRUE, then the ellipses are shaded in relation to their
density.
The density is the number of points in the cluster divided by the
area of the ellipse.
color
logical flag: if TRUE, then the ellipses are colored with respect to their
density.
With increasing density, the colors are light blue, light
green, red and purple.
To see these colors on the graphics device,
an appropriate color scheme should be selected in the menu
(we recommend a white background).
labels
integer: the currently available options are 0, 1, 2, 3, and 4.
If labels=0, then no labels are placed in the plot.
Using labels=1, points and ellipses can be identified in the plot
(see identify).
If labels=2, then all points and ellipses are labeled in the plot.
When labels=3, only the points are labeled in the plot.
Using labels=4, only the ellipses are labeled in the plot.
The levels of the vector clus are taken as labels for the clusters.
The labels of the points are the rownames of x
if x is a data frame or matrix.
When diss=TRUE and x is a vector,
point labels can be attached to x
as a "Labels" attribute (attr(x,"Labels")),
as is done for the output of daisy.
A possible "names" attribute of the vector clus
will not be taken into account.
plotchar
logical flag: if TRUE, then the plotting symbols differ for points belonging
to different clusters.
span
logical flag: if TRUE, then each cluster is represented by the ellipse with
smallest area containing all its points.
(This is a special case of the minimum volume ellipsoid.)
If FALSE, the ellipse is based on the average
and covariance matrix of the same points,
often yielding a much larger ellipse.
There are also some special cases. When a cluster consists of only one point,
a tiny circle is drawn around it. When the points of a cluster fall on a
straight line, span=FALSE draws a narrow ellipse around it and span=TRUE gives
the exact line segment.
Graphical parameters may also be supplied as arguments to this function
(see par).
VALUE:
an invisible list with components:
Distances
When option lines is 1 or 2, a k by k matrix
(k is the number of clusters).
The element at row j and column s is the distance
between ellipse j and ellipse s.
If lines=0, then the value of this component is NA.
Shading
A vector of length k (where k is the number of clusters), containing the
amount of shading per cluster. Let y be a vector where element i is the
ratio between the number of points in cluster i and the area of ellipse i.
When cluster i is a line segment, y[i] and the density of the cluster are
set to NA. Let z be the sum of all the elements of y without the NAs.
Then we put shading = y/z * 37 + 3.
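The shading formula above can be worked through on a small example. The point counts and ellipse areas below are made up for illustration, and the variable names are not those used inside clusplot.

```r
# Illustrative inputs (made up for this sketch): points per cluster and
# ellipse areas. The third cluster is a line segment, so its area and
# therefore its density are NA.
n.points <- c(10, 20, 5)
area     <- c(2, 8, NA)

y <- n.points / area            # density per cluster
z <- sum(y, na.rm = TRUE)       # sum of the non-missing densities
shading <- y / z * 37 + 3       # NA stays NA for the line segment
```

Because the non-missing values of y/z sum to 1, the non-missing shading values always add up to 37 plus 3 times the number of clusters that are not line segments.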
SIDE EFFECTS:
a visual display of the clustering is plotted on the current graphics device.
DETAILS:
clusplot uses the functions princomp and cmdscale.
These data reduction techniques represent the data in a bivariate plot.
Ellipses are then drawn to indicate the clusters.
The further layout of the plot is determined by the optional arguments.
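The reduction to two dimensions can be sketched as follows. The helper coords2d is hypothetical, and the choice of cmdscale for dissimilarities and princomp for a data matrix is an assumption made for this sketch; the actual code in clusplot may be organized differently.

```r
# Hypothetical helper: two-dimensional coordinates for plotting.
# Assumption: dissimilarities go through cmdscale, data matrices
# through princomp.
coords2d <- function(x, diss = FALSE, cor = FALSE) {
  if (diss) {
    cmdscale(x, k = 2)                       # classical multidimensional scaling
  } else {
    princomp(x, cor = cor)$scores[, 1:2]     # first two principal components
  }
}

# Small made-up data matrix: 6 observations, 3 variables.
x <- cbind(a = c(1, 2, 3, 4, 5, 6),
           b = c(2, 1, 4, 3, 6, 5),
           c = c(9, 7, 8, 4, 5, 1))
xy.data <- coords2d(x)                       # from the data matrix
xy.diss <- coords2d(dist(x), diss = TRUE)    # from dissimilarities
```

Each result has one row per observation and two columns, which are the coordinates of the points in the bivariate plot.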
NOTE:
When there are 4 or fewer clusters,
the option color=TRUE gives every cluster a different color.
When there are more than 4 clusters,
clusplot uses the function pam to cluster
the densities into 4 groups,
such that ellipses with nearly the same density get the same color.
REFERENCES:
Kaufman, L. and Rousseeuw, P. J. (1990).
Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley, New York.
Pison, G., Struyf, A. and Rousseeuw, P. J. (1997).
Displaying a Clustering with CLUSPLOT.
Technical Report, University of Antwerp, submitted.
Struyf, A., Hubert, M. and Rousseeuw, P. J. (1997).
Integrating robust clustering techniques in S-PLUS.
Computational Statistics and Data Analysis,
26, 17-37.
EXAMPLES:
# Plot votes.diss (a dissimilarity object) in a bivariate plot,
# partitioned into 2 clusters.
votes.diss <- daisy(votes.repub)
clusplot(votes.diss, pam(votes.diss, 2, diss=TRUE)$clustering, diss=TRUE,
         shade=TRUE, plotchar=TRUE)
# Stack the three slices of iris into one data matrix and plot it
# in a 2-dimensional plot, partitioned into 3 clusters.
iris.x <- rbind(iris[,,1], iris[,,2], iris[,,3])
clusplot(iris.x, pam(iris.x, 3)$clustering, diss=FALSE, plotchar=TRUE,
         color=TRUE)