centers
.
kmeans(x, centers, iter.max=10)
If centers is an integer, hclust and cutree will be used to get initial values.
If centers is a matrix, each row represents a cluster center, and thus centers must have the same number of columns as x.
The number of rows in centers (there must be at least two) is the number of clusters that will be formed.
Missing values are not accepted.
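The hclust/cutree initialization described above can be sketched by hand. This is a minimal illustration, assuming R's stats::kmeans rather than the S-PLUS function documented here (in R, an integer centers picks random rows of x instead of using hclust and cutree). The value k = 3 and the names tree, groups and init are hypothetical, and R's iris is a data frame, so only its numeric columns are used.

```r
# Sketch: build starting centers from a hierarchical clustering,
# then pass them to kmeans as a matrix with one row per cluster.
x <- as.matrix(iris[, 1:4])      # numeric columns only (R's iris is a data frame)
k <- 3                           # hypothetical number of clusters

tree   <- hclust(dist(x))        # hierarchical clustering on Euclidean distances
groups <- cutree(tree, k = k)    # group label (1..k) for each row of x

# Column means within each group: a k-by-ncol(x) matrix of initial centers
init <- rowsum(x, groups) / as.vector(table(groups))

# centers must have the same number of columns as x and at least two rows
stopifnot(ncol(init) == ncol(x), nrow(init) >= 2)

km <- kmeans(x, centers = init, iter.max = 10)
```

Passing the matrix explicitly keeps the starting point reproducible, which a random choice of rows would not.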
kmeans returns an object with the following components:
cluster: integers ranging from 1 to nrow(centers), with length the same as the number of rows of x. The i-th value indicates the cluster to which the i-th data point belongs.
centers: a matrix containing the locations of the final cluster centers. Each row is a cluster center location.
withinss: a vector of length nrow(centers). The i-th value gives the within-cluster sum of squares for the i-th cluster.
size: a vector of length nrow(centers). The i-th value gives the number of data points in cluster i.
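The shapes of the returned components can be checked directly. The sketch below assumes R's stats::kmeans, whose result also carries the components cluster, centers, withinss and size; the starting rows 1, 51 and 101 are an arbitrary, hypothetical choice.

```r
# Sketch: inspect the components of a kmeans result and check their shapes.
x  <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = x[c(1, 51, 101), ], iter.max = 10)  # hypothetical starting rows

k <- nrow(km$centers)
stopifnot(
  length(km$cluster) == nrow(x),                     # one label per row of x
  all(km$cluster %in% seq_len(k)),                   # labels lie in 1..k
  ncol(km$centers) == ncol(x),                       # one column per variable
  length(km$withinss) == k,                          # one sum of squares per cluster
  length(km$size) == k,                              # one count per cluster
  all(km$size == tabulate(km$cluster, nbins = k))    # size agrees with the labels
)
```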
The objective is to find a partition of the observations into nrow(centers) groups that minimizes sum(withinss).
To actually guarantee the minimum would be computationally infeasible in many
settings; this function finds a local minimum, that is, a solution such
that there is no single switch of an observation from one group
to another group that will decrease the objective.
The procedure used to reach the local minimum is rather complex; see
Hartigan and Wong (1979) for details.
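To make the objective concrete, the within-cluster sums of squares can be recomputed from their definition and compared with the withinss component. This sketch assumes R's stats::kmeans; the two sets of starting rows are hypothetical and are only meant to show that different starts can be compared by their objective values, since each run is only guaranteed to reach a local minimum.

```r
# Sketch: recompute the objective sum(withinss) from its definition.
x  <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = x[c(1, 51, 101), ])   # hypothetical starting rows

wss <- sapply(seq_len(nrow(km$centers)), function(j) {
  members <- x[km$cluster == j, , drop = FALSE]   # rows assigned to cluster j
  sum(sweep(members, 2, km$centers[j, ])^2)       # squared distances to center j
})
stopifnot(isTRUE(all.equal(wss, km$withinss)))

objective <- sum(km$withinss)

# A second, deliberately poor start (three neighbouring rows); the two
# objective values can be compared to pick the better local minimum.
km2 <- kmeans(x, centers = x[1:3, ])
objectives <- c(first = objective, second = sum(km2$withinss))
```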
It may be necessary to scale the columns of x in order for the clustering
to be sensible. The larger a variable's variance, the more important it will
be to the clustering.
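One common way to scale the columns is to standardize each to unit variance before clustering. The sketch below assumes R's scale function and stats::kmeans; the choice of standardization (rather than, say, scaling by range) and the starting rows are assumptions made for illustration.

```r
# Sketch: standardize the columns of x so that no variable dominates
# the clustering merely because of its larger variance.
x <- as.matrix(iris[, 1:4])

raw_var <- apply(x, 2, var)      # column variances before scaling
xs <- scale(x)                   # center each column and divide by its sd

# After scaling, every column has variance 1
stopifnot(isTRUE(all.equal(as.vector(apply(xs, 2, var)), rep(1, ncol(xs)))))

start  <- c(1, 51, 101)          # hypothetical starting rows
km_raw <- kmeans(x,  centers = x[start, ])
km_std <- kmeans(xs, centers = xs[start, ])

# Cross-tabulate the two partitions to see where they disagree
agreement <- table(raw = km_raw$cluster, scaled = km_std$cluster)
```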
When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests
the following rough rule of thumb.
If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the
extra group when
(sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k-1)
is greater than 10. In the last factor, k denotes the number of groups, not the fitted object.
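The rule of thumb can be evaluated as follows. This is a sketch assuming R's stats::kmeans, where an integer centers draws random starting rows and the nstart argument repeats the fit from several starts; the names fit_k, fit_kplus1 and hartigan are hypothetical, chosen so that the group count k and the fitted objects do not share a name.

```r
# Sketch: Hartigan's rule of thumb for moving from k to k + 1 groups.
x <- as.matrix(iris[, 1:4])
k <- 2                                   # number of groups in the smaller fit

set.seed(1)                              # random starts in R, so fix the seed
fit_k      <- kmeans(x, centers = k,     nstart = 20)
fit_kplus1 <- kmeans(x, centers = k + 1, nstart = 20)

# (SS_k / SS_{k+1} - 1) * (n - k - 1), with n the number of observations
hartigan <- (sum(fit_k$withinss) / sum(fit_kplus1$withinss) - 1) *
  (nrow(x) - k - 1)

add_group <- hartigan > 10               # TRUE suggests the extra group is justified
```

Because each fit is only a local minimum, the statistic depends on the starting centers; using several starts makes the comparison less sensitive to an unlucky initialization.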
Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.
Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100-108.
irismean <- t(apply(iris, c(2, 3), 'mean'))
x <- rbind(iris[,,1], iris[,,2], iris[,,3])
km <- kmeans(x, irismean)
wrong <- km$cluster!=rep(1:3, c(50, 50, 50))
spin(x, highlight=wrong)
plot(x[,2], x[,3], type="n")
text(x[!wrong, 2], x[!wrong, 3], km$cluster[!wrong]) # identify cluster membership that is correct
points(x[wrong, 2], x[wrong, 3], pch=15) # boxes for points in error
title(main="K-Means Clustering of the Iris Data")