Clusters the rows of x into groups by k-means, starting from the cluster centers given by centers.
kmeans(x, centers, iter.max=10)
The centers argument is either an integer giving the number of clusters or a matrix of initial cluster centers.
If centers is an integer, hclust and cutree will be used to get initial values.
If centers is a matrix, each row represents a cluster center, and thus centers must have the same number of columns as x.
The number of rows in centers (there must be at least two) is the number of clusters that will be formed.
Missing values are not accepted.
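To make the two forms of centers concrete, the sketch below builds a matrix of starting centers by hand from hclust and cutree, then passes it to kmeans. It is written for R, whose kmeans takes the same x, centers and iter.max arguments; the helper name initial_centers and the toy data are hypothetical, and the initialization performed inside the function may differ in detail.

```r
# Hypothetical helper: derive k starting centers from a hierarchical clustering.
initial_centers <- function(x, k) {
  x <- as.matrix(x)
  stopifnot(k >= 2, !anyNA(x))     # at least two clusters, no missing values
  groups <- cutree(hclust(dist(x)), k = k)
  # rowsum() adds up the rows of x within each group; dividing by the
  # group sizes turns those sums into group means.
  rowsum(x, groups) / as.vector(table(groups))
}

# Toy data: two well-separated groups in two dimensions.
x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))

start <- initial_centers(x, 2)     # one row per cluster, ncol(start) == ncol(x)
fit <- kmeans(x, centers = start, iter.max = 10)
```

Supplying the matrix explicitly keeps the starting point reproducible, which is convenient when comparing runs.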
An object of class kmeans with the following components:
The cluster component is a vector of integers ranging from 1 to nrow(centers), with length equal to the number of rows of x.
The ith value indicates the cluster to which the ith data point belongs.
The centers component is a matrix containing the locations of the final cluster centers.
Each row is a cluster center location.
The withinss component is a vector of length nrow(centers).
The ith value gives the within-cluster sum of squares for the ith cluster.
The size component is a vector of length nrow(centers).
The ith value gives the number of data points in cluster i.
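As a rough illustration of how these components relate to each other, the following R sketch fits a small made-up data set and recomputes the within-cluster sums of squares from the cluster assignments and the final centers. The data and starting centers are hypothetical, and R's kmeans is assumed to return the components described above.

```r
# Toy data: two tight groups of three points each.
x <- rbind(c(0, 0), c(0, 2), c(2, 0), c(10, 10), c(10, 12), c(12, 10))
start <- rbind(c(0, 0), c(10, 10))      # one row per starting cluster center
fit <- kmeans(x, centers = start)

fit$cluster    # cluster index for each row of x
fit$centers    # final center locations, one row per cluster
fit$size       # number of points in each cluster

# Recompute the within-cluster sums of squares from the other components.
by_hand <- sapply(seq_len(nrow(fit$centers)), function(i) {
  members <- x[fit$cluster == i, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[i, ])^2)   # squared distances to center i
})
all.equal(by_hand, as.numeric(fit$withinss))
```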
The objective is to find a partition of the observations into nrow(centers) groups that minimizes sum(withinss).
To actually guarantee the minimum would be computationally infeasible in many
settings; this function finds a local minimum, that is, a solution such
that there is no single switch of an observation from one group
to another group that will decrease the objective.
The procedure used to achieve the local minimum is rather complex - see
Hartigan and Wong (1979) for details.
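The local-minimum property can be checked by brute force on small data. The R sketch below is not the Hartigan and Wong procedure; it only tests the stated condition, namely that no single switch of one observation lowers the total within-cluster sum of squares. The function names objective and is_single_switch_minimum are hypothetical.

```r
# Total within-cluster sum of squares for a given assignment of rows to groups.
objective <- function(x, cluster) {
  sum(sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    members <- x[idx, , drop = FALSE]
    sum(sweep(members, 2, colMeans(members))^2)
  }))
}

# TRUE if no single switch of one observation lowers the objective.
is_single_switch_minimum <- function(x, cluster, tol = 1e-8) {
  current <- objective(x, cluster)
  for (i in seq_len(nrow(x))) {
    # Skip observations that are alone in their group, so no group is emptied.
    if (sum(cluster == cluster[i]) < 2) next
    for (g in setdiff(unique(cluster), cluster[i])) {
      trial <- cluster
      trial[i] <- g
      if (objective(x, trial) < current - tol) return(FALSE)
    }
  }
  TRUE
}

x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
good <- c(1, 1, 1, 2, 2, 2)   # the natural partition
bad  <- c(1, 1, 2, 2, 2, 2)   # third point placed in the far group

is_single_switch_minimum(x, good)
is_single_switch_minimum(x, bad)

fit <- kmeans(x, centers = x[c(1, 4), ])
is_single_switch_minimum(x, fit$cluster)
```

The check recomputes the objective for every candidate switch, so it is only practical for small examples.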
It may be necessary to scale the columns of
x in order for the clustering
to be sensible. The larger a variable's variance, the more important it will
be to the clustering.
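A small made-up example shows the effect of scaling. In the R sketch below, the second column has a much larger variance than the first and drives the unscaled clustering; after scale() both columns have unit variance and the grouping follows the first column instead. The data, column names and choice of starting rows are hypothetical.

```r
# Toy data: column "a" has two tight groups, column "b" has a far larger spread.
x <- cbind(a = c(1, 1.1, 5, 5.1),
           b = c(100, 300, 200, 400))
apply(x, 2, var)                 # b dominates the Euclidean distances

# Unscaled fit, starting from rows 1 and 4.
fit_raw <- kmeans(x, centers = x[c(1, 4), ])

# scale() centers each column and divides it by its standard deviation.
x_scaled <- scale(x)
apply(x_scaled, 2, var)          # both columns now have variance 1

fit_scaled <- kmeans(x_scaled, centers = x_scaled[c(1, 4), ])

fit_raw$cluster                  # grouping driven by column b
fit_scaled$cluster               # grouping after scaling
```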
When deciding on the number of clusters, Hartigan (1975, pp 90-91) suggests
the following rough rule of thumb.
If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when
(sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1)
is greater than 10.
In the first factor, k is the fitted object; in the second factor, k is the number of groups.
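The rule of thumb can be wrapped in a small function. The R sketch below uses a hypothetical helper, hartigan_statistic, and takes the number of groups from the length of withinss so that the fitted object and the group count do not share a name. The sums of squares in the worked example are made up for the arithmetic only.

```r
# Hypothetical helper implementing the rule of thumb quoted above.
# fit_k and fit_kplus1 are kmeans results with k and k+1 groups; n is nrow(x).
hartigan_statistic <- function(fit_k, fit_kplus1, n) {
  k <- length(fit_k$withinss)                      # groups in the smaller fit
  stopifnot(length(fit_kplus1$withinss) == k + 1)
  (sum(fit_k$withinss) / sum(fit_kplus1$withinss) - 1) * (n - k - 1)
}

# Worked example with made-up sums of squares for n = 103 observations.
fit2 <- list(withinss = c(60, 40))        # total 100 with 2 groups
fit3 <- list(withinss = c(20, 20, 10))    # total 50 with 3 groups

# (100 / 50 - 1) * (103 - 2 - 1)
stat <- hartigan_statistic(fit2, fit3, n = 103)
stat > 10    # the rule favors the extra group when this is TRUE
```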
Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.
Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100-108.
# initial centers: the mean of each variable within each species
irismean <- t(apply(iris, c(2, 3), 'mean'))
# stack the three species slices into one data matrix
x <- rbind(iris[,,1], iris[,,2], iris[,,3])
km <- kmeans(x, irismean)
# flag observations whose cluster differs from their species
wrong <- km$cluster != rep(1:3, c(50, 50, 50))
spin(x, highlight=wrong)
plot(x[,2], x[,3], type="n")
# identify cluster membership that is correct
text(x[!wrong, 2], x[!wrong, 3], km$cluster[!wrong])
# boxes for points in error
points(x[wrong, 2], x[wrong, 3], pch=15)
title(main="K-Means Clustering of the Iris Data")