Hartigan's K-Means Clustering

DESCRIPTION:

Returns a list representing a clustering of the data with as many groups as there are rows in centers.

USAGE:

kmeans(x, centers, iter.max=10) 

REQUIRED ARGUMENTS:

x
matrix of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values are not accepted.
centers
matrix of initial guesses for the cluster centers, or integer giving the number of clusters. If centers is an integer, hclust and cutree will be used to get initial values. If centers is a matrix, each row represents a cluster center, and thus centers must have the same number of columns as x. The number of rows in centers, (there must be at least two), is the number of clusters that will be formed. Missing values are not accepted.

OPTIONAL ARGUMENTS:

iter.max
maximum number of iterations.

VALUE:

An object of class kmeans with the following components:
cluster
vector of integers, ranging from 1 to nrow(centers), with length the same as the number of rows of x. The ith value indicates the cluster in which the ith data point belongs.
centers
matrix like the input centers containing the locations of the final cluster centers. Each row is a cluster center location.
withinss
vector of length nrow(centers). The ith value gives the within cluster sum of squares for the ith cluster.
size
vector of length nrow(centers). The ith value gives the number of data points in cluster i.

DETAILS:

The object is to find a partition of the observations with nrow(centers) groups that minimizes sum(withinss). To actually guarantee the minimum would be computationally infeasible in many settings; this function finds a local minimum, that is, a solution such that there is no single switch of an observation from one group to another group that will decrease the objective. The procedure used to achieve the local minimum is rather complex - see Hartigan and Wong (1979) for details.

It may be necessary to scale the columns of x in order for the clustering to be sensible. The larger a variable's variance, the more important it will be to the clustering.

When deciding on the number of clusters, Hartigan (1975, pp 90-91) suggests the following rough rule of thumb. If k is the result of kmeans with k groups and kplus1 is the result with k+1 groups, then it is justifiable to add the extra group when

(sum(k$withinss)/sum(kplus1$withinss)-1)*(nrow(x)-k -1)

is greater than 10.

REFERENCES:

Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.

Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100-108.

SEE ALSO:

, .

EXAMPLES:

irismean <- t(apply(iris, c(2, 3), 'mean')) 
x <- rbind(iris[,,1], iris[,,2], iris[,,3]) 
km <- kmeans(x, irismean)  
wrong <- km$cluster!=rep(1:3, c(50, 50, 50)) 
spin(x, highlight=wrong) 
plot(x[,2], x[,3], type="n") 
text(x[!wrong, 2], x[!wrong, 3], km$cluster)   
   # identify cluster membership that is correct 
points(x[wrong, 2], x[wrong, 3], pch=15) 
   # boxes for points in error 
title(main="K-Means Clustering of the Iris Data")