Big Data K-Means Clustering

DESCRIPTION:

Returns a list representing a clustering of the data into k clusters.

This function requires the bigdata library section to be loaded.

USAGE:

bdCluster(x, columns=NULL, k=10, iter.max=10, retain=10000,
    start="firstSample")

REQUIRED ARGUMENTS:

x
bdFrame, data.frame, or matrix of values to be clustered.

OPTIONAL ARGUMENTS:

columns
names of columns to use in clustering. Default is to use all columns.
k
number of clusters. Default is 10.
iter.max
maximum number of iterations. Default is 10.
retain
number of rows in the retained set. Default is 10000. As each block of data is processed, observations that do not cluster well are kept in the retained set. At the next step in the algorithm, these observations are added to the next block of data, and the K-means clustering is run on the combined set.
start
method for selecting starting values for centers. Default is "firstSample". Specify "firstSample" to use a random sample of k rows from the first block of data as the initial centers. Specify "kPoints" to use the first k unique rows of data as the initial centers. Specify "hClustFirstBlock" to compute the initial centers from the first block of the data set using the hierarchical clustering method. Specify "entireSample" to compute the initial centers from a sample of the entire data set using the hierarchical clustering method.

VALUE:

An object of class bdCluster with the following components:
centers
matrix containing the locations of the final cluster centers. Each row is a cluster center location.
sizes
vector giving the number of observations assigned to each cluster.
call
the call to bdCluster.
bdModel
a bdModel object used by predict.bdCluster to compute predictions on new data.
bdPredictions
a bdFrame containing cluster membership and distance from cluster center for each row.

DETAILS:

K-means is one of the most widespread clustering methods. It was originally developed for situations in which all variables are continuous, and the Euclidean distance is chosen as the measure of dissimilarity. There are several variants of the K-means clustering algorithm, but most variants involve an iterative scheme that operates over a fixed number of clusters while attempting to satisfy the following properties:

* Each class has a center which is the mean position of all the samples in that class.

* Each object is in the class whose center it is closest to.

The Big Data library clustering function applies a K-means algorithm that computes the cluster centers in a single scan of the data set, using a fixed-size buffer to hold points from the data set. Categorical data is handled by expanding each categorical column into m indicator columns, where m is the number of unique categories in the column. The K-means algorithm selects k of the objects, each of which initially represents a cluster mean or centroid. Each of the remaining objects is assigned to the cluster it resembles the most, based on the distance of the object from the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the centers converge or iter.max iterations have been performed. A second scan through the data then assigns each observation to the cluster it is closest to, where closeness is measured by the Euclidean distance.
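
The indicator expansion and the final assignment step can be illustrated on a small in-memory example. The following is only a sketch using base functions, not the Big Data library's block-wise implementation: the data frame dat, its column names, and the two centers are hypothetical values chosen for illustration, and the code was written against R, so minor differences may exist in other S dialects.

```r
# Hypothetical toy data: two numeric columns and one categorical column.
dat <- data.frame(
    x1  = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
    x2  = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9),
    grp = factor(c("a", "a", "b", "c", "c", "b"))
)

# Expand the categorical column into m indicator columns
# (m = 3 here, one column per unique category).
lev <- levels(dat$grp)
ind <- outer(as.character(dat$grp), lev, "==") * 1
dimnames(ind) <- list(NULL, lev)
X <- cbind(x1 = dat$x1, x2 = dat$x2, ind)

# Two hypothetical cluster centers in the expanded 5-column space.
centers <- rbind(c(1, 1, 0.67, 0.33, 0),
                 c(5, 5, 0, 0.33, 0.67))

# Squared Euclidean distance from every row to every center.
d2 <- matrix(0, nrow(X), nrow(centers))
for (j in 1:nrow(centers)) {
    d2[, j] <- rowSums(sweep(X, 2, centers[j, ])^2)
}

# Assign each row to its closest center and record the distance.
member   <- apply(d2, 1, which.min)
distance <- sqrt(d2[cbind(1:nrow(X), member)])
```

Here member and distance play a role similar to the cluster membership and distance columns described for bdPredictions, although the actual column layout of that bdFrame is not shown here.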

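When you perform K-means clustering, the maximum number of iterations you specify (iter.max) affects how close the final cluster centers are to convergence. Allowing more iterations gives the algorithm more opportunity to converge; once the centers have converged, additional iterations do not change the result.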
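
The effect of the iteration count can be seen by tracking the within-cluster sum of squares after each pass. The following is a minimal sketch of the basic assign-and-update iteration on hypothetical one-dimensional data with deliberately poor starting centers; it uses base functions only and does not reproduce the block processing or retained set used by bdCluster.

```r
# Hypothetical toy data: eight points on a line forming two groups.
X <- matrix(c(0, 1, 2, 3, 10, 11, 12, 13), ncol = 1)

# Deliberately poor starting centers.
centers <- matrix(c(0, 1), ncol = 1)

wss <- numeric(0)   # within-cluster sum of squares after each iteration
for (iter in 1:5) {
    # Assign each point to the closer of the two centers.
    d2 <- cbind((X[, 1] - centers[1, 1])^2,
                (X[, 1] - centers[2, 1])^2)
    member <- ifelse(d2[, 1] <= d2[, 2], 1, 2)

    # Move each center to the mean of the points assigned to it.
    for (j in 1:2) {
        centers[j, 1] <- mean(X[member == j, 1])
    }

    wss <- c(wss, sum((X[, 1] - centers[member, 1])^2))
}
wss
```

In this example the sum of squares drops sharply between the first and second iterations and then stays constant, because the centers have stopped moving. A value of iter.max that is too small would stop the process before that point.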

REFERENCES:

Hartigan, J. A. (1975). Clustering Algorithms. New York: Wiley.

Hartigan, J. A. and Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28, 100-108.

EXAMPLES:

x <- bdCluster(state.x77, k=4)