Compute Dissimilarities

The input to a clustering method can be either a data set containing rows of observations or a dissimilarity object storing measures of dissimilarity between observations. K-means, partitioning around medoids using the large data algorithm, and monothetic clustering operate only on a data set. Partitioning around medoids, fuzzy clustering, and the hierarchical methods accept either a data set or a dissimilarity object.

The clustering routines themselves do not accept non-numeric variables. If a data set contains non-numeric variables such as factors, either these must be converted to numeric variables or dissimilarities must be used.

How we compute the dissimilarity between two objects depends on the type of the original variables. By default, numeric columns are treated as interval-scaled variables, factors are treated as nominal variables, and ordered factors are treated as ordinal variables.
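As a rough illustration of why the variable type matters, the sketch below scores interval and nominal variables differently and then averages the per-variable contributions. The averaging scheme (in the style of Gower's coefficient), the range scaling, and all function names are assumptions made for illustration; the text above does not state the formula the software uses, and ordinal variables are left out for brevity.

```python
def interval_contribution(x, y, value_range):
    """Absolute difference scaled by the variable's range (assumed scaling)."""
    return abs(x - y) / value_range if value_range else 0.0

def nominal_contribution(x, y):
    """Simple mismatch: 0 if the categories agree, 1 otherwise."""
    return 0.0 if x == y else 1.0

def mixed_dissimilarity(row_a, row_b, kinds, ranges):
    """Average the per-variable contributions of two observations.

    kinds[i] is "interval" or "nominal"; ranges[i] is used only for
    interval variables.
    """
    total = 0.0
    for a, b, kind, rng in zip(row_a, row_b, kinds, ranges):
        if kind == "interval":
            total += interval_contribution(a, b, rng)
        else:
            total += nominal_contribution(a, b)
    return total / len(kinds)

# Hypothetical data: one numeric column with range 10 and one factor column.
kinds = ["interval", "nominal"]
ranges = [10.0, None]
d = mixed_dissimilarity([2.0, "red"], [7.0, "blue"], kinds, ranges)
```

Here the numeric column contributes 5/10 and the factor column contributes 1 because the categories differ, so the averaged dissimilarity is 0.75.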

To calculate dissimilarities

Choose Statistics > Cluster Analysis > Compute Dissimilarities. The dialog shown below appears.

[Figure: the Compute Dissimilarities dialog]

The Compute Dissimilarities dialog has the following options:

Data

Data Set

Select a data set from the dropdown list or type the name of a data set. You can also type into the Data Set edit field any expression that evaluates to a data set.

Clustering Variables

Select numeric variables from the dropdown list. If your data set contains factor variables, use the Compute Dissimilarities dialog to create dissimilarity objects to be used in the cluster analysis. However, dissimilarity objects cannot be used in K-means or monothetic clustering.

Subset Rows

Enter an S-PLUS expression that identifies the rows to use in the analysis. To use all the rows in the data set, leave this field blank.

Omit Rows with Missing Values

Select this box to omit from the analysis any rows in the data set that contain missing values for any of the selected variables.
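The effect of this option can be sketched as a simple row filter. This is only an illustration in Python, not S-PLUS code; the function name, the dictionary-based row layout, and the treatment of None and NaN as "missing" are assumptions.

```python
def omit_rows_with_missing(rows, columns):
    """Keep only rows with no missing value (None or NaN) in the given columns."""
    def is_missing(v):
        # NaN is the only float value that is not equal to itself.
        return v is None or (isinstance(v, float) and v != v)
    return [r for r in rows if not any(is_missing(r[c]) for c in columns)]

# Hypothetical data: only "height" and "weight" are used in the analysis.
rows = [
    {"height": 1.7, "weight": 65.0, "note": None},
    {"height": None, "weight": 70.0, "note": "x"},
    {"height": 1.8, "weight": float("nan"), "note": "y"},
]
kept = omit_rows_with_missing(rows, ["height", "weight"])
```

Only the first row is kept: its missing value is in a column that is not part of the analysis, while the other two rows are missing a value in one of the variables used.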

Dissimilarity Measure

Metric

Select the metric to be used for calculating dissimilarities between objects. The available options are euclidean and manhattan. The euclidean distance is the square root of the sum of squared differences, and the manhattan distance is the sum of absolute differences. If Data Set is already a dissimilarity matrix, this option is ignored.
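The two metrics can be written out directly from these definitions. The following is a minimal Python sketch for two numeric observations; the function names are illustrative and are not part of S-PLUS.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = [1.0, 2.0], [4.0, 6.0]
d_euclid = euclidean(p, q)  # sqrt(3^2 + 4^2)
d_manhat = manhattan(p, q)  # |3| + |4|
```

For these two points the differences are 3 and 4, so the euclidean distance is 5 and the manhattan distance is 7.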

Standardize Variables

Select this option to standardize each data column by subtracting the variable's mean and dividing by the variable's mean absolute deviation. If Data Set is already a dissimilarity matrix, this option is ignored.
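The standardization described here can be sketched for a single column as follows. This is an illustrative Python version, not the S-PLUS implementation; the function name and the handling of a constant column (returning zeros) are assumptions.

```python
def standardize(column):
    """Subtract the mean, then divide by the mean absolute deviation."""
    n = len(column)
    mean = sum(column) / n
    mad = sum(abs(v - mean) for v in column) / n
    if mad == 0:
        # Assumption: a constant column is mapped to zeros to avoid dividing by 0.
        return [0.0 for _ in column]
    return [(v - mean) / mad for v in column]

z = standardize([1.0, 2.0, 3.0, 6.0])
```

In this example the mean is 3 and the mean absolute deviation is (2 + 1 + 0 + 3) / 4 = 1.5, so the largest value, 6, is standardized to 2.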

Special Variable

Ordinal Ratio

Select ratio-scaled variables to be treated as ordinal variables.

Log Ratio

Select ratio-scaled variables to be log-transformed before dissimilarities are computed.

Asymmetric Binary

Select variables to be treated as asymmetric binary variables.
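For asymmetric binary variables, the two outcomes are not equally informative: a shared 1 (for example, both observations have a rare attribute) says more than a shared 0. One common way to reflect this is a Jaccard-style dissimilarity that ignores joint absences. The sketch below assumes that convention; the text above does not state the exact formula the software uses, and the function name is illustrative.

```python
def asymmetric_binary_dissimilarity(a, b):
    """Jaccard-style dissimilarity for 0/1 vectors; (0, 0) pairs are ignored."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    joint_presences = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    compared = mismatches + joint_presences
    # Assumption: two all-zero vectors are given dissimilarity 0.
    return mismatches / compared if compared else 0.0

d_asym = asymmetric_binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0])
```

The two vectors disagree in two positions and share one 1, so the dissimilarity is 2/3; the two positions where both are 0 do not enter the calculation.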

Save Model Object

In the Save As field, enter the name for the object in which to save the results of the analysis. If an object with this name already exists, its contents are overwritten. The model object can be used in later functions such as plotting.

Related programming language functions

daisy