asa136


asa136, a C code which divides M points in N dimensions into K clusters, seeking to minimize the within-cluster sum of squares, using the K-means algorithm of Hartigan and Wong.

This is Applied Statistics Algorithm 136.

In the K-Means problem, a set of M points X(I) in N dimensions is given. The goal is to arrange these points into K clusters, with each cluster having a representative point Z(J), usually chosen as the centroid of the points in the cluster. The energy of each cluster is

        E(J) = Sum ( all points X(I) in cluster J ) || X(I) - Z(J) ||^2
      

For a given set of clusters, the total energy is then simply the sum of the cluster energies E(J). The goal is to choose the clusters in such a way that the total energy is minimized. Usually, a point X(I) goes into the cluster with the closest representative point Z(J). So to define the clusters, it's enough simply to specify the locations of the cluster representatives.

This is actually a fairly hard problem. Most algorithms do reasonably well, but cannot guarantee that the best solution has been found. It is very common for an algorithm to get stuck at a solution which is merely a "local minimum". At such a local minimum, no slight rearrangement of the solution lowers the energy; however, a major rearrangement could still result in a big drop in energy.

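A simple algorithm for the problem is known as "H-Means". It alternates between two procedures:

  1. Using the current cluster representatives, assign each point to the cluster with the nearest representative.
  2. Using the current cluster assignments, replace each cluster representative by the centroid of the points in the cluster.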

These steps are repeated until no points are moved, or some other termination criterion is reached.

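A more sophisticated algorithm, known as "K-Means", takes advantage of the fact that it is possible to quickly determine the decrease in energy caused by moving a point from its current cluster to another. It repeats the following procedure: for each point, move it to another cluster if that would lower the total energy and, if the point is moved, immediately update the representatives of the two affected clusters.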

This procedure is repeated until no points are moved, or some other termination criterion is reached.

Licensing:

The computer code and data files described and made available on this web page are distributed under the MIT license.

Languages:

asa136 is available in a C version, a C++ version, a FORTRAN90 version, and a MATLAB version.

Related Data and Programs:

asa058, a C code which carries out the K-means algorithm for clustering data.

asa113, a C code which implements the Banfield and Bassill clustering algorithm using transfers and swaps.

asa136_test

cities, a dataset directory which contains a number of city distance datasets.

spaeth, a dataset directory which contains test data for clustering.

spaeth2, a dataset directory which contains test data for clustering.

Author:

Original FORTRAN77 version by John Hartigan, Manchek Wong; C version by John Burkardt.

Reference:

  1. John Hartigan, Manchek Wong,
    Algorithm AS 136: A K-Means Clustering Algorithm,
    Applied Statistics,
    Volume 28, Number 1, 1979, pages 100-108.
  2. Wendy Martinez, Angel Martinez,
    Computational Statistics Handbook with MATLAB,
    Chapman and Hall / CRC, 2002,
    pages 373-376.
  3. David Sparks,
    Algorithm AS 58: Euclidean Cluster Analysis,
    Applied Statistics,
    Volume 22, Number 1, 1973, pages 126-130.

Source Code:


Last revised on 27 May 2019.