Energy Minimization and Clustering


This project began as an offshoot of the Voronoi project. Given a set of abstract data values, which we can think of as vectors in a multidimensional space, we want to try to organize them into clusters. If we specify that we want to use K clusters, then there is a clustering that minimizes the "energy", a measure of how far the points lie from the centers of their clusters.

To be specific, we suppose we have a set of N points X(I), which we want to organize into K clusters C(J) with "centers" or "representative points" Z(J). This results in an "energy"

    E(K) = sum ( for each cluster C(J) ) 
             sum ( for each point X(I) in C(J) ) 
               distance ( X(I), Z(J) )**2
  
This energy is essentially the sum of the squares of the lengths of the "roads" from each center point to its "suburbs". Of course, for a given number of clusters K, there are many possible clusterings, but for a fixed K, there must be a minimum possible energy value Emin(K). Emin(K) is a monotonically decreasing function of K (strictly speaking, nonincreasing), and in particular, Emin(N) = 0, since each point can then serve as the center of its own cluster.
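As a small illustration of the definition, the following MATLAB sketch evaluates the energy of one particular clustering of four points on a line. The data, the centers Z, and the assignment vector "cluster" are made-up values chosen so the arithmetic is easy to check by hand; they do not come from the sample data sets.

```matlab
% Four 1D points (one per row) and two hypothetical cluster centers.
X = [ 0.0; 1.0; 4.0; 5.0 ];
Z = [ 0.5; 4.5 ];

% cluster(i) is the index J of the cluster that holds point X(i).
cluster = [ 1; 1; 2; 2 ];

% Energy: sum of squared distances from each point to its cluster center.
E = sum ( sum ( ( X - Z(cluster,:) ).^2 ) );

% With K = N, every point can be its own center, so the energy is zero.
E_N = sum ( sum ( ( X - X ).^2 ) );

fprintf ( 'E(2) = %g, E(N) = %g\n', E, E_N );
```

Each point lies at distance 0.5 from its center, so the energy of this clustering is 4 * 0.25 = 1.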

For our first example, we just want to try to sketch out some code that will work, and make sure it runs properly. To start with, let's assume we've picked a value for K, and we want to compute the Voronoi clusters. The code might look something like this:

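    function [ C, Energy ] = K_Clusters ( K, X )
    % X is N-by-D, one point per row; C is K-by-D, one center per row.

      N = size ( X, 1 );
      C = X(1:K,:);            % initialize with the first K points (K <= N)
      Energy = Inf;
      tol = 1.0e-8;            % what counts as "small" is an arbitrary choice
      D2 = zeros ( N, K );

      while ( true )

        % For each point, find the nearest center.
        for j = 1 : K
          D2(:,j) = sum ( ( X - repmat ( C(j,:), N, 1 ) ).^2, 2 );
        end
        [ ~, nearest ] = min ( D2, [], 2 );

        % Replace each center by the average of the points nearest to it.
        for j = 1 : K
          if ( any ( nearest == j ) )      % leave a center with no points alone
            C(j,:) = mean ( X(nearest==j,:), 1 );
          end
        end

        % Compute the new energy, and quit if it changed only slightly.
        Energy_old = Energy;
        Energy = sum ( sum ( ( X - C(nearest,:) ).^2 ) );
        if ( abs ( Energy_old - Energy ) <= tol * max ( 1.0, Energy ) ), break, end

      end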
  
Try this code on sample data set 1. Try making a plot of the energy as you vary K from 1 to 10.
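The energy plot could be produced along the following lines. This is only a sketch: the 12 points in X are a made-up stand-in for sample data set 1, the clustering iteration outlined above is written inline so that the script runs on its own, and a fixed count of 50 iterations (an arbitrary choice) replaces the "small change" test.

```matlab
% Hypothetical stand-in for sample data set 1: 12 points, one per row.
X = [ 1 1; 1 2; 2 1; 2 2; 8 1; 8 2; 9 1; 9 2; 5 8; 5 9; 6 8; 6 9 ];
N = size ( X, 1 );
Kmax = 10;
E = zeros ( Kmax, 1 );

for K = 1 : Kmax
  C = X(1:K,:);                      % start from the first K points
  D2 = zeros ( N, K );
  for it = 1 : 50                    % fixed iteration count, for brevity
    % Squared distance from every point to every center.
    for j = 1 : K
      D2(:,j) = sum ( ( X - repmat ( C(j,:), N, 1 ) ).^2, 2 );
    end
    [ ~, nearest ] = min ( D2, [], 2 );
    % Move each center to the average of its nearest points.
    for j = 1 : K
      if ( any ( nearest == j ) )
        C(j,:) = mean ( X(nearest==j,:), 1 );
      end
    end
  end
  E(K) = sum ( sum ( ( X - C(nearest,:) ).^2 ) );
end

plot ( 1 : Kmax, E, 'o-' );
xlabel ( 'Number of clusters K' );
ylabel ( 'Energy E(K)' );
```

To use real data, replace X by an array holding one point per row.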

Repeat the computation, but now choose 10 points randomly in [0,10] by [0,5]. What does your energy plot look like?
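The random points for this experiment could be generated as shown in this short sketch, which assumes the points are meant to be uniformly distributed over the rectangle. The variable names n and X are illustrative.

```matlab
% 10 points uniformly distributed in [0,10] x [0,5], one point per row.
n = 10;
X = [ 10.0 * rand ( n, 1 ), 5.0 * rand ( n, 1 ) ];
```

Since there are only 10 points, the energy should drop to zero at K = 10, where every point is its own center.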

Compute the energy graph for the set of 100 points in sample data set 2 (a PDF image of the set is also available). Try making a plot of the energy as you vary K from 1 to 100. You should see that the energy curve has several sudden drops as you increase K. If the clustering for this set doesn't show up in the energy diagram, try the tighter sample data set 3.

These data sets lie in a 100 x 100 box. We are interested in the energy computations as the number of clusters varies from 1 to 100.

For several sets of data, we have chosen a range of cluster values K, and tried to determine the minimum energy Emin(K) using clustering techniques. Our iteration technique isn't perfect: using 20 iterations can give a significantly lower energy than using 10, and different starting points for the cluster centers can result in different clusters and energies. We know our technique is flawed because we sometimes compute a higher "minimum" energy after increasing the number of clusters, which can't be right, since Emin(K) never increases with K.
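One common remedy for the dependence on starting points, which is not part of the technique described here, is to repeat the iteration from several random sets of starting centers and keep the lowest energy found. The sketch below assumes made-up data and arbitrary values for the number of restarts and iterations; it still gives only an upper bound on Emin(K), not a guaranteed minimum.

```matlab
% Hypothetical data: three tight groups of four points each.
X = [ 1 1; 1 2; 2 1; 2 2; 8 1; 8 2; 9 1; 9 2; 5 8; 5 9; 6 8; 6 9 ];
N = size ( X, 1 );
K = 3;
n_restart = 20;     % number of random starts; an arbitrary choice
n_iter = 50;        % iterations per start; also arbitrary

E_first = 0.0;      % energy when the first K points are the starting centers
E_best = Inf;       % lowest energy over all starts

for r = 0 : n_restart
  if ( r == 0 )
    C = X(1:K,:);                    % the original starting choice
  else
    p = randperm ( N );              % K distinct points chosen at random
    C = X(p(1:K),:);
  end
  D2 = zeros ( N, K );
  for it = 1 : n_iter
    for j = 1 : K
      D2(:,j) = sum ( ( X - repmat ( C(j,:), N, 1 ) ).^2, 2 );
    end
    [ ~, nearest ] = min ( D2, [], 2 );
    for j = 1 : K
      if ( any ( nearest == j ) )
        C(j,:) = mean ( X(nearest==j,:), 1 );
      end
    end
  end
  E = sum ( sum ( ( X - C(nearest,:) ).^2 ) );
  if ( r == 0 ), E_first = E; end
  E_best = min ( E_best, E );
end

fprintf ( 'first-K start: E = %g, best of %d starts: E = %g\n', ...
  E_first, n_restart + 1, E_best );
```

Because the first-K start is included among the candidates, the best energy reported can never be worse than the one the original starting choice produces.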

Question One: the form of the Emin curve. For several sets of 100 random points in [0,100] x [0,100], we've plotted Emin(K) and seen a graph that suggests a hyperbola. It can't be exactly a hyperbola, because it's exactly 0 at K = N and beyond. And of course, the graph we get depends on the random points we select, but not by much.
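One way to check how closely a curve follows a hyperbola, offered here as a suggestion and not as part of the project, is a straight-line fit of log(E) against log(K): a hyperbola E = c / K gives a slope of -1. In the sketch below, E is filled with synthetic placeholder values that form an exact hyperbola, purely to show the mechanics; they are not measured energies and should be replaced by computed values of Emin(K).

```matlab
% Synthetic placeholder values, NOT measured energies: an exact hyperbola.
N = 100;
K = ( 1 : N - 1 )';          % leave out K = N, where E = 0 and log(E) fails
E = 5000.0 ./ K;

% Fit log(E) = p(1) * log(K) + p(2); a hyperbola has slope p(1) = -1.
p = polyfit ( log ( K ), log ( E ), 1 );
exponent = p(1);
scale = exp ( p(2) );

fprintf ( 'E(K) is roughly %g * K^(%g)\n', scale, exponent );
```

A fitted exponent noticeably different from -1 would suggest a power law other than a hyperbola.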

One topic of interest is the shape of the energy curve for random data. Here are some points to investigate:

As part of this investigation, you'll need to make data sets in 1D, 2D, or 3D, containing points that are random, clustered, or a combination of both. To help with this process, copy dataset.m, a MATLAB M file that can generate datasets for you.

Question Two: the effect of clustering. On the interval [0,100], we will compare the energy of three sets of points:

To make the clustered data set, use the following input:
    boxmin = 0
    boxmax = 100
    noise = 0
    ncluster = 100
    center = [ 20, 60, 80 ]
    spread = [ 10, 5, 2 ]
    x = dataset ( boxmin, boxmax, noise, ncluster, center, spread )
  
For the noisy data set, use the following two changed values:
    noise = 20
    ncluster = 80
  
For each dataset, compute the energy for K = 1 through 10 clusters. Compare the graphs. We expect the energy to be lower for the clustered data sets.
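If dataset.m is not at hand, a noisy clustered data set of the kind described above could be imitated with the following sketch. It is a hypothetical stand-in, not the actual dataset.m: the meanings assigned to noise, ncluster and spread, the normal distribution, the random choice of center, and the clipping to the box are all assumptions.

```matlab
% Assumed meanings (dataset.m itself may differ):
%   noise    = number of points spread uniformly over [boxmin,boxmax]
%   ncluster = number of points drawn near the cluster centers
%   spread   = standard deviation of the points around each center
boxmin = 0;
boxmax = 100;
noise = 20;
ncluster = 80;
center = [ 20, 60, 80 ];
spread = [ 10, 5, 2 ];

% Work with column vectors so that indexing returns columns.
center = center(:);
spread = spread(:);

% Uniform background noise.
x_noise = boxmin + ( boxmax - boxmin ) * rand ( noise, 1 );

% Each clustered point picks one of the centers at random (assumption).
idx = randi ( length ( center ), ncluster, 1 );
x_cluster = center(idx) + spread(idx) .* randn ( ncluster, 1 );

% Clip stray points back into the box (assumption).
x_cluster = min ( max ( x_cluster, boxmin ), boxmax );

x = [ x_cluster; x_noise ];
```

Setting noise = 0 and ncluster = 100 gives the purely clustered case.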

To get started on writing your report, I've made a brief LaTeX outline.


Last modified on 01 August 2001.