Energy Minimization and Clustering

This project began as an offshoot of the Voronoi project. Given a set of abstract data values, which we can think of as vectors in a multidimensional space, we want to organize them into clusters. If we specify that we want to use K clusters, then there is a clustering that minimizes the "energy" in a certain sense.

To be specific, we suppose we have a set of N points X(I), which we want to organize into K clusters C(J) with "centers" or "representative points" Z(J). This results in an "energy"

    E(K) = sum ( for each cluster C(J) ) 
             sum ( for each point X(I) in C(J) ) 
               distance ( X(I), Z(J) )**2

This energy is essentially the sum of the squares of the lengths of the "roads" from each center point to its "suburbs". Of course, for a given number of clusters K, there are many possible clusterings, but for a fixed K, there must be a minimum possible energy value Emin(K). Emin(K) never increases as K increases, and Emin(N) = 0, since each point can then serve as its own center.
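As a small worked example of the energy defined above, the following sketch evaluates E for one fixed clustering of four points on a line. The data values and the names label and dev are made up for illustration, and each center is taken to be the average of its cluster.

```matlab
% Four points on a line, split into K = 2 clusters (illustrative values).
X = [ 0; 1; 10; 11 ];          % N-by-D array, one point per row
label = [ 1; 1; 2; 2 ];        % label(I) = J means X(I) belongs to C(J)
K = max ( label );

E = 0;
for j = 1 : K
  members = X(label==j,:);
  Zj = mean ( members, 1 );    % center Z(J) = average of the cluster
  dev = members - repmat ( Zj, size ( members, 1 ), 1 );
  E = E + sum ( dev(:).^2 );   % add squared distances to Z(J)
end
fprintf ( 'E(%d) = %g\n', K, E );   % prints E(2) = 1
```

For the same four points, putting everything in one cluster (center 5.5) gives an energy of 101, and giving each point its own cluster gives 0.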

For our first example, we just want to try to sketch out some code that will work, and make sure it runs properly. To start with, let's assume we've picked a value for K, and we want to compute the Voronoi clusters. The code might look something like this:

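    function [ Z, Energy ] = K_Clusters ( K, X )

    % X is N-by-D, one point per row; Z is K-by-D, one center Z(J) per row.
    % Initialize by setting Z(1) = X(1),...,Z(K) = X(K).
      Z = X(1:K,:);
      Energy_old = Inf;

      for it = 1 : 100    % iteration cap, in case the energy never settles

    %  For each X(I), find the nearest center Z(J).
        D2 = zeros ( size ( X, 1 ), K );
        for j = 1 : K
          D2(:,j) = sum ( ( X - repmat ( Z(j,:), size ( X, 1 ), 1 ) ).^2, 2 );
        end
        [ ~, nearest ] = min ( D2, [], 2 );

    %  Replace Z(J) by the average of all the points that were closest to Z(J).
        for j = 1 : K
          if ( any ( nearest == j ) )
            Z(j,:) = mean ( X(nearest==j,:), 1 );
          end
        end

    %  Compute the new energy.
        Energy = sum ( sum ( ( X - Z(nearest,:) ).^2 ) );

    %  If the energy changed by only a "small" amount, then quit.
        if ( abs ( Energy_old - Energy ) <= 1.0e-8 * max ( Energy, 1 ) )
          break
        end
        Energy_old = Energy;

      end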

Try this code on sample data set 1. Make a plot of the energy as you vary K from 1 to 10.

Repeat the computation, but now choose 10 points randomly in [0,10] by [0,5]. What does your energy plot look like?
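The random-point experiment might be scripted roughly as follows. This is only a sketch: it assumes your own K_Clusters.m is on the MATLAB path, that it follows the signature shown earlier (the second output is the energy), and that it accepts X as an array with one point per row. The row layout and the names n, Kmax and E are assumptions, not part of the assignment.

```matlab
% 10 uniform random points in [0,10] x [0,5], one point per row.
n = 10;
X = [ 10 * rand ( n, 1 ), 5 * rand ( n, 1 ) ];

% Energy returned by K_Clusters for K = 1, ..., 10.
% K_Clusters is your own routine; it is not a built-in MATLAB function.
Kmax = 10;
E = zeros ( Kmax, 1 );
for K = 1 : Kmax
  [ ~, E(K) ] = K_Clusters ( K, X );
end

plot ( 1 : Kmax, E, 'o-' );
xlabel ( 'Number of clusters K' );
ylabel ( 'Energy E(K)' );
title ( 'Energy versus K, 10 random points' );
```

To run the same sweep on a sample data set, replace the line that builds X with a command that loads the data file.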

Compute the energy graph for the set of 100 points in sample data set 2 (a PDF image of the set is also available). Try making a plot of the energy as you vary K from 1 to 100. You should see that the energy curve has several sudden drops as you increase K. If the clustering for this set doesn't show up in the energy diagram, try the tighter sample data set 3.

The following data sets lie in a 100 x 100 box. We are interested in the energy computations as the number of clusters varies from 1 to 100.

Data set 2, 100 clustered points.
Data set 3, 100 tightly clustered points.
Data set 4, 100 loosely clustered points.
Data set 5, 100 random points.
Data set 6, 100 tightly clustered points plus 30 random points.

For several sets of data, we have chosen a range of cluster values K, and tried to determine the minimum energy Emin(K) using clustering techniques. Our iteration technique isn't perfect; using 20 iterations can give a significantly lower energy than 10. Using different starting points for the cluster centers can result in different clusters and energies. We're sure our technique is flawed because sometimes we get a higher minimum energy when we increase the number of clusters, which can't be right.
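One common way to reduce the dependence on starting points is to rerun the iteration from several different starts and keep the lowest energy found. The sketch below assumes the K_Clusters signature shown earlier, and that K_Clusters takes its initial centers from the first K rows of X, so that shuffling the rows changes the start. The names best_energy and nstart are hypothetical, introduced only for illustration.

```matlab
function Ebest = best_energy ( Kmax, X, nstart )
% BEST_ENERGY  lowest energy found for K = 1..Kmax over nstart shuffled starts.
% Hypothetical helper: relies on a user-written K_Clusters ( K, X ) that
% takes its initial centers from the first K rows of X.
  N = size ( X, 1 );
  Ebest = Inf ( Kmax, 1 );
  for K = 1 : Kmax
    for s = 1 : nstart
      Xs = X(randperm(N),:);             % new row order, hence new starting centers
      [ ~, E ] = K_Clusters ( K, Xs );
      Ebest(K) = min ( Ebest(K), E );
    end
    % Emin(K) <= Emin(K-1), so the best value found for K-1 clusters is
    % also an upper bound on Emin(K); keep whichever bound is lower.
    if ( 1 < K )
      Ebest(K) = min ( Ebest(K), Ebest(K-1) );
    end
  end
end
```

A call such as Ebest = best_energy ( 10, X, 20 ) would try 20 starts for each K. The final step reports an upper bound on Emin(K) rather than the energy of a clustering that was actually found, so remove it if you want to see where the iteration itself goes wrong.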

Question One: the form of the Emin curve. For several sets of 100 random points in [0,100] x [0,100], we've plotted Emin(K) and seen a graph that suggests a hyperbola. It can't be a true hyperbola, because it is exactly 0 for K = N and beyond. And of course, the graph we get depends on the random points we select, but not by much.

One topic of interest is the shape of the energy curve for random data. Here are some points to investigate:

If we solve the same problem, but use 200 or 400 random points, does the Emin(K) curve have the same shape, once we rescale the X axis?
If we solve the problem with 100 random points in a [0,50] by [0,200] space, what happens to the Emin(K) curve?
Solve the problem with 100 random points in a 1D, 2D and 3D region. Divide each set of energy data by its maximum value, and then plot all three energy curves together. Do they lie on top of each other, or is there some strong trend in their behavior?
Repeat the previous exercise, but now use 10 random points in 1D, 100 random points in 2D and 1000 random points in 3D. Rescale the energy as before, but now also rescale the value of K to K/KMAX, so that all the data fits in the unit box. Do the 1D, 2D and 3D curves correspond at all?
Can you approximate the Emin(K) curve using inverse powers of K, or logarithms? Does the same formula work for several different versions of the problem?
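The rescaling and the inverse-power fit suggested in these points can be sketched as follows. The energy values here are synthetic placeholders (exactly proportional to 1/K), not results from any data set; replace K and E with your own computed values. The names E_scaled, K_scaled, ok, coef, p and a are assumptions introduced for this example.

```matlab
% Placeholder energy curve (synthetic, NOT measured): use your own Emin values.
K = ( 1 : 50 )';
E = 8.0e4 ./ K;

E_scaled = E / max ( E );       % energies now lie in (0,1]
K_scaled = K / max ( K );       % K/KMAX, so the curve fits in the unit box

% Least-squares fit of log(E) = log(a) + p*log(K), that is, E ~ a * K^p.
% Only entries with E > 0 can be used, since log(0) is -Inf.
ok = ( E > 0 );
coef = polyfit ( log ( K(ok) ), log ( E(ok) ), 1 );
p = coef(1);
a = exp ( coef(2) );
fprintf ( 'E is roughly %g * K^(%g)\n', a, p );

loglog ( K, E, 'o', K, a * K.^p, '-' );
xlabel ( 'K' );
ylabel ( 'Emin(K)' );
legend ( 'data', 'power-law fit' );
```

On a log-log plot a pure power law appears as a straight line, so curvature in your own data is a quick visual sign that a single inverse power of K is not enough. The point K = N, where the energy is 0, has to be left out of a logarithmic fit.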

As part of this investigation, you'll need to make data sets in 1D, 2D, or 3D, containing points that are random, clustered, or a combination of both. To help with this process, copy dataset.m, a MATLAB M file that can generate datasets for you.

Question Two: the effect of clustering. On the interval [0,100], we will compare the energy of three sets of points:

100 uniform random points;
100 points in 3 clusters;
80 points in 3 clusters, plus 20 random "noise" points.

To make the clustered data set, use the following input:

    boxmin = 0
    boxmax = 100
    noise = 0
    ncluster = 100
    center = [ 20, 60, 80 ]
    spread = [ 10, 5, 2 ]
    x = dataset ( boxmin, boxmax, noise, ncluster, center, spread )

For the noisy data set, use the following two changed values:

    noise = 20
    ncluster = 80

For each dataset, compute the energy for K = 1 through 10 clusters. Compare the graphs. We expect the energy to be lower for the clustered data sets.
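The comparison could be organized along the following lines. This sketch assumes that both dataset.m and your own K_Clusters.m are on the MATLAB path with the calling sequences shown above, and that K_Clusters expects one point per row. The uniform set is generated directly with rand, which is a choice made here rather than something the assignment prescribes.

```matlab
boxmin = 0;
boxmax = 100;
center = [ 20, 60, 80 ];
spread = [ 10, 5, 2 ];

% Three data sets on [0,100]. x(:) forces a column (one 1D point per row),
% since the orientation returned by dataset.m is not stated here.
x_uniform = boxmin + ( boxmax - boxmin ) * rand ( 100, 1 );
x_cluster = dataset ( boxmin, boxmax, 0, 100, center, spread );
x_noisy = dataset ( boxmin, boxmax, 20, 80, center, spread );
sets = { x_uniform(:), x_cluster(:), x_noisy(:) };

% E(K,s) is the energy of data set s when K clusters are used.
Kmax = 10;
E = zeros ( Kmax, numel ( sets ) );
for s = 1 : numel ( sets )
  for K = 1 : Kmax
    [ ~, E(K,s) ] = K_Clusters ( K, sets{s} );
  end
end

semilogy ( 1 : Kmax, E, 'o-' );
xlabel ( 'Number of clusters K' );
ylabel ( 'Energy' );
legend ( 'uniform', 'clustered', 'clustered + noise' );
```

A logarithmic energy axis is used because the three curves may differ by a large factor once K reaches the number of true clusters.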

To get started on writing your report, I've made a brief LaTeX outline.

Last modified on 01 August 2001.