KMeans Milestone

http://people.sc.fsu.edu/~jburkardt/classes/urop_2016/kmeans_milestone.html

The K-Means problem is a warmup exercise for the tasks we will need to do involving geometry. The K-Means task starts with a collection of N items of data, and asks for the "best" way of sorting them into K groups, so that the items in each group are "close" to each other.

This means we need to

numerically measure how "closely related" any two items are;
numerically measure how "good" each group of items is;
numerically measure how "good" the total collection of groups is;
determine an algorithm that will get us a good grouping.

Let G(J) be the J-th of the K groups. We would expect that any item X(I) in this group is there because it is close to the other items in the group. We can measure the closeness of two items using Euclidean distance. It turns out to be expensive to compare every pair of items, so instead we represent each group by its average value, and measure closeness relative to that. Let M(J) be the average of the items in group G(J). We say the "point energy" E(I,J) associated with item X(I) in group G(J) is (X(I)-M(J))^2, that is, the squared Euclidean distance from X(I) to M(J). The "group energy" E(J) is then simply the sum of the point energies of all the points in the group. There are technical reasons why squaring is the right thing to do; one of them is that the average M(J) is exactly the point that minimizes the sum of squared distances to the items in the group.
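
One reason squared distance pairs naturally with the average is that the average minimizes the sum of squared distances. The calculation below is a sketch of that fact, not part of the original notes. It writes X_I for X(I), and the symbol \bar{X} for the average of the N items is notation introduced here for the illustration.

```latex
\begin{align*}
E(X, M) &= \sum_{I=1}^{N} \lVert X_I - M \rVert^2 ,
\qquad \bar{X} = \frac{1}{N} \sum_{I=1}^{N} X_I , \\
\sum_{I=1}^{N} \lVert X_I - M \rVert^2
  &= \sum_{I=1}^{N} \lVert X_I - \bar{X} \rVert^2
   + 2\, (\bar{X} - M) \cdot \sum_{I=1}^{N} (X_I - \bar{X})
   + N \lVert \bar{X} - M \rVert^2 \\
  &= \sum_{I=1}^{N} \lVert X_I - \bar{X} \rVert^2
   + N \lVert \bar{X} - M \rVert^2 .
\end{align*}
```

The middle term vanishes because the deviations X_I - \bar{X} sum to zero. The last line is smallest exactly when M = \bar{X}, so the average is the minimizer.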

If E(J) is the energy of the J-th group, then let E, the "total energy", be the sum of the group energies E(J) for 1 <= J <= K. We will use the energy value as an indicator of goodness of grouping. Given two possible groupings, we will prefer the one with smaller total energy.
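
The energy definitions above can be illustrated with a short Python sketch. The function names, the choice to store items as tuples of floats, and the sample data are all assumptions made for this example. It is one possible way to organize the computation, not a prescribed one.

```python
def mean(group):
    """Component-wise average M of a non-empty list of equal-length tuples."""
    n = len(group)
    return tuple(sum(x[d] for x in group) / n for d in range(len(group[0])))


def point_energy(x, m):
    """Squared Euclidean distance between item x and group average m."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def group_energy(group):
    """Sum of the point energies of all items in one group."""
    m = mean(group)
    return sum(point_energy(x, m) for x in group)


def total_energy(groups):
    """Sum of the group energies; empty groups contribute nothing."""
    return sum(group_energy(g) for g in groups if g)


# Two candidate groupings of the same five (made-up) items.
grouping_a = [[(0.0, 0.0), (2.0, 0.0)],
              [(10.0, 10.0), (10.0, 12.0), (10.0, 14.0)]]
grouping_b = [[(0.0, 0.0)],
              [(2.0, 0.0), (10.0, 10.0), (10.0, 12.0), (10.0, 14.0)]]

energy_a = total_energy(grouping_a)
energy_b = total_energy(grouping_b)
print(energy_a, energy_b)
```

Under the preference rule stated above, the grouping with the smaller printed value is the better one. Here that is `grouping_a`, which keeps the two items near the origin together.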

The algorithm we will use is a weak version of the K-means algorithm. Begin by assigning each point to a group, in any way you want (but it is helpful to avoid having any empty groups). Now iterate the following steps:

For each group G(J), compute the average M(J), and the group energy E(J).
Now allow the points X to change groups: for every X(I), compute the point energy (X(I)-M(J))^2 for every group G(J), and move X(I) to the group for which this energy is smallest.
If no points changed group, exit with success (local minimum).
If the total energy decreased by less than some small tolerance, exit.
If the number of iterations exceeds some maximum, exit.
If any group is empty, repair it somehow.
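
The iteration above can be sketched in Python as follows. Several details are assumptions made for this illustration, since the notes leave them open: the function name `kmeans_weak`, the round-robin initial assignment, the default tolerance `tol`, and the repair rule, which moves the point farthest from its own group average into the empty group.

```python
def mean(group):
    """Component-wise average of a non-empty list of equal-length tuples."""
    n = len(group)
    return tuple(sum(x[d] for x in group) / n for d in range(len(group[0])))


def sq_dist(x, m):
    """Squared Euclidean distance (the point energy)."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def kmeans_weak(points, k, max_iter=100, tol=1e-9):
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of points")

    def summarize(labels):
        # Group averages M(J) and total energy E for a labeling.
        groups = [[points[i] for i in range(n) if labels[i] == j] for j in range(k)]
        centers = [mean(g) for g in groups]
        energy = sum(sq_dist(points[i], centers[labels[i]]) for i in range(n))
        return centers, energy

    labels = [i % k for i in range(n)]  # assumed start: round-robin, no empty group
    prev_energy = float("inf")
    for _ in range(max_iter):
        centers, energy = summarize(labels)
        if prev_energy - energy < tol:  # energy barely decreased
            return labels, energy, "small decrease"
        prev_energy = energy
        # Move each point to the group with the smallest point energy.
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in points]
        moved = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        if moved == 0:  # no point changed group: local minimum
            return labels, energy, "converged"
        # Assumed repair rule: give each empty group the point that lies
        # farthest from its own average, taken from a group of size > 1.
        counts = [labels.count(j) for j in range(k)]
        for j in range(k):
            if counts[j] == 0:
                donors = [i for i in range(n) if counts[labels[i]] > 1]
                far = max(donors, key=lambda i: sq_dist(points[i], centers[labels[i]]))
                counts[labels[far]] -= 1
                labels[far] = j
                counts[j] = 1
                prev_energy = float("inf")  # a repair may raise the energy
    return labels, summarize(labels)[1], "max iterations"


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, energy, reason = kmeans_weak(data, 2)
print(labels, energy, reason)
```

On this made-up data set the sketch separates the three points near the origin from the three points near (10, 10) and stops because no point changed group. Other starting assignments or repair rules may reach a different local minimum.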

Topics:

Verify that for a set of points X, the average M minimizes the energy E(X,M)=sum(1<=I<=N)(X(I)-M)^2.
Given a set of points, find some procedure to assign each of them into just one of K groups.
Compute the averages of each group.
Compute the point, group, and total energies for a grouping.
Determine which points should be switched to a different group.
Determine how many points moved, and how much the energy decreased.
Decide how to repair groups that become empty...or is it impossible for a group to become empty?
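
On the last topic, a small experiment can help before deciding on a repair strategy. The sketch below uses made-up one-dimensional data and a hypothetical helper named `reassign`. It performs a single reassignment step and reports which groups end up with no points.

```python
def reassign(points, labels, k):
    """Move every 1D point to the group whose average is nearest."""
    centers = []
    for j in range(k):
        members = [x for x, g in zip(points, labels) if g == j]
        centers.append(sum(members) / len(members))  # assumes no group is empty yet
    return [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in points]


points = [0.0, 4.0, 6.0, 10.0]
labels = [0, 1, 2, 0]          # group 0 holds the two extreme points

new_labels = reassign(points, labels, 3)
empty = [j for j in range(3) if j not in new_labels]
print(new_labels, empty)
```

Here group 0 starts with average 5, but each of its two points is closer to the average of a neighboring group, so both leave and group 0 is left empty after one step. This suggests that a repair step is worth having, at least for unlucky starting assignments.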

Last revised on 28 October 2016.