RUNNING BIG PROBLEMS

25 April 2001

To get an idea of how hard it is to solve a problem with roughly a
million generators, we set up a sequence of problems, each about 10
times larger than the last, and recorded the required CPU time.  To
get quicker results, we reduced the number of CVT iteration loops to 1
and the number of sampling points per generator to 100.  We also did a
moment calculation, which costs about the same as one CVT iteration,
so effectively we are timing 2 CVT iterations.

We used bins to compute the nearest neighbors.  The problem had
physical dimensions of 100 x 100 x 20, and we used bins that were
cubes.  The number of bins had to be increased as the problem size
increased.  At first we planned to use 25 x 25 x 5 bins for all
problems, but we quickly discovered that the execution time then
increased dramatically.  We then realized that one factor in the
computation is the average number of points in a nonempty bin, since
this is roughly the number of points that must be compared in each
nearest neighbor search.  By adjusting the bins as the problem grew so
that this value stayed roughly constant, the execution time grew as we
expected, roughly linearly with the number of generators.  (Sketches
of the bin-based search and of the bin-sizing rule appear after the
tables.)

Here are the numbers:

  Generators  Bins              Points/Nonempty Bin  CPU seconds  CPU Hours
  ----------  ----------------  -------------------  -----------  ---------
       5,000   25 x  25 x  5              7                  150       0.04
      50,000   50 x  50 x 10              9                 2222       0.62
     500,000  100 x 100 x 20             13                29030       8.06

As an illustration of what can happen if you don't increase the number
of bins, we compare two runs:

  Generators  Bins              Points/Nonempty Bin  CPU seconds  CPU Hours
  ----------  ----------------  -------------------  -----------  ---------
      50,000   25 x  25 x  5             69                 6977       1.93
      50,000   50 x  50 x 10              9                 2222       0.62
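
As a concrete illustration of the bin-based search, here is a minimal
Python sketch.  It is not the code used for these runs: the names
(bin_key, build_bins, nearest_generator) and the example parameters
are invented for illustration, and it checks only the query's bin and
its 26 neighbors, whereas a production search would expand the shell
of bins outward until the nearest neighbor is provably found.

    from collections import defaultdict
    import random

    def bin_key(p, nbins, box):
        # Map a 3D point to the (i,j,k) index of its bin.
        return tuple(min(int(p[d] / box[d] * nbins[d]), nbins[d] - 1)
                     for d in range(3))

    def build_bins(points, nbins, box):
        # Hash each generator into a cell of a regular nbins grid.
        bins = defaultdict(list)
        for i, p in enumerate(points):
            bins[bin_key(p, nbins, box)].append(i)
        return bins

    def nearest_generator(q, points, bins, nbins, box):
        # Nearest generator to sample point q, checking q's bin and
        # its 26 neighbors only.
        ci, cj, ck = bin_key(q, nbins, box)
        best, best_d2 = -1, float("inf")
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for dk in (-1, 0, 1):
                    for i in bins.get((ci + di, cj + dj, ck + dk), ()):
                        d2 = sum((q[d] - points[i][d]) ** 2
                                 for d in range(3))
                        if d2 < best_d2:
                            best, best_d2 = i, d2
        return best

    # Example: 5,000 generators in the 100 x 100 x 20 box,
    # hashed into 25 x 25 x 5 bins.
    box, nbins = (100.0, 100.0, 20.0), (25, 25, 5)
    generators = [tuple(random.uniform(0.0, box[d]) for d in range(3))
                  for _ in range(5000)]
    bins = build_bins(generators, nbins, box)
    print(nearest_generator((12.3, 45.6, 7.8), generators, bins, nbins, box))

Each query touches a constant number of bins, so its cost is governed
by the bin occupancy, which is the Points/Nonempty Bin column in the
tables above.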
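
The bin-sizing rule can be read off the tables: keep the average
number of generators per bin fixed, so the total number of bins grows
linearly with the number of generators.  The sketch below is one
reading of that rule, not code from the run; target_per_bin is a
hypothetical tuning knob, and a target of about 2 generators per bin
roughly reproduces the grids in the first table (exactly so for the
50,000-generator run).

    def choose_bins(n_generators, box, target_per_bin=2.0):
        # Pick a cubic bin side h so the average generator count per
        # bin stays near target_per_bin as the problem grows; assumes
        # the generators are spread roughly uniformly over the box.
        volume = box[0] * box[1] * box[2]
        h = (target_per_bin * volume / n_generators) ** (1.0 / 3.0)
        return tuple(max(1, round(box[d] / h)) for d in range(3))

    # choose_bins(50000, (100.0, 100.0, 20.0))  ->  (50, 50, 10)

With bins chosen this way, each query does a constant amount of work,
so a full pass over the sampling points is linear in the number of
generators, which matches the roughly linear growth in the first
table.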