%  outline_jason.tex
%  16 July 2001
%
\documentclass{article}

\title{REU Report Outline}
\author{Jason Grinblat}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
This is an outline of what your report should look like.  Your abstract
should summarize your work in two or three sentences.
\end{abstract}

\tableofcontents

\section{Introduction}

Your introduction should be an overview of the whole project, and a summary
of the report itself.  Roughly, you might try to write a paragraph 
here corresponding to each section that follows.

\section{The Setting}

Explain the setting in which this problem came up, namely, the probabilistic
computation of a Centroidal Voronoi Tessellation, using sampling.  It is
not necessary to discuss the properties of CVT.  Just point out that we
carry out ITMAX iterations, that we have NG generators, and that on each
step we consider NG * NS sample points (or whatever these names are),
so that, essentially, we need to find the nearest neighbors of
ITMAX * NG * NS points.

Here, it might be good to include, as an illustration, the nearest neighbor
MATLAB code, so people see what you're talking about.
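For reference, here is a minimal sketch of what that brute force code
(presumably the CODE1 of the comparisons later) might look like.  The
variable names (gx, gy for the generators, sx, sy for the sample points)
are placeholders, not necessarily the names in our actual code:

\begin{verbatim}
%  CODE1 (sketch): for each sample point (sx(i),sy(i)),
%  find the index of the nearest generator by checking
%  all NG of them.
%
function nearest = nearest_brute ( gx, gy, sx, sy )

  ns = length ( sx );
  nearest = zeros ( ns, 1 );

  for i = 1 : ns
    d = ( gx - sx(i) ).^2 + ( gy - sy(i) ).^2;
    [ dmin, nearest(i) ] = min ( d );
  end
\end{verbatim}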

We think of NG as defining the size of the problem, and we notice that
the program tends to spend most of its time computing nearest neighbors,
so now we ask, what is the cost of the nearest neighbor operation overall,
and for a single point?

We discover that, for our simplest algorithm, one nearest neighbor costs
us NG operations, meaning the total work is of the order of 
$ITMAX \cdot NG^2 \cdot NS$.

To give an idea of what this means, it might be helpful at this point to
show a typical computation that we did using MATLAB, with a plot of the
run time as NG increases from 100 to 200 to 400 and so on, suggesting
that problems of size 1,000 would be very hard, and 10,000 would take an
enormous amount of time.

Conclude this section by saying that we intend to investigate our options
for speeding up the nearest neighbor search.  Note that we started out hoping
for a simple exact routine with no preprocessing (we didn't even know what
that meant at first), but that as we searched, we found that we were willing
to do some preprocessing if necessary, we were willing to have the routine
be inexact as long as we could control the error, and we were willing to
put up with SOME complexity, but not too much.  We also hoped for an
algorithm that would work the same for 2D, 3D, any-D, but in a pinch
we'd take a good 2D code.

Another important point was being able to find an algorithm we could
understand or get documentation on (this is an issue with the DSEARCH
routine in MATLAB, whose ``documentation'' essentially says to go look
at the QHULL program).

\section{The DSEARCH Routine}

Here, discuss the mysterious DSEARCH routine.  Show how you have to rewrite
your nearest neighbor code.  Explain how the documentation says you can
run faster if you set up and save a particular sparse matrix.  I have
figured out what this sparse matrix represents, so check with me and we
can go over it.  

Include the new versions of the nearest neighbor code, perhaps giving them
names like CODE2 and CODE3 or some other simple key.
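For instance, a DSEARCH-based version might look roughly like the sketch
below; the exact calling sequence should be checked against the MATLAB
documentation, and the faster variant that passes a precomputed sparse
matrix as an extra argument is the one we still need to sort out:

\begin{verbatim}
%  CODE2 (sketch): use DELAUNAY and DSEARCH instead of
%  the explicit loop.  In the CVT iteration, the
%  triangulation has to be redone whenever the
%  generators move.
%
function nearest = nearest_dsearch ( gx, gy, sx, sy )

  tri = delaunay ( gx, gy );
  nearest = dsearch ( gx, gy, tri, sx, sy );
\end{verbatim}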

Now we have something to compare.  So make a table comparing CODE1, CODE2
and CODE3 on problems of size 100, 200, 400, 800 to see if the time
doubles or quadruples.  Maybe make a graph of all three sets of data.

Mention the good things about this routine: it runs fast; it runs the same
for 2D, 3D, any-D.  The bad thing: we have little idea how it works.
Plus, there's the overhead of computing the Delaunay triangulation,
although we can estimate this work as taking NG * log ( NG ) per step,
so that over the whole run it adds up to ITMAX * NG * log ( NG ).

Note another funny thing about MATLAB: many MATLAB routines, such as DSEARCH,
are written in C, and so execute very much faster than regular MATLAB code.
This makes it a little hard to conduct fair comparisons of two codes, if
one is written in (interpreted) MATLAB code, and the other is compiled.
That's another problem with DSEARCH - part of its big speed advantage might
be simply that it's using compiled code.

\section{The Spiral Method}

Although we didn't implement this routine, discuss it.  Explain that the
idea is simple, and works in any dimension.

Discuss how the implementation can be tricky.  In particular, talk about
the many different things that can happen:
\begin{itemize}
\item there are no points in the first cell you check;
\item there is one point in the first cell you check;
\item there are many points in the first cell you check;
\end{itemize}
and how, once you have a candidate for nearest neighbor, you have to
look at a few more cells to make sure you've looked everywhere.

Try to estimate the amount of preprocessing (because we have to have a list,
for each cell, of the generators contained in that cell). 
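To make that estimate concrete, here is a minimal sketch of the binning
step, assuming the generators lie in the unit square and the cells form
an NC by NC grid; the names are made up for illustration:

\begin{verbatim}
%  Spiral method preprocessing (sketch): record, for
%  each cell (i,j), the indices of the generators that
%  fall inside it.
%
function cell_list = bin_generators ( gx, gy, nc )

  cell_list = cell ( nc, nc );

  for k = 1 : length ( gx )
    i = min ( nc, 1 + floor ( gx(k) * nc ) );
    j = min ( nc, 1 + floor ( gy(k) * nc ) );
    cell_list{i,j} = [ cell_list{i,j}, k ];
  end
\end{verbatim}

This step touches each generator once, so the preprocessing cost should
be about NG operations, plus the storage for the cell lists.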

Also point out that the idea of using an underlying rectangular grid
was something that suggested other methods to us.

\section{The Grid Method}

(This might be CODE4)

Here, discuss the very simple grid method we had, which set up a grid,
computed the nearest neighbor for each grid point, and then, for each
sample point, figured out which grid box it landed in, and took as its
answer the nearest neighbor recorded at one of the four corners of that box.
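A minimal sketch of this idea is given below, under the assumptions that
the points lie in the unit square, that the grid has (NC+1) by (NC+1)
nodes, and that the corner we copy from is the one nearest the sample
point; the actual code may differ in these details:

\begin{verbatim}
%  Grid method (sketch): precompute the nearest
%  generator of every grid node, then answer each query
%  by copying the answer stored at its nearest node.
%
function nearest = nearest_grid ( gx, gy, sx, sy, nc )

%  Preprocessing: brute force search at each grid node.
%
  grid_near = zeros ( nc + 1, nc + 1 );
  for i = 1 : nc + 1
    for j = 1 : nc + 1
      d = ( gx - (j-1)/nc ).^2 + ( gy - (i-1)/nc ).^2;
      [ dmin, grid_near(i,j) ] = min ( d );
    end
  end
%
%  Processing: round each sample point to its nearest
%  grid node and copy that node's answer.
%
  i = 1 + round ( sy * nc );
  j = 1 + round ( sx * nc );
  nearest = grid_near ( sub2ind ( [nc+1,nc+1], i, j ) );
\end{verbatim}

The preprocessing loop is where the cost discussed below comes from;
the second half is the O(1)-per-point part.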

Show that this algorithm takes O(1) time to find the (approximate) nearest
neighbor of a single point.  Show that there is some inaccuracy, based on
the grid size.  Explain that we have left out the preprocessing cost, which,
for a 2D problem, requires us to compute, for each of NC cells, the nearest
neighbor from the set of NG points.  And if we want an accurate answer, NC
is roughly equal to NG, which means we end up doing $NG^2$ preprocessing
work.  In other words, we end up doing just as much work, and getting an
inaccurate answer.

Show some of your results about the inaccuracy of the algorithm, and
how you could control it by increasing the number of cells.  Mention
that this is a good thing.

Show some timing results: the total time, itemized into preprocessing time
and processing time.

I believe we convinced ourselves that for certain kinds of problems, this 
approach might be very good.  Do you recall what we said?


\section{The Grid with Sweeping Preprocessing}

(This might be CODE5).

Here, talk about the code where we do the preprocessing, but try to
control its cost.  We can't guarantee that this will work; we can think
of examples of input data that would make it fail.  But explain what
we were thinking, and how we can hope that, on average, the preprocessing
takes less time than before.  Note that we still have some inaccuracies
in our computation.  Discuss how they can be controlled.

Again, give some results on inaccuracy and on timing, compared to CODE1.

\section{The Delaunay Triangle Neighbor Search}

(This might be CODE6)

Based on what I've found out about Delaunay search, let's try to set
up a code that does more or less what the MATLAB DSEARCH routine does,
for a 2D problem.  

We're only going to do a simple version, which works as follows.  Given
a point (X,Y), pick a triangle at random.  You can determine whether the
point is in this triangle.  If not, you can determine at least one side
of the triangle that you should ``jump over'' to find a new triangle to
examine.  Eventually, you find a triangle containing (X,Y).

There's a preprocessing step for this code: for each triangle, and for
each side of that triangle, we need to know the index of the OTHER
triangle that shares that side.  This list will allow us to ``exit''
the triangle by moving to its neighbor.  The list can be computed from
the triangle list that DELAUNAY gives us.  I am trying to write down
the details.
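Here is a rough sketch of the walking step, assuming the triangle list
TRI from DELAUNAY and a triangle neighbor list TNBR (still to be
constructed), where TNBR(T,S) is the triangle across the side opposite
vertex S of triangle T, or 0 on the boundary:

\begin{verbatim}
%  Triangle walk (sketch): keep jumping across any side
%  that separates (X,Y) from the opposite vertex; when no
%  such side exists, triangle T contains (X,Y).
%
function t = walk_to_triangle ( gx, gy, tri, tnbr, x, y )

  t = 1;                       %  or a random triangle

  while ( 1 )
    v = tri(t,1:3);
    moved = 0;
    for s = 1 : 3
      p = v ( mod(s,3) + 1 );    %  side opposite vertex S
      q = v ( mod(s+1,3) + 1 );  %  runs from P to Q;
      o = v ( s );               %  O is the opposite vertex
      dx = gx(q) - gx(p);
      dy = gy(q) - gy(p);
      c1 = dx * ( y - gy(p) ) - dy * ( x - gx(p) );
      c2 = dx * ( gy(o) - gy(p) ) - dy * ( gx(o) - gx(p) );
      if ( c1 * c2 < 0 & tnbr(t,s) ~= 0 )
        t = tnbr(t,s);           %  jump over side S
        moved = 1;
        break
      end
    end
    if ( ~moved )
      return
    end
  end
\end{verbatim}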

Explain what's going on in this code.  Show that the number of triangles
is of the same order as NG, and that, for a 2D problem, we'd expect
that two random points might be separated by sqrt(NG) triangles, so that
getting to the correct triangle would take sqrt(NG) steps, not NG.

Our version of this code will not always get the nearest neighbor.  
Explain why.  Discuss what would have to be done to guarantee that the
nearest neighbor is found.

\section{The Delaunay Node Neighbor Search}

It turns out that the Delaunay Triangle Neighbor Search, as we've
described it, only finds the triangle containing a point.  The vertices
of this triangle aren't guaranteed to be the nearest neighbors.  Here's
a variation on the algorithm which might work better.

Suppose we have a Delaunay triangulation.  Suppose we have, for every
point P in the triangulation, a list of all the ``neighbors'', that is,
all the other points that are connected directly to P by a side of
a Delaunay triangle.

Then the algorithm works as follows:
\begin{enumerate}
\item We pick a node N at random, and compute its distance to (X,Y).
\item We look at all the neighbors of this node, and find the closest one.
\item If all the neighbors are further away than N is, we accept N as the
nearest neighbor and stop.
\item Otherwise, we take the neighbor of N that was closest to (X,Y),
call that our new value of N, and repeat from the second step.
\end{enumerate}
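A minimal sketch of this loop, assuming we already have a cell array
NBR where NBR{N} lists the Delaunay neighbors of node N, might be:

\begin{verbatim}
%  Node neighbor walk (sketch): move greedily from a node
%  to its closest neighbor until no neighbor is closer
%  to (X,Y) than the current node.
%
function n = walk_to_nearest ( gx, gy, nbr, x, y )

  n = 1;                       %  or a random starting node
  dn = ( gx(n) - x )^2 + ( gy(n) - y )^2;

  while ( 1 )
    k = nbr{n};
    d = ( gx(k) - x ).^2 + ( gy(k) - y ).^2;
    [ dmin, i ] = min ( d );
    if ( dn <= dmin )
      return
    end
    n = k(i);
    dn = dmin;
  end
\end{verbatim}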

This algorithm requires some preprocessing, because the MATLAB Delaunay
code only gives us the list of nodes that make up triangles; we have
to build the node neighbor list ourselves.  I know how to do that,
and can show you.  
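For reference, one possible construction (to be checked against the one
we discussed) marks every Delaunay edge in a sparse adjacency matrix and
then reads off each node's neighbor list:

\begin{verbatim}
%  Node neighbor preprocessing (sketch): mark every
%  Delaunay edge in a sparse adjacency matrix, then read
%  off each node's neighbor list.
%
  tri = delaunay ( gx, gy );
  ng = length ( gx );

  row = [ tri(:,1); tri(:,2); tri(:,3) ];
  col = [ tri(:,2); tri(:,3); tri(:,1) ];
  adj = sparse ( row, col, 1, ng, ng );
  adj = ( adj + adj' ) ~= 0;

  nbr = cell ( ng, 1 );
  for n = 1 : ng
    nbr{n} = find ( adj(n,:) );
  end
\end{verbatim}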

The time complexity of the preprocessing is NG log NG, I believe, and the
cost of one nearest neighbor search is roughly 6 * sqrt ( NG ), assuming
that, on average, each node has 6 Delaunay node neighbors, and that two
nodes are, on average, about sqrt ( NG ) steps apart in the triangulation.

I believe that this algorithm will get the nearest neighbor most of the time.
I think an important factor here is that we're using a Delaunay triangulation,
otherwise sometimes you'd get stuck far away from (X,Y).

I haven't convinced myself that it will ALWAYS get the nearest neighbor.
There are some tricky cases that might occur.

\section{Discussion}

Here you can talk about the "philosophy" of what went on.  A computer
program that solves problems of varying size N can be analyzed to estimate
how much time it will take to solve a problem, based on its size.  A growth
rate of $N^2$ can be undesirable, and can quickly limit the problems you can
do.  In some cases, you can identify a small part of the computation that
is causing the program to work this slowly. 

In some cases, it is possible to simply replace the slow algorithm by a
completely equivalent, but much faster one.  Such options exist in sorting,
for instance.  The new code may be more complicated, and may require more
storage and some preprocessing.

In our case, there is not nearly so much known or available.  We are willing
to accept an approximate answer.  We are willing to do some preprocessing
and to set up some extra arrays.  We very much want to reduce the cost
of the computation from O(NG) per point to O(1), O(sqrt(NG)), or something
like that.

\begin{thebibliography}{99}

\bibitem{Bentley} The Jon Bentley paper.

\bibitem{Others} Some of those articles I found.

\end{thebibliography}

\end{document}