%  outline_joe.tex
%  04 July 2001
%
\documentclass{article}

\title{REU Report Outline}
\author{Joe Koopmeiners}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
This is an outline of what your report should look like.  Your abstract
should summarize your work in two or three sentences.
\end{abstract}

\tableofcontents

\section{Introduction}

Your introduction should be an overview of the whole project, and a summary
of the report itself.  Roughly, you might try to write a paragraph 
here corresponding to each section that follows.

\section{The Clustering Problem}

Discuss some of the problems in which clustering is used.  For instance,
in biology, we believe that various species are more or less related;
we have some scientific basis for making the groupings.

In other cases, we may believe there are patterns in a set of data, but
we may not understand very much about the data at all.  Simply to
interpret the data, we want to organize it into groups.  Here you might
mention the genetic data we've been looking at, and the stock market data;
in both cases we know very little about the underlying structure.

This opens several questions, including how many groups you want to use;
how you define the groups (that is, how you pick a label or property that
the group members will share); how you decide which group to put an element
into; and whether, at the end, you can measure how good your grouping is.

\section{The Voronoi Tessellation}

Explain what a Voronoi Tessellation is.  The idea of ``tessellation'' should
be used for problems where the set of points is infinite.

Explain what a Centroidal Voronoi Tessellation is.  A picture or two might
make a difference.
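
As a starting point for that explanation, here is one standard way to write
the definitions.  The symbols are assumptions chosen for this outline, not
notation fixed by the project: $\Omega$ is the region being divided,
$z_1,\ldots,z_K$ are the generators, and $\rho(x) \ge 0$ is a density on
$\Omega$.  The Voronoi region of $z_i$ is the set of points at least as close
to $z_i$ as to any other generator,
\[
  V_i = \{\, x \in \Omega : \|x - z_i\| \le \|x - z_j\|
          \mbox{ for all } j \ne i \,\},
\]
and the centroid of $V_i$ is
\[
  z_i^* = \frac{\int_{V_i} x\,\rho(x)\,dx}{\int_{V_i} \rho(x)\,dx} .
\]
The tessellation is centroidal when $z_i = z_i^*$ for every $i$.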

We started out looking at points in the plane.  Now we want to consider
a finite set of points, in 1D, 2D, 3D, or perhaps in 14D.  For a finite
set of points, we use the word ``clustering'' instead of ``tessellation''.

Discuss how, for the tessellation problem, we give the generator a special
status, and how it's good for the generator and the centroid to be close.

Discuss how, for the clustering problem, we can sometimes use the average
of the data points as a special point; in other cases, the ``special point''
has to actually be one of the points in the clustering.
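
In symbols, a sketch of the two choices might read as follows; the names
$C_i$ (the $i$-th cluster), $|C_i|$ (its number of points), $\bar{x}_i$ and
$m_i$ are assumptions made for this outline.  The average is
\[
  \bar{x}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x ,
\]
which need not be a data point.  When the special point must be a member of
the cluster, one common choice (assumed here, and not necessarily the one
used in the project) is the member with the smallest total distance to the
others:
\[
  m_i = \arg\min_{y \in C_i} \sum_{x \in C_i} \|x - y\| .
\]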

\section{Energy}

Explain the concept of the energy of a tessellation (which is the sum
of the energy integrals over each separate Voronoi region) or 
a clustering (which is the sum of the energy sums over each separate cluster).

Explain how the energy integral or sum is based on the distance of the
points to the special point, and how, if the special point is both the
centroid (or average) and the generator of the tessellation (or cluster),
we get an energy minimization property.  (Some of this is discussed in
Professor Gunzburger's talk.)
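
One way to write these energies, with notation assumed for this outline
($V_i$ a Voronoi region, $C_i$ a cluster, $z_i$ its special point, $\rho$ a
density, and $p$ the power of the distance), is
\[
  E_{\rm tess} = \sum_{i=1}^{K} \int_{V_i} \rho(x)\,\|x - z_i\|^p\,dx ,
  \qquad
  E_{\rm clus} = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - z_i\|^p .
\]
The usual choice in the centroidal Voronoi literature is $p = 2$; taking
$p = 1$ gives a sum of plain distances.  For $p = 2$ the role of the average
can be seen from the identity
\[
  \sum_{x \in C_i} \|x - z\|^2
    = \sum_{x \in C_i} \|x - \bar{x}_i\|^2 + |C_i|\,\|z - \bar{x}_i\|^2 ,
\]
where $\bar{x}_i$ is the average of $C_i$ and $|C_i|$ its number of points.
The last term vanishes only when $z = \bar{x}_i$, so for a fixed cluster the
average is the special point of least energy.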

As an illustration of this, you might take a set of 20 points in the plane,
and break them up into 3 sets:

\begin{enumerate}
\item[A)] by choosing 3 points randomly, and then randomly adding the other
   points to one of the three sets;
\item[B)] by choosing 3 points randomly, and then adding each of the other
   points to whichever of the three is closest;
\item[C)] by running your clustering algorithm on the data.
\end{enumerate}

  Compute the energy of the three configurations, and explain that in this
case, it's equal to the sum of the lengths of the lines connecting the
special points to their group members.
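
To show the kind of comparison intended, here is a tiny made-up example
(four points on a line rather than twenty in the plane; the numbers are
invented purely for illustration).  Take the points $0$, $1$, $5$, $6$, split
them into two groups, and use the average of each group as its special
point.  For the grouping $\{0,1\}$, $\{5,6\}$ the averages are $0.5$ and
$5.5$, and every point lies $0.5$ from its special point.  For the grouping
$\{0,5\}$, $\{1,6\}$ the averages are $2.5$ and $3.5$, and every point lies
$2.5$ from its special point.  Summing the lengths gives
\[
  E(\{0,1\},\{5,6\}) = 4 \times 0.5 = 2 ,
  \qquad
  E(\{0,5\},\{1,6\}) = 4 \times 2.5 = 10 ,
\]
so the grouping that matches the visible pattern has the lower energy.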

\section{The Clustering Algorithm}

Write up the algorithm used to group a set of finite data into clusters.
Explain the problems of getting a starting point and of how the results 
can depend on the starting point.  Explain how the algorithm takes several
steps, and how you decide when to stop.  Show ``before'' and ``after'' plots
of your initial and final clusters, and compare the energies.
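
The outline does not say which algorithm was used.  As a sketch, assuming
the common $K$-means (Lloyd) iteration with the average as special point and
squared distance in the energy, the steps could be written as follows; the
symbols $z_i$, $C_i$, and $E$ are notation assumed for this outline.
\begin{enumerate}
\item Choose $K$ starting special points $z_1,\ldots,z_K$, for example $K$
      of the data points picked at random.
\item Assign each data point $x$ to the cluster $C_i$ whose special point is
      nearest, that is, to an $i$ that minimizes $\|x - z_i\|$.
\item Replace each $z_i$ by the average of the points now in $C_i$ (a cluster
      that has become empty needs a fresh special point).
\item Repeat steps 2 and 3 until no point changes cluster, or until the
      energy $E = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - z_i\|^2$ drops by
      less than a chosen tolerance.
\end{enumerate}
Neither step 2 nor step 3 can increase $E$, which is why the iteration
settles down, though a different start may lead to a different final
clustering.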

\section{The Energy of Random Data}

Here, explain that if we're going to look for clustering patterns, we need
to know what the energy looks like when the data has no pattern.  Explain
how the energy curve must decrease as we increase the number of clusters,
reaching zero for sure when every point has its own cluster.  
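
A short argument for this, with notation assumed for this outline: let
$E^*(K)$ be the least energy over all ways of splitting the $N$ points into
$K$ clusters, with the special points free to move, and let $p$ be the power
of the distance used in the energy.  Starting from a best splitting into
$K < N$ clusters, take a point $x$ out of a cluster $C$ that has at least two
points and give it a cluster of its own, whose energy is zero.  Keeping the
old special point $z$ for what is left of $C$ simply removes the term
$\|x - z\|^p \ge 0$ from the sum, so
\[
  E^*(K+1) \le E^*(K) - \|x - z\|^p \le E^*(K) ,
  \qquad
  E^*(N) = 0 .
\]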

Now we compute energy for random data in 1D, 2D, and so on and try to
see a pattern.  Explain how we guessed the shapes of the energy
curves in each dimension, then used MATLAB's least squares fitting
to get a formula, and then reran the problem with more points to check that
the formula fits better and better as the number of points grows.
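
The outline does not record which shapes were guessed.  As an idealized 1D
check (a sketch under stated assumptions, not the project's result), take
$N$ points spread uniformly over $[0,1]$ and $K$ clusters that are equal
intervals of width $h = 1/K$, each with its special point at the midpoint
$c$.  Over such an interval the average of $|x - c|$ is $h/4$ and the
average of $(x - c)^2$ is $h^2/12$, so for large $N$
\[
  E_1(K) \approx \frac{N}{4K} ,
  \qquad
  E_2(K) \approx \frac{N}{12K^2} ,
\]
where $E_1$ uses plain distance and $E_2$ uses squared distance.  More
generally, a power law $E(K) \approx a K^{-b}$ becomes a straight line after
taking logarithms,
\[
  \log E = \log a - b \log K ,
\]
so the constants $a$ and $b$ can be estimated by linear least squares.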

\section{The Energy of Clustered Data}

Now show the results for the simple clustered data cases.  Explain why
we are plotting the data ``backwards'', plotting $\frac{1}{K}$
or $\frac{1}{K^2}$ or $\frac{1}{K^3}$
versus Energy, so that we can expect to get a straight line.
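
The reason is a change of variable.  Suppose, as an assumption for
illustration, that the energy follows $E(K) \approx a/K^b$ for some constants
$a$ and $b$.  Setting $u = 1/K^b$ gives
\[
  E \approx a\,u ,
\]
a straight line through the origin with slope $a$, so any departure from
the line is easy to see by eye.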

Explain why the energy curve for clustered data is lower, and why it
eventually has to rejoin the random data curve.

Explain that this helps us to determine whether the data we are looking at
is random, or clustered, and even suggests the number of clusters we might
want to try.

\section{Results}

Perhaps here you can take the genetic data, and the stock market data, 
and try to answer the questions: Is this random data?  (No.)  What do we
expect the energy diagram to look like for random data in this space?
What do we see for this data?  Is there a number of clusters that is
reasonable for this data?

\section{Discussion}

We'll figure out what to say here later.  

\begin{thebibliography}{99}

\bibitem{?} We can look up some of Professor Gunzburger's papers.  We should
also try to hunt down any explanations for the energy minimizing property
of the Voronoi tessellation.

\end{thebibliography}

\end{document}