Count or Index Unique or Tolerably Unique Points

**POINT_MERGE**
is a MATLAB library which
deals with the problem of counting or indexing the unique or
"tolerably unique" points in a collection of N points in
M dimensional space.

This problem is distinct from, though similar to, problems such as finding the nearest neighbor, or counting all the points that lie within a given distance of each point, or finding the optimal assignment of N points into K clusters (the K-Means problem).

The "tolerably unique" problem is the "Starbucks problem", that is, the task of choosing a list of Starbucks cafes to shut down, so that there is no Starbucks cafe across the street from another one. The Starbucks cafes that remain open are "tolerably unique", that is, there is now no other open cafe within the given tolerance.

Given sets of data with some points very close to each other, there are a number of ways of resolving the data. Here, a simpleminded approach is taken, in which we start with one tolerably unique point, and consider the remaining points one at a time, accepting the next point as long as it is not closer than the tolerance to some already accepted point.

This is a simpler approach than trying to maximize the number of points you can have in the set, while satisfying the tolerance, or of trying to replace two nearby points by their average, for instance.

For the unique case, in 1D, a simple and efficient procedure sorts the data, and then compares consecutive entries. For the unique case in multiple dimensions, the sorting procedure can still be used.

For the "tolerably unique" case in 1D, the same sorting procedure can be used, but in multiple dimensions, the usual kinds of lexicographic sorting will interleave near and far points in a way that is hard to deal with.

A reliable method for the tolerably unique case in multiple dimensions is simply to compute the distance between every pair of points. However, this is an O(N^2) computation, and becomes terribly unsuitable when the number of points considered is in the tens of thousands or more.

The "radial" approach, implemented in **POINT_RADIAL_TOL_UNIQUE_COUNT**,
picks a random base point Z, computes the radial distance R(I) of each point
P(I) to Z, and then sorts the data by R. It then counts tolerably unique
items by inspecting the R array in order. Two points are possible
neighbors only if they lie within a TOL interval in R. Assuming the
points are in general position, the number of points that need to be
compared will be small enough that this algorithm is essentially O(N)
rather than O(N^2).

In MATLAB, the **unique** command can select the unique points;
there is also a user-written function called **consolidator**
that can merge points with a tolerance.

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

**POINT_MERGE** is available in
a C version and
a C++ version and
a FORTRAN77 version and
a FORTRAN90 version and
a MATLAB version.

ANN, a C++ library which computes Approximate Nearest Neighbors, by David Mount, Sunil Arya;

ANN_TEST, a C++ program which uses ann to approximate the nearest neighbors of a set of points stored in a file;

CITIES, a dataset directory which contains sets of information about cities and the distances between them;

CITIES, a MATLAB library which handles various problems associated with a set of "cities" on a map.

KMEANS, a MATLAB library which contains several different algorithms for the K-Means problem.

SPAETH, a FORTRAN90 library which can cluster data according to various principles.

SPAETH2, a FORTRAN90 library which can cluster data according to various principles.

TABLE_MERGE, a MATLAB program which reads a file of N points in M dimensions, removes duplicates or points that are closer than some tolerance, and writes the reduced set of points to a file.

- i4_uniform.m, returns a pseudorandom I4 in a given range;
- i4_print.m, prints an I4VEC;
- point_radial_tol_unique_count.m, counts tolerably unique points.
- point_radial_tol_unique_index.m, indexes tolerably unique points.
- point_radial_unique_count.m, counts unique points.
- point_tol_unique_count.m, counts tolerably unique points.
- point_tol_unique_index.m, indexes tolerably unique points.
- point_unique_count.m, counts unique points.
- point_unique_index.m, indexes unique points.
- r8col_duplicates.m, generates an R8COL with some duplicate columns;
- r8col_sort_heap_index_a.m, computes an index vector to ascending sort an R8COL;
- r8mat_transpose_print.m, prints the transpose of an R8MAT;
- r8mat_transpose_print_some.m, prints some of the transpose of an R8MAT;
- r8mat_uniform_01.m, returns a unit pseudorandom R8MAT;
- r8vec_compare.m, compares the order of two R8VEC's;
- r8vec_print.m, prints an R8VEC;
- r8vec_sort_heap_index_a.m, computes an index vector to ascending sort an R8VEC;
- r8vec_uniform_01.m, returns a unit pseudorandom R8VEC.
- timestamp.m, prints the current YMDHMS date as a timestamp;

- point_merge_test.m, calls all the tests.
- point_merge_test01.m, tests uniqueness counting with no tolerance.
- point_merge_test02.m, tests uniqueness counting with a tolerance.
- point_merge_test03.m, compares timings for two uniqueness counters.
- point_merge_test04.m, tests uniqueness indexing with a tolerance.
- point_merge_test05.m, times uniqueness indexing with a tolerance.
- point_merge_test_output.txt, the output file.

You can go up one level to the MATLAB source codes.