point_merge, a FORTRAN77 code which deals with the problem of counting or indexing the unique or "tolerably unique" points in a collection of N points in M dimensional space.
This problem is distinct from, though similar to, problems such as finding the nearest neighbor, or counting all the points that lie within a given distance of each point, or finding the optimal assignment of N points into K clusters (the K-Means problem).
The "tolerably unique" problem is the "Starbucks problem", that is, the task of choosing a list of Starbucks cafes to shut down, so that there is no Starbucks cafe across the street from another one. The Starbucks cafes that remain open are "tolerably unique", that is, there is now no other open cafe within the given tolerance.
Given sets of data with some points very close to each other, there are a number of ways of resolving the data. Here, a simpleminded approach is taken, in which we start with one tolerably unique point, and consider the remaining points one at a time, accepting the next point as long as it is not closer than the tolerance to some already accepted point.
This is a simpler approach than trying to maximize the number of points you can have in the set, while satisfying the tolerance, or of trying to replace two nearby points by their average, for instance.
For the unique case, in 1D, a simple and efficient procedure sorts the data, and then compares consecutive entries. For the unique case in multiple dimensions, the sorting procedure can still be used.
For the "tolerably unique" case in 1D, the same sorting procedure can be used, but in multiple dimensions, the usual kinds of lexicographic sorting will interleave near and far points in a way that is hard to deal with.
A reliable method for the tolerably unique case in multiple dimensions is simply to compute the distance between every pair of points. However, this is an O(N^2) computation, and becomes terribly unsuitable when the number of points considered is in the tens of thousands or more.
The "radial" approach, implemented in POINT_RADIAL_TOL_UNIQUE_COUNT, picks a random base point Z, computes the radial distance R(I) of each point P(I) to Z, and then sorts the data by R. It then counts tolerably unique items by inspecting the R array in order. Two points are possible neighbors only if they lie within a TOL interval in R. Assuming the points are in general position, the number of points that need to be compared will be small enough that this algorithm is essentially O(N) rather than O(N^2).
In MATLAB, the unique command can select the unique points; there is also a user-written function called consolidator that can merge points with a tolerance.
The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.
point_merge is available in a C version and a C++ version and a FORTRAN77 version and a FORTRAN90 version and a MATLAB version.
ann, a C++ library which computes Approximate Nearest Neighbors, by David Mount, Sunil Arya;
CITIES, a dataset directory which contains sets of information about cities and the distances between them;
CITIES, a FORTRAN90 library which handles various problems associated with a set of "cities" on a map.
KMEANS, a FORTRAN90 library which contains several different algorithms for the K-Means problem.
SPAETH, a FORTRAN90 library which can cluster data according to various principles.
SPAETH2, a FORTRAN90 library which can cluster data according to various principles.
TABLE_MERGE, a FORTRAN90 program which reads a file of N points in M dimensions, removes duplicates or points that are closer than some tolerance, and writes the reduced set of points to a file.