MathGroup Archive: August 2008 [00585]

[Date Index] [Thread Index] [Author Index]

Re: Re: Help to remove equivalent (redundant) solutions

To: mathgroup at smc.vnet.net
Subject: [mg91544] Re: [mg91500] Re: Help to remove equivalent (redundant) solutions
From: Daniel Lichtblau <danl at wolfram.com>
Date: Tue, 26 Aug 2008 03:29:58 -0400 (EDT)
References: <g8o858$l6a$1@smc.vnet.net> <200808241105.HAA15487@smc.vnet.net>

mark mcclure wrote:
> On Aug 23, 1:45 am, Bob Hanlon <hanl... at cox.net> wrote:
>> Use SameTest option in Union
> 
> I've had to think about this question at some length in my
> work with fractal geometry, where I deal with large numbers
> of points.  In this context, there are some issues with the
> use of SameTest.  In particular, Union[S] first sorts its
> elements in n*log(n) time and then compares in linear time.
> Union[S, SameTest -> f] cannot assume any ordering so it
> must compare pairwise, running in quadratic time.  If at all
> possible, it is much faster to express each element in S in
> some canonical form, then apply Union without the SameTest.
> I've got some examples below but would be very curious to
> know if anyone has improvements.
> 
> Here's a list of 400 pairs of points with each pair very
> close to another.  Note that Union works very quickly, but
> distinguishes between close pairs of points.
> 
> ------
> points = RandomReal[{0, 1}, {200, 2}];
> points = Join[points, points + 10^(-15)];
> Union[points] // Length // Timing
> 
> {0.000159, 400}
> ------
> 
> 
> We can apply Bob's solution which, correctly, treats close
> points the same but runs much longer.
> 
> ------
> Union[points, SameTest -> (Rationalize[#1, 10^(-10)] ===
>     Rationalize[#2, 10^(-10)] &)] // Length // Timing
> 
> {4.36683, 200}
> ------
> 
> 
> Note that  SameTest -> (Norm[#1 - #2] < 10^(-10) &) is
> faster, but we can get the best of both worlds by
> rationalizing first and then applying the Union.
> 
> -----
> Union[Rationalize[points, 10^(-10)]] // Length // Timing
> 
> {0.04629, 200}
> -----
> 
> 
> Of course, there are issues with Rationalize, or Round, as
> well.  First, you lose some precision in the output, which
> might or might not be important for your application.  Also,
> while it seems to work well with randomly generated data,
> it's easy to devise situations where points close together
> Rationalize to different results.  Thus, I'd be leary of
> using Rationalize in this manner for serious work.  What
> is needed is a reliable function that places things in a
> canonical form.  Here's one such function I developed based
> on clustering.
> 
> -----
> Needs["HierarchicalClustering`"];
> canonicalFunction[nonCanonicalValues_List] := Module[
>    {toCanonical},
>    Quiet[heirarchy = Agglomerate[N[nonCanonicalValues],
>       DistanceFunction -> EuclideanDistance,
>       Linkage -> "Average"]];
>    segregate[Cluster[cl1_, cl2_, d_, _, _], tol_] :=
> 
>     MyClusters[cl1, cl2] /; d > tol;
>    segregate[mine_MyClusters, tol_] := segregate[#, tol] & /@ mine;
>    segregate[x_, _] := x;
>    cf[cl_Cluster] := ClusterFlatten[cl];
>    cf[x_] := {x};
>    clusters =
>     cf /@ List @@ Flatten[FixedPoint[segregate[#, 10^-12] &,
>         MyClusters[heirarchy]]];
>    canonicalValues = Chop[First /@ clusters];
>    toCanonical[x_] := First[Nearest[canonicalValues][x]];
>    toCanonical];
> -----
> 
> We use it as follows.
> 
> -----
> toCan = canonicalFunction[points];
> Union[toCan /@ points] // Length // Timing
> 
> {0.138291, 200}
> -----
> 
> Again, I'd love to see alternatives.
> 
> Mark McClure


Here is a way to get a representative from an "equivalence" class, using 
Nearest to remove from consideration all elements within a given radius 
of that representative. It's fuzzy, and in a serious way: not only will 
the result depend on the order of the input points, but even the size of 
the result might depend on it. This is due to nontransitivity of the 
"equivalence" relation (so of course it is not a true equivalence 
relation). Anyway, the up side to this is in the speed. It will tend to 
be O(n log(n)), provided Nearest behaves itself.

removeDuplicatesInRange[points_, eps_] :=  Module[
   {keepers, len=Length[points], nearfunc, unusedlist, nearpts},
   nearfunc = Nearest[points];
   keepers = ConstantArray[False,len];
   Map [(unusedlist[#]=True)&, points];
   Do [
     If [TrueQ[unusedlist[points[[j]]]],
       keepers[[j]] = True;
       nearpts = Rest[nearfunc[points[[j]],{len,N[eps]}]];
	  Map [(unusedlist[#]=False)&, nearpts];
       ],
	{j,len}];
   Clear[usedlist];
   Pick[points, keepers]
   ]

Example:

len = 10^5;
disp = 10^(-4);
eps = 10^(-3);
points = RandomReal[1, {len,2}];
points = Join[points, points + RandomReal[disp, {len,2}],
   points - 3*RandomReal[disp, {len,2}]];

Timing[Length[np = removeDuplicatesInRange[points,eps]]]

This should run in ten seconds or so.

Daniel Lichtblau
Wolfram Research

References:
- Re: Help to remove equivalent (redundant) solutions from
  - From: mark mcclure <mcmcclur@unca.edu>

Prev by Date: Re: Eliminating intermediate results

Next by Date: Re: Eliminating intermediate results

Previous by thread: Re: Help to remove equivalent (redundant) solutions from

Next by thread: Partial differential equation with evolving boundary conditions