MathGroup Archive: November 2009 [00279]

[Date Index] [Thread Index] [Author Index]
Re: Finding Clusters
To: mathgroup at smc.vnet.net
Subject: [mg104698] Re: Finding Clusters
From: Leonid Shifrin <lshifr at gmail.com>
Date: Sat, 7 Nov 2009 06:47:58 -0500 (EST)
References: <200911030751.CAA01018@smc.vnet.net>
Hi all,

I came pretty late to the party, but let me add my voice to those delighted
by Fred's solution and Szabolcs's improvement. These are two of the most
brilliant and mind-blowing hacks I ever saw.

I would just like to add to this, that let us not underestimate the power of
Compile. Here is an implementation of  weight-balancing path-compression
union-find algorithm taken straight from the Sedgewick's book, with some
minor changes as needed for adoption to Mathematica:

Clear[getTree];
getTree =
  Compile[{{pairs, _Integer, 2}},
   Module[{t = 0, i = 0, j = 0, xl = 0, yl = 0, k = 0,
     len = Max[pairs], dad = Table[0, {Max[pairs]}],
     present = Table[-1, {Max[pairs]}]},
    For[k = 1, k <= Length[pairs], k++,
     xl = i = pairs[[k, 1]];
     yl = j = pairs[[k, 2]];
     present[[i]] = present[[j]] = 1;
     If[xl == yl, Continue[]];
     While[dad[[i]] > 0, i = dad[[i]]];
     While[dad[[j]] > 0, j = dad[[j]]];
     While[dad[[xl]] > 0, t = xl; xl = dad[[xl]]; dad[[t]] = i];
     While[dad[[yl]] > 0, t = yl; yl = dad[[yl]]; dad[[t]] = j];
     If[i != j,
      If[dad[[j]] <= dad[[i]],
        If[dad[[i]] > 0, dad[[j]] = dad[[j]] + dad[[i]] - 1];
        dad[[i]] = j;,
        (* else *)
        If[dad[[j]] > 0, (dad[[i]] = dad[[i]] + dad[[j]] - 1)];
        dad[[j]] = i];
      ];
     ];
    For[k = 1, k <= len, k++,
     If[dad[[k]] == 0, dad[[k]] = k]];
    {dad, present}]];


Clear[getComponents];
getComponents[pairs_] :=
  Module[{dad, present},
   {dad, present} = getTree[pairs];
   Select[
    GatherBy[
      Transpose[{FixedPoint[dad[[#]] &, dad], Range[Length[dad]]}],
      First][[All, All, 2]],
    Length[#] > 1 || present[[First@#]] == 1 &]];

It would need less code and run faster yet if we didn't allow same-vertex
pairs like {1,1}. It is also limited by memory consumption - it allocates
the list of dimension equal to the maximum vertex label, so it will be a
waste if vertex set is a sparse array of large numbers.  Also, vertex labels
must be strictly positive. Both of the last two limitations can be dealt
with by some sort of re-labeling  - I did not include that.

Here is Szabolcs's comparison where I added timing for my version and did
some cosmetic  changes (not that zero vertex numbers are not generated,
unlike  the original comparison of Szabolcs):

In[1]:=
Clear[res1,res2,res0,i0,i1,i2];
i1=0;i2=0;i0=0;

In[2]:=
data = Table[RandomInteger[{1,n},{Ceiling[a n],2}],{n,2^Range[11,16]}];

In[3]:=
tc=(First@Timing[res0[i0++]=components[#]]&)/@data

Out[3]= {0.032,0.078,0.297,1.156,4.547,19.562}

In[4]:=
tc2=(First@Timing[res1[i1++]=components2[#]]&)/@data

Out[4]= {0.047,0.109,0.235,0.5,1.046,2.36}

In[5]:=
tc3=(First@Timing[res2[i2++]=getComponents[#]]&)/@data

Out[5]= {0.,0.031,0.031,0.078,0.188,0.391}

In[6]:= ListLogLogPlot[{tc,tc2,tc3},Joined->True,PlotMarkers->Automatic]

In[7]:=
res0[#]===res1[#]===res2[#]&/@Range[0,5]

Out[7]= {True,True,True,True,True,True}

Here <components> corresponds to Fred's solution, and <components2> to
Szabolcs's.

I have a feeling that the present compiled version and Szabolcs's one are
doing similar type of things (at least in spirit), by attempting to flatten
the emerging element trees at run-time. The speed-up achieved by compiled
version I attribute to the large time constant associated with hash
setting/lookup (assignments and DownValue definition lookup as compared to
compiled version of array indexing), plus very high efficiency of Part when
many elements are extracted at once - the fact that I exploit in the
FixedPoint construct.

Regards,
Leonid




On Fri, Nov 6, 2009 at 1:17 PM, Fred Simons <f.h.simons at tue.nl> wrote:

> I feel I have to remark that with respect to me there is nothing
> brilliant in the solution I posted. The idea behind it was the result of
> a discussion in this group, many, many years ago, on a similar problem.
> But anyway, it is a very beautiful result!
>
> The real brilliant thing is the improvement that was given by Szabolcs
> Horvat in [mg104644].
>
> Fred
>
>
> DrMajorBob wrote:
> > Brilliant as usual, Fred.
> >
> > I did think intervals only needed to overlap to "correspond", whereas
> your
> > solution requires them to share an end-point.
> >
> > For instance, in this example the first interval is a subset of the
> second
> > interval and overlaps with the third, yet "components' associates it with
> > neither.
> >
> > event = {{1, 2} + 1/2, {1, 3}, {3, 4}, {5, 6}, {7, 8}, {8, 10}};
> > components@event
> >
> > {{1, 3, 4}, {3/2, 5/2}, {5, 6}, {7, 8, 10}}
> >
> > I can't really guess the OP's intent.
> >
> > Bobby
> >
> > On Wed, 04 Nov 2009 00:33:25 -0600, Fred Simons <f.h.simons at tue.nl>
> wrote:
> >
> >
> >> Here is a very short, very fast but not very simple solution:
> >>
> >> components[lst_List] := Module[{f},
> >>   Do[Set @@ f /@ pair, {pair, lst}];   GatherBy[Union @@ lst, f]]
> >>
> >> Fred Simons
> >> Eindhoven University of Technology
> >>
> >>> All.
> >>>
> >>> I have a list which represents some natural event. These events are
> >>> listed pair-wise, which corresponds to event happening within certain
> >>> time interval from each other, as below
> >>>
> >>> event={{1,2},{1,3},{3,4},{5,6},{7,8},{8,10}}
> >>>
> >>> I wish to find a thread of events, i.e. if event A is related to B and
> >>> B to C, I wish to group {A,B,C} together. For the example above I
> >>> would have
> >>>
> >>> {{1,2,3,4},{5,6},{7,8,10}}
> >>>
> >>> This would correspond to do a Graph Plot and identifying the parts
> >>> which are disconnected It should be simple but I'm really finding it
> >>> troublesome.
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
>
>
>
References:
- Finding Clusters
  - From: fd <fdimer@gmail.com>
Prev by Date: Re: How to make floating control elements?
Next by Date: Re: find subsets
Previous by thread: Re: Re: Finding Clusters
Next by thread: Re: Finding Clusters