Re: Finding Clusters

*To*: mathgroup at smc.vnet.net*Subject*: [mg104698] Re: Finding Clusters*From*: Leonid Shifrin <lshifr at gmail.com>*Date*: Sat, 7 Nov 2009 06:47:58 -0500 (EST)*References*: <200911030751.CAA01018@smc.vnet.net>

Hi all, I came pretty late to the party, but let me add my voice to those delighted by Fred's solution and Szabolcs's improvement. These are two of the most brilliant and mind-blowing hacks I ever saw. I would just like to add to this, that let us not underestimate the power of Compile. Here is an implementation of weight-balancing path-compression union-find algorithm taken straight from the Sedgewick's book, with some minor changes as needed for adoption to Mathematica: Clear[getTree]; getTree = Compile[{{pairs, _Integer, 2}}, Module[{t = 0, i = 0, j = 0, xl = 0, yl = 0, k = 0, len = Max[pairs], dad = Table[0, {Max[pairs]}], present = Table[-1, {Max[pairs]}]}, For[k = 1, k <= Length[pairs], k++, xl = i = pairs[[k, 1]]; yl = j = pairs[[k, 2]]; present[[i]] = present[[j]] = 1; If[xl == yl, Continue[]]; While[dad[[i]] > 0, i = dad[[i]]]; While[dad[[j]] > 0, j = dad[[j]]]; While[dad[[xl]] > 0, t = xl; xl = dad[[xl]]; dad[[t]] = i]; While[dad[[yl]] > 0, t = yl; yl = dad[[yl]]; dad[[t]] = j]; If[i != j, If[dad[[j]] <= dad[[i]], If[dad[[i]] > 0, dad[[j]] = dad[[j]] + dad[[i]] - 1]; dad[[i]] = j;, (* else *) If[dad[[j]] > 0, (dad[[i]] = dad[[i]] + dad[[j]] - 1)]; dad[[j]] = i]; ]; ]; For[k = 1, k <= len, k++, If[dad[[k]] == 0, dad[[k]] = k]]; {dad, present}]]; Clear[getComponents]; getComponents[pairs_] := Module[{dad, present}, {dad, present} = getTree[pairs]; Select[ GatherBy[ Transpose[{FixedPoint[dad[[#]] &, dad], Range[Length[dad]]}], First][[All, All, 2]], Length[#] > 1 || present[[First@#]] == 1 &]]; It would need less code and run faster yet if we didn't allow same-vertex pairs like {1,1}. It is also limited by memory consumption - it allocates the list of dimension equal to the maximum vertex label, so it will be a waste if vertex set is a sparse array of large numbers. Also, vertex labels must be strictly positive. Both of the last two limitations can be dealt with by some sort of re-labeling - I did not include that. Here is Szabolcs's comparison where I added timing for my version and did some cosmetic changes (not that zero vertex numbers are not generated, unlike the original comparison of Szabolcs): In[1]:= Clear[res1,res2,res0,i0,i1,i2]; i1=0;i2=0;i0=0; In[2]:= data = Table[RandomInteger[{1,n},{Ceiling[a n],2}],{n,2^Range[11,16]}]; In[3]:= tc=(First@Timing[res0[i0++]=components[#]]&)/@data Out[3]= {0.032,0.078,0.297,1.156,4.547,19.562} In[4]:= tc2=(First@Timing[res1[i1++]=components2[#]]&)/@data Out[4]= {0.047,0.109,0.235,0.5,1.046,2.36} In[5]:= tc3=(First@Timing[res2[i2++]=getComponents[#]]&)/@data Out[5]= {0.,0.031,0.031,0.078,0.188,0.391} In[6]:= ListLogLogPlot[{tc,tc2,tc3},Joined->True,PlotMarkers->Automatic] In[7]:= res0[#]===res1[#]===res2[#]&/@Range[0,5] Out[7]= {True,True,True,True,True,True} Here <components> corresponds to Fred's solution, and <components2> to Szabolcs's. I have a feeling that the present compiled version and Szabolcs's one are doing similar type of things (at least in spirit), by attempting to flatten the emerging element trees at run-time. The speed-up achieved by compiled version I attribute to the large time constant associated with hash setting/lookup (assignments and DownValue definition lookup as compared to compiled version of array indexing), plus very high efficiency of Part when many elements are extracted at once - the fact that I exploit in the FixedPoint construct. Regards, Leonid On Fri, Nov 6, 2009 at 1:17 PM, Fred Simons <f.h.simons at tue.nl> wrote: > I feel I have to remark that with respect to me there is nothing > brilliant in the solution I posted. The idea behind it was the result of > a discussion in this group, many, many years ago, on a similar problem. > But anyway, it is a very beautiful result! > > The real brilliant thing is the improvement that was given by Szabolcs > Horvat in [mg104644]. > > Fred > > > DrMajorBob wrote: > > Brilliant as usual, Fred. > > > > I did think intervals only needed to overlap to "correspond", whereas > your > > solution requires them to share an end-point. > > > > For instance, in this example the first interval is a subset of the > second > > interval and overlaps with the third, yet "components' associates it with > > neither. > > > > event = {{1, 2} + 1/2, {1, 3}, {3, 4}, {5, 6}, {7, 8}, {8, 10}}; > > components@event > > > > {{1, 3, 4}, {3/2, 5/2}, {5, 6}, {7, 8, 10}} > > > > I can't really guess the OP's intent. > > > > Bobby > > > > On Wed, 04 Nov 2009 00:33:25 -0600, Fred Simons <f.h.simons at tue.nl> > wrote: > > > > > >> Here is a very short, very fast but not very simple solution: > >> > >> components[lst_List] := Module[{f}, > >> Do[Set @@ f /@ pair, {pair, lst}]; GatherBy[Union @@ lst, f]] > >> > >> Fred Simons > >> Eindhoven University of Technology > >> > >>> All. > >>> > >>> I have a list which represents some natural event. These events are > >>> listed pair-wise, which corresponds to event happening within certain > >>> time interval from each other, as below > >>> > >>> event={{1,2},{1,3},{3,4},{5,6},{7,8},{8,10}} > >>> > >>> I wish to find a thread of events, i.e. if event A is related to B and > >>> B to C, I wish to group {A,B,C} together. For the example above I > >>> would have > >>> > >>> {{1,2,3,4},{5,6},{7,8,10}} > >>> > >>> This would correspond to do a Graph Plot and identifying the parts > >>> which are disconnected It should be simple but I'm really finding it > >>> troublesome. > >>> > >>> > >>> > >>> > > > > > > > > >

**References**:**Finding Clusters***From:*fd <fdimer@gmail.com>