Re: Re: Finding Clusters

*To*: mathgroup at smc.vnet.net*Subject*: [mg104740] Re: [mg104644] Re: Finding Clusters*From*: Leonid Shifrin <lshifr at gmail.com>*Date*: Sun, 8 Nov 2009 06:48:10 -0500 (EST)*References*: <200911030751.CAA01018@smc.vnet.net> <hcr78s$8c4$1@smc.vnet.net>

Carl, another follow-up: I profiled my code and found that the main bottleneck is the built-in GatherBy function (who would think?). It happens to be sub-optimal for some problems and can be beaten in some cases (below is an example). With the following modifications: Clear[getTree, listSplit, gatherBy, getComponentsNew]; getTree = Compile[{{pairs, _Integer, 2}}, Module[{t = 0, i = 0, j = 0, xl = 0, yl = 0, k = 0, len = Max[pairs], dad = Table[0, {Max[pairs]}]}, For[k = 1, k <= Length[pairs], k++, xl = i = pairs[[k, 1]]; yl = j = pairs[[k, 2]]; If[xl == yl, Continue[]]; While[dad[[i]] > 0, i = dad[[i]]]; While[dad[[j]] > 0, j = dad[[j]]]; While[dad[[xl]] > 0, t = xl; xl = dad[[xl]]; dad[[t]] = i]; While[dad[[yl]] > 0, t = yl; yl = dad[[yl]]; dad[[t]] = j]; If[i != j, If[dad[[j]] <= dad[[i]], dad[[j]] += dad[[i]] - 1; dad[[i]] = j;, (*else*) (dad[[i]] += dad[[j]] - 1); dad[[j]] = i];];]; For[k = 1, k <= len, k++, If[dad[[k]] <= 0, dad[[k]] = k]]; dad]]; listSplit[x_List, lengths_List] := MapThread[Take[x, {##}] &, {Most[#], Rest[#] - 1}] &@ Accumulate[Prepend[lengths, 1]]; gatherBy[lst_List, flst_List] := listSplit[lst[[Ordering[flst]]], (Sort@Tally[flst])[[All, 2]]]; getComponentsNew[pairs_] := With[{dad = getTree[pairs]}, gatherBy[Range[Length[dad]], FixedPoint[dad[[#]] &, dad]]]; my code is now on par with yours in terms of performance: aggs[n_, pairs_] := Module[{sp, t}, sp = SparseArray[Thread[pairs -> 1], {n, n}]; t = Sign[sp + Transpose[sp]]; SparseArray`StronglyConnectedComponents[t]] In[1]:= Clear[resL, resC]; a = 0.7; trials = Table[ RandomInteger[{1, k = 2^n}, {Ceiling[a k], 2}], {n, 11, 19}]; In[2]:= testL = Table[With[{g = trials[[n]]}, {Length@g, First@Timing[resL[n] = getComponentsNew@g]}], {n, 1, Length@trials}]; In[3]:= testC = Table[With[{g = trials[[n]]}, {Length@g, First@Timing[resC[n] = aggs[Max[g], g]]}], {n, 1, Length@trials}]; In[4]:= testL Out[4]= {{1434, 0.01}, {2868, 0.03}, {5735, 0.05}, {11469, 0.14}, {22938, 0.241}, {45876, 0.42}, {91751, 0.992}, {183501, 2.093}, {367002, 4.256}} In[5]:= testC Out[5]= {{1434, 0.01}, {2868, 0.04}, {5735, 0.07}, {11469, 0.13}, {22938, 0.301}, {45876, 0.571}, {91751, 1.281}, {183501, 2.855}, {367002, 5.337}} In[6]:= Sort[Sort /@ resL[#]] === Sort[Sort /@ resC[#]] & /@ Range[Length[trials]] Out[6]= {True, True, True, True, True, True, True, True, True} I wish the compiler could somehow handle non-tensor structures like lists of lists, or that Reap-Sow were somehow compilable and fast when compiled - despite all my efforts, the most of the execution time in my code is spent on the dumb task of collecting elements with the same tag in a list (gatherBy), because I can not implement a compiled version. For very large lists I expect your code to start winning because some parts of mine are based on sorting which is N*log N rather than linear. Regards, Leonid 2009/11/7 Carl Woll <carlw at wolfram.com> > Szabolcs Horv=E1t wrote: > >> On 2009.11.04. 7:34, Fred Simons wrote: >> >> >>> Here is a very short, very fast but not very simple solution: >>> >>> components[lst_List] := Module[{f}, >>> Do[Set @@ f /@ pair, {pair, lst}]; GatherBy[Union @@ lst, f]] >>> >>> >>> >> >> I enjoyed Fred Simons's solution tremendously. >> >> I tried to speed it up a bit. >> >> I compared the speed of components[] with the speed of WeakComponents >> (from the GraphUtilities package) for random graphs (e.g. >> RandomInteger[50000, {30000, 2}]). It seems that components[] is faster >> than WeakComponents for as long as the graph doesn't have very large >> connected components. However, as soon as large connected components >> appear, components[] slows down a lot. >> >> I looked into the source of WeakComponents to find out how it works, but >> it turns out it uses undocumented functions, such as >> SparseArray`StronglyConnectedComponents >> >> The reason for the slowdown of components[] when large connected >> components are present is that the f[] function needs to be evaluated in >> several steps. E.g. for the graph {{1,2},{2,3},{3,4}}, the definition o= f f >> will include f[1]=f[2], f[2]=f[3], f[3]=f[4], so the evaluation of= f[1] will >> take 3 steps. >> >> I tried to remedy this by changing f so that it re-defines itself each >> time the left-hand-side of a particular definition can be evaluated furt= her. >> With the above example, evaluating f[1] would cause the definition of f= [1] >> to change from f[1]=f[2] to f[1]=f[4] (as f[2] evaluates to f[4]). = Here's >> the solution: >> >> setSpecial[lhs_, rhs_] /; rhs =!= lhs := >> (lhs := With[{val = #1}, lhs := #0[val]; val] &[rhs]) >> >> components2[lst_List] := >> Module[{f}, >> Do[setSpecial @@ f /@ pair, {pair, lst}]; >> GatherBy[Union @@ lst, f] >> ] >> >> This modified components2[] seems to be faster than WeakComponents[] eve= n >> for single-component random graphs, however, it is limited by >> $RecursionLimit (which can't be increased indefinitely without risking a >> crash) >> >> Szabolcs >> >> P.S. Here's the code I used to compare the speed of components[] and >> WeakComponents[]. For 'a' greater than about 0.5 components[] gets slow=

**References**:**Finding Clusters***From:*fd <fdimer@gmail.com>

**Re: Re: Finding Clusters**

**Re: Re: Passing function arguments as lists of**

**Re: Finding Clusters**

**Re: Finding Clusters**