MathGroup Archive 2009


Re: Re: Finding Clusters

  • To: mathgroup at smc.vnet.net
  • Subject: [mg104740] Re: [mg104644] Re: Finding Clusters
  • From: Leonid Shifrin <lshifr at gmail.com>
  • Date: Sun, 8 Nov 2009 06:48:10 -0500 (EST)
  • References: <200911030751.CAA01018@smc.vnet.net> <hcr78s$8c4$1@smc.vnet.net>

Carl,

another follow-up:

I profiled my code and found that the main bottleneck is the built-in
GatherBy function (who would have thought?). It turns out to be sub-optimal
for some problems and can be beaten in some cases (an example is below). With
the following modifications:


Clear[getTree, listSplit, gatherBy, getComponentsNew];

(* getTree builds a union-find ("dad") forest from the pair list:
   dad[[v]] > 0 is v's parent; dad[[v]] <= 0 marks a root, with the
   value used to weight the union of two trees. *)
getTree =
  Compile[{{pairs, _Integer, 2}},
   Module[{t = 0, i = 0, j = 0, xl = 0, yl = 0, k = 0,
     len = Max[pairs], dad = Table[0, {Max[pairs]}]},
    For[k = 1, k <= Length[pairs], k++,
     xl = i = pairs[[k, 1]];
     yl = j = pairs[[k, 2]];
     If[xl == yl, Continue[]];
     (* find the roots of both endpoints *)
     While[dad[[i]] > 0, i = dad[[i]]];
     While[dad[[j]] > 0, j = dad[[j]]];
     (* path compression: point the traversed nodes at the roots *)
     While[dad[[xl]] > 0, t = xl; xl = dad[[xl]]; dad[[t]] = i];
     While[dad[[yl]] > 0, t = yl; yl = dad[[yl]]; dad[[t]] = j];
     (* weighted union: attach the smaller tree to the larger one *)
     If[i != j,
      If[dad[[j]] <= dad[[i]],
       dad[[j]] += dad[[i]] - 1;
       dad[[i]] = j,
       (* else *)
       dad[[i]] += dad[[j]] - 1;
       dad[[j]] = i]]];
    (* turn every remaining root into a self-pointer *)
    For[k = 1, k <= len, k++, If[dad[[k]] <= 0, dad[[k]] = k]];
    dad]];

(* listSplit partitions x into consecutive blocks of the given lengths. *)
listSplit[x_List, lengths_List] :=
  MapThread[Take[x, {##}] &, {Most[#], Rest[#] - 1}] &@
   Accumulate[Prepend[lengths, 1]];

(* gatherBy groups the elements of lst by the corresponding tags in flst,
   using a single sort instead of the built-in GatherBy. *)
gatherBy[lst_List, flst_List] :=
  listSplit[lst[[Ordering[flst]]], (Sort@Tally[flst])[[All, 2]]];

(* getComponentsNew follows every vertex to its root and gathers vertices
   with the same root into one connected component. *)
getComponentsNew[pairs_] :=
  With[{dad = getTree[pairs]},
   gatherBy[Range[Length[dad]], FixedPoint[dad[[#]] &, dad]]];
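
As a quick sanity check, here is what the pieces do on a tiny hand-made pair
list (the example values are mine and are not part of the timings below):

pairs = {{1, 2}, {2, 3}, {5, 6}};

getTree[pairs]
(* {2, 2, 2, 4, 6, 6} -- the parent array; the roots 2, 4 and 6 point to
   themselves *)

getComponentsNew[pairs]
(* {{1, 2, 3}, {4}, {5, 6}} -- vertex 4 comes out as a singleton because the
   vertex range is taken to be 1 .. Max[pairs] = 6 *)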


With these changes in place, my code is now on par with yours in terms of
performance:

(* Carl's approach: build a symmetric (undirected) adjacency matrix and
   hand it to the undocumented SparseArray`StronglyConnectedComponents. *)
aggs[n_, pairs_] :=
 Module[{sp, t},
  sp = SparseArray[Thread[pairs -> 1], {n, n}];
  t = Sign[sp + Transpose[sp]];
  SparseArray`StronglyConnectedComponents[t]]
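
As a small cross-check (again on my own toy input rather than the benchmark
data), the two methods should agree once both the components and their
contents are sorted:

Sort[Sort /@ aggs[6, {{1, 2}, {2, 3}, {5, 6}}]] ===
 Sort[Sort /@ getComponentsNew[{{1, 2}, {2, 3}, {5, 6}}]]
(* expected: True, just like the In[6] comparison further down *)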


In[1]:= Clear[resL, resC];
a = 0.7;
trials = Table[
   RandomInteger[{1, k = 2^n}, {Ceiling[a k], 2}],
   {n, 11, 19}];

In[2]:= testL =
  Table[With[{g = trials[[n]]},
     {Length@g, First@Timing[resL[n] = getComponentsNew@g]}],
    {n, 1, Length@trials}];

In[3]:= testC =
  Table[With[{g = trials[[n]]},
     {Length@g, First@Timing[resC[n] = aggs[Max[g], g]]}],
    {n, 1, Length@trials}];

In[4]:= testL

Out[4]= {{1434, 0.01}, {2868, 0.03}, {5735, 0.05}, {11469, 0.14},
 {22938, 0.241}, {45876, 0.42}, {91751, 0.992}, {183501, 2.093}, {367002, 4.256}}

In[5]:= testC

Out[5]= {{1434, 0.01}, {2868, 0.04}, {5735, 0.07}, {11469, 0.13},
 {22938, 0.301}, {45876, 0.571}, {91751, 1.281}, {183501, 2.855}, {367002, 5.337}}

In[6]:=
Sort[Sort /@ resL[#]] === Sort[Sort /@ resC[#]] & /@
 Range[Length[trials]]

Out[6]= {True, True, True, True, True, True, True, True, True}

I wish the compiler could somehow handle non-tensor structures like lists of
lists, or that Reap-Sow were compilable and fast when compiled. Despite all my
efforts, most of the execution time in my code is still spent on the dumb task
of collecting elements with the same tag into lists (gatherBy), because I
cannot implement a compiled version of that step.
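
For what it's worth, the uncompiled Reap-Sow version of that gathering step
would look like the sketch below (my illustration, not code I actually use; it
returns the groups in order of first appearance of each tag, so it matches
gatherBy only up to ordering):

gatherBySow[lst_List, flst_List] :=
  Last@Reap[MapThread[Sow, {lst, flst}]]

gatherBySow[Range[6], {2, 2, 2, 4, 6, 6}]
(* {{1, 2, 3}, {4}, {5, 6}} *)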

For very large lists I expect your code to start winning, because some parts
of mine are based on sorting, which is O(N log N) rather than linear.


Regards,
Leonid



2009/11/7 Carl Woll <carlw at wolfram.com>

> Szabolcs Horvát wrote:
>
>> On 2009.11.04. 7:34, Fred Simons wrote:
>>
>>
>>> Here is a very short, very fast but not very simple solution:
>>>
>>> components[lst_List] := Module[{f},
>>>   Do[Set @@ f /@ pair, {pair, lst}];
>>>   GatherBy[Union @@ lst, f]]
>>>
>>>
>>>
>>
>> I enjoyed Fred Simons's solution tremendously.
>>
>> I tried to speed it up a bit.
>>
>> I compared the speed of components[] with the speed of WeakComponents
>> (from the GraphUtilities package) for random graphs (e.g.
>> RandomInteger[50000, {30000, 2}]).  It seems that components[] is faster
>> than WeakComponents for as long as the graph doesn't have very large
>> connected components.  However, as soon as large connected components
>> appear, components[] slows down a lot.
>>
>> I looked into the source of WeakComponents to find out how it works, but
>> it turns out it uses undocumented functions, such as
>> SparseArray`StronglyConnectedComponents
>>
>> The reason for the slowdown of components[] when large connected
>> components are present is that the f[] function needs to be evaluated in
>> several steps.  E.g. for the graph {{1,2},{2,3},{3,4}}, the definition of f
>> will include f[1]=f[2], f[2]=f[3], f[3]=f[4], so the evaluation of f[1]
>> will take 3 steps.
>>
>> I tried to remedy this by changing f so that it re-defines itself each
>> time the left-hand side of a particular definition can be evaluated
>> further.  With the above example, evaluating f[1] would cause the
>> definition of f[1] to change from f[1]=f[2] to f[1]=f[4] (as f[2]
>> evaluates to f[4]).  Here's the solution:
>>
>> setSpecial[lhs_, rhs_] /; rhs =!= lhs :=
>>   (lhs := With[{val = #1}, lhs := #0[val]; val] &[rhs])
>>
>> components2[lst_List] :=
>>  Module[{f},
>>   Do[setSpecial @@ f /@ pair, {pair, lst}];
>>   GatherBy[Union @@ lst, f]
>>  ]
>>
>> This modified components2[] seems to be faster than WeakComponents[] even
>> for single-component random graphs; however, it is limited by
>> $RecursionLimit (which can't be increased indefinitely without risking a
>> crash).
>>
>> Szabolcs
>>
>> P.S. Here's the code I used to compare the speed of components[] and
>> WeakComponents[].  For 'a' greater than about 0.5 components[] gets slow

