MathGroup Archive: November 2005 [00035]

[Date Index] [Thread Index] [Author Index]
Re: Re: Re: Re: aggregation of related elements in a list
To: mathgroup at smc.vnet.net
Subject: [mg61867] Re: [mg61793] Re: [mg61762] Re: [mg61730] Re: aggregation of related elements in a list
From: "Carl K. Woll" <carl at woll2woll.com>
Date: Wed, 2 Nov 2005 04:09:38 -0500 (EST)
References: <djcgcs$6du$1@smc.vnet.net> <djfnjr$b1i$1@smc.vnet.net> <200510270902.FAA19490@smc.vnet.net> <200510280725.DAA08769@smc.vnet.net> <200510300443.AAA09785@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com
danl at wolfram.com wrote:
>>[From Carl Woll:]
>>I decided to compare my approach and another which I thought up
>>previously with Maxim's approach in a bit more detail below. My
>>conclusion was that it should be possible to do things quite a bit
>>quicker using SparseArray objects instead of using the
>>ConnectedComponents function from Combinatorica.
>>
>>Maxim wrote:
>>
>>>On Sun, 23 Oct 2005 10:11:07 +0000 (UTC), Carl K. Woll
>>><carlw at u.washington.edu> wrote:
>>>
>>>
>>>
>>>>leigh pascoe wrote:
>>>>
>>>>
>>>>>Dear Mathgroup,
>>>>>
>>>>>I would like to construct a function to transform a list of integers
>>>>>into a list of lists of related members. Suppose we have the finite
>>>>>list
>>>>>
>>>>>{x1,x2,x3,..........xN}
>>>>>
>>>>>and a test function ( relatedQ[x_,y_,n_] ) that decides whether two
>>>>>elements are related. The actual test may vary, but returns True or
>>>>>False for any two elements as a function of n (n is an independent
>>>>>variable not the number of elements in the list}.
>>>>>
>>>>>I would like to process this list to aggregate those members that are
>>>>>related either directly to each other or to a common element. For
>>>>>example if x1 is unrelated to any other element in the list, x2 is
>>>>>related to x5 but not x7 and x5 is related to x7 etc. would produce the
>>>>>list of lists
>>>>>
>>>>>{{x1},{x2,x5,x7),.......}
>>>>>
>>>>>To take a specific example, assume we have the list
>>>>>
>>>>>list={1,2,6,7,8,9,11,16}
>>>>>
>>>>>and that application of the relatedQ, test to all possible pairs yields
>>>>>the list of related pairs
>>>>>
>>>>>prlist={{1,2},{1,6},{2,6},{2,7},{6,7},{6,11},{7,8},{7,11},{11,16}}
>>>>>
>>>>>Using the list and the list of related pairs we can deduce the desired
>>>>>list of lists as
>>>>>
>>>>>{{1,2,6,7,8,11,16},{9}}
>>>>>
>>>>
>>>>Here's one suboptimal method
>>>>[...]
>>>>Carl Woll
>>>>Wolfram Research
> 
> 
>>>[From Maxim Rytin:]
>>>If I understand the problem correctly, this amounts to finding connected
>>>components in a graph -- there was a similar question asked on MathGroup
>>>a
>>>while ago.
>>>
>>>In[1]:=
>>><<discretemath`
>>>agg[L_, $Lpair_] := Module[
>>>   {Lpair = $Lpair, LLind},
>>>   Lpair = Lpair /. Thread[L -> Range@ Length@ L];
>>>   LLind = ConnectedComponents[
>>>     FromUnorderedPairs[Lpair, CircularEmbedding[Length@ L]]];
>>>   Extract[L, List /@ #]& /@ LLind
>>>]
>>>
>>>In[3]:=
>>>agg[{1, 2, 6, 7, 8, 9, 11, 16},
>>>     {{1, 2}, {1, 6}, {2, 6}, {2, 7}, {6, 7},
>>>      {6, 11}, {7, 8}, {7, 11}, {11, 16}}]
>>>
>>>Out[3]=
>>>{{1, 2, 6, 7, 8, 11, 16}, {9}}
>>>
>>>In[4]:=
>>>agg2[L_, Lpair_] := Module[
>>>   {n = Max@ L, sp1, sp2, t},
>>>   sp1 = SparseArray[Thread[Lpair -> 1], {n, n}];
>>>   sp2 = SparseArray[{#, #} -> 1& /@ L, {n, n}];
>>>   t = sp1 + sp2 + Transpose@ sp1;
>>>   Flatten@ Position[#, 1]& /@ Union@ Normal@
>>>       FixedPoint[Sign[#.#]&, t, SameTest -> Equal] //
>>>     DeleteCases[#, {}]&
>>>]
>>>
>>>In[5]:=
>>>n = 500;
>>>pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
>>>(ans = agg[Range@ n, pairs];) // Timing
>>>(ans2 = agg2[Range@ n, pairs];) // Timing
>>>Sort@ ans === Sort@ ans2
>>>
>>>Out[7]= {0.062 Second, Null}
>>>
>>>Out[8]= {2.469 Second, Null}
>>>
>>>Out[9]= True
>>>[...]
>>>Maxim Rytin
>>>m.r at inbox.ru
> 
> 
>>[From Carl Woll:]
>>After I proposed my solution, I realized that for input which creates
>>large components my solution was very inefficient, as you state above. A
>>simple remedy is to work with a single element, and find each component
>>by using matrix vector multiplication. In fact, I believe something like
>>this is exactly what ConnectedComponents from Combinatorica does.
>>
>>It still seems to me that using SparseArray objects ought to be quicker.
>>Here is your function, modified to assume that the n elements are the
>>integers from 1 to n, as suggested by Danny Lichtblau:
>>
>>agg0[n_, Lpair_] := Module[{LLind},
>>   LLind = ConnectedComponents[
>>       FromUnorderedPairs[Lpair, CircularEmbedding[n]]];
>>   Sort[Extract[Range[n], List /@ #] & /@ LLind]]
>>
>>Here is my original idea, modified to use SameTest->Equal as you mention
>>above:
>>
>>agg1[n_, pairs_] := Module[{sp, t},
>>   sp = SparseArray[Thread[pairs -> 1], {n, n}];
>>   t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>>   Union[nonzeros /@ FixedPoint[Sign[#.#]& , t, SameTest -> Equal]]
>>   ]
>>
>>I use the following helper function:
>>
>>nonzeros[a_SparseArray] := a /. SparseArray[_, _, _, x_] :>
>>   Flatten[x[[2, 2]]]
>>
>>Finally, here is a function which finds each component one at a time to
>>avoid SparseArray objects which aren't really sparse. Basically, the
>>algorithm of ConnectedComponents using SparseArray objects:
>>
>>agg2[n_, pairs_] := Module[{sp, t, candidates, nextcomp},
>>   sp = SparseArray[Thread[pairs -> 1], {n, n}];
>>   t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>>   candidates = Range[n];
>>   Sort[
>>     Reap[
>>       While[
>>         candidates != {},
>>         v = SparseArray[candidates[[1]] -> 1, n];
>>         nextcomp = nonzeros[FixedPoint[Sign[t.#1]&,v,SameTest->Equal]];
>>         candidates = Complement[candidates, Sow[nextcomp]];
>>         ];
>>     ][[2, 1]]
>>    ]
>>   ]
>>
>>Let's do some comparisons:
>>
>>In[56]:=
>>n=100;
>>p=Table[Random[Integer,{1,n}],{n},{2}];
>>
>>In[62]:=
>>r1=agg0[n,p];//Timing
>>r2=agg1[n,p];//Timing
>>r3=agg2[n,p];//Timing
>>r1===r2===r3
>>
>>Out[62]=
>>{0. Second,Null}
>>
>>Out[63]=
>>{0.031 Second,Null}
>>
>>Out[64]=
>>{0. Second,Null}
>>
>>Out[65]=
>>True
>>
>>Good they all agree. Now, for a larger sample set:
>>
>>In[69]:=
>>n=3000;
>>p=Table[Random[Integer,{1,n}],{n},{2}];
>>
>>In[71]:=
>>r1=agg0[n,p];//Timing
>>(*r2=agg1[n,p];//Timing*)
>>r3=agg2[n,p];//Timing
>>r1===r3
>>
>>Out[71]=
>>{1.015 Second,Null}
>>
>>Out[72]=
>>{0.282 Second,Null}
>>
>>Out[73]=
>>True
>>
>>We see that agg2 is about 4 times faster than agg0 here. I don't bother
>>to try agg1 because it probably will never finish. Some statistics on
>>the sample set:
>>
>>In[76]:=
>>{Max[Length/@r1],Length[r1]}
>>
>>Out[76]=
>>{2410,473}
>>[...]
>>[...]
>>Carl Woll
>>Wolfram Research
> 
> 
> If I admit Floyd-Warshall was overkill and far too slow for this problem,
> can I play again?
> 
> For set of m pairs giving connections between n elements, the code below
> has complexty O(n+m*log(m)). One can get rid of the log factor as the
> sorting is not strictly needed, but in practice it is not the bottleneck
> and actually tends to improve speed.
> 
> aggregate[n_, pairs_] := Module[
>   {hh, aggs, kk, ll, mm, spairs = Sort[Map[Sort, pairs]], fm},
>   aggs = Map[hh, Range[n]];
>   Do[
>    {kk, mm} = spairs[[j]];
>    ll = First[hh[kk]];
>    fm = First[hh[mm]];
>    If[fm === mm,
>     hh[mm] = hh[ll],
>     If[ll < fm,
>      hh[mm] = hh[ll];
>      hh[First[hh[fm]]] = hh[ll],
>      hh[ll] = hh[mm]; ll = fm]];
>    , {j, Length[spairs]}];
>   Last[Reap[Do[ll = hh[j]; Sow[j, ll], {j, n}]]]
>   ]
> 
> The idea is to use a "marker function" (hh) to record the current smallest
> element related to the one in question. Note that if I set hh[7] tp hh[3],
> and later learn that hh[3] should become hh[1] because element 3 is found
> to be related to element 1, then the assignment hh[3]=hh[1] suffices to
> change it everywhere hh[3] may be found. So there is no backtracing
> needed. We loop once over the pairs to create associations, and once over
> the list elements to put associated elements into buckets.
> 
> n = 1000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
> 
> In[65]:= Sort[agg0[n,pairs]]===Sort[agg2[n,pairs]]===
>   Sort[aggregate[n,pairs]]
> Out[65]= True
> 
> We check the speed:
> 
> n = 10000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
> 
> In[59]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];],
>   Timing[aggregate[n,pairs];]}
> Out[59]= {{4.593 Second,Null},{1.297 Second,Null},
>   {0.297 Second,Null}}
> 
> Now double the size:
> 
> n = 20000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
> 
> In[62]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];],
>   Timing[aggregate[n,pairs];]}
> Out[62]= {{28.609 Second,Null},{5.657 Second,Null},
>   {0.656 Second,Null}}
> 
> 
> Daniel Lichtblau
> Wolfram Research
> 

Daniel Lichtblau's algorithm is very nice, and the scaling is close to 
optimal. However, I still believe that a SparseArray based approach 
might be quicker. The algorithms used to manipulate SparseArray objects 
are highly optimized, so tapping into these extremely quick algorithms 
can be very useful.

At any rate, I revised my old algorithm as follows:

1. Construct a transition matrix t.
2. While the matrix t is not empty:
a. Purge any components which already have more than 15 elements. That 
is, if any row has more than 15 elements, find all elements related to 
this element by finding a fixed point using matrix vector dot products.
b. Square t
c. Purge any components which are complete. That is, if any rows were 
unchanged after squaring, delete them.
d. Repeat

Here is the code:

agg[n_, pairs_] :=
  Module[{sp, t, rowcounts, oldrowcounts, component, complete, tmp},
     sp = SparseArray[Thread[pairs -> 1], {n, n}];
     t = Sign[sp + SparseArray[{i_,i_}->1, {n, n}] + Transpose[sp]];
     rowcounts = countrows[t];
     Reap[
     While[
      Total[rowcounts] > 0 (*not empty*),
      While[Max[rowcounts] > 15 (*purge long components*),
       component = FixedPoint[
         Sign[t . #1] & ,
         t[[Ordering[rowcounts, -1][[1]]]]
         ];
       Sow[nonzeros[component]];
       t = sparsediagonal[1 - component] . t;
       rowcounts = countrows[t]
       ];
      oldrowcounts = rowcounts;
      t = Sign[t . t]; (* square *)
      rowcounts = countrows[t];
      complete = Sign[rowcounts] + Sign[oldrowcounts - rowcounts];
      If[Total[complete] > 0, (* purge complete rows *)
       tmp = (t . sparsediagonal[Range[n]])[[nonzeros[complete]]];
       Sow /@ Union[
         List @@ tmp /. SparseArray[_, _, _, {__, x_}] :> x
         ];
       t = sparsediagonal[1 - Sign[SparseArray[complete] . t]] . t;
       rowcounts = countrows[t];
       ]
      ]
     ][[2, 1]]
   ]

countrows[s_] := s /. SparseArray[_, _, _, {_, {x_, _}, _}] :> 
Differences[x]

sparsediagonal[v_] := SparseArray[Table[{i, i}, {i, Length[v]}] -> 
Normal[v]]

nonzeros[a_] :=
  SparseArray[a] /. SparseArray[_, _, _, {_, {_, x_}, _}] :> Flatten[x]

The helper function countrows counts the number of nonzero elements in 
each row of a sparse matrix, sparsediagonal converts a vector into a 
sparse matrix with the vector as diagonal elements, and nonzeros returns 
the positions of the nonzero elements in a vector. As an example where 
tapping into the sparse algorithms is helpful consider the following 
timings:

v = Table[Random[Integer], {10^6}];

In[53]:=
Timing[r1 = Flatten[Position[v, Except[0], 1, Heads -> False]]; ]
Timing[r2 = nonzeros[v]; ]
r1 === r2

Out[53]=
{1.609 Second,Null}

Out[54]=
{0.047 Second,Null}

Out[55]=
True

We see that nonzeros is over 30 times faster than using Position.

Returning to the connected components problem, here is a test set, and 
comparisons of agg with agg3 (maxim) and aggregate (dan lichtblau):

In[56]:=
SeedRandom[1];
n =40000;
pairs = Table[RandomInteger[{1,n}],{n},{2}];

In[59]:=
r1=agg[n,pairs];//Timing
r2=agg3[n,pairs];//Timing
r3=aggregate[n,pairs];//Timing
Sort[Sort/@r1]===Sort[Sort/@r2]===Sort[Sort/@r3]

Out[59]=
{1.219 Second,Null}

Out[60]=
{2.469 Second,Null}

Out[61]=
{2.25 Second,Null}

Out[62]=
True

So, agg appears to be about twice as fast for this sort of data set.

Carl Woll
Wolfram Research
Prev by Date: Re: Re: MLPutRealList vs. sequence of MLPutDouble
Next by Date: help on kind of 'inverse exponential' function ?
Previous by thread: Re: Re: MLPutRealList vs. sequence of MLPutDouble
Next by thread: Re: aggregation of related elements in a list