Re: Re: Re: Re: aggregation of related elements in a list
- To: mathgroup at smc.vnet.net
- Subject: [mg61867] Re: [mg61793] Re: [mg61762] Re: [mg61730] Re: aggregation of related elements in a list
- From: "Carl K. Woll" <carl at woll2woll.com>
- Date: Wed, 2 Nov 2005 04:09:38 -0500 (EST)
- References: <djcgcs$6du$1@smc.vnet.net> <djfnjr$b1i$1@smc.vnet.net> <200510270902.FAA19490@smc.vnet.net> <200510280725.DAA08769@smc.vnet.net> <200510300443.AAA09785@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
danl at wolfram.com wrote:
>>[From Carl Woll:]
>>I decided to compare my approach and another which I thought up
>>previously with Maxim's approach in a bit more detail below. My
>>conclusion was that it should be possible to do things quite a bit
>>quicker using SparseArray objects instead of using the
>>ConnectedComponents function from Combinatorica.
>>
>>Maxim wrote:
>>
>>>On Sun, 23 Oct 2005 10:11:07 +0000 (UTC), Carl K. Woll
>>><carlw at u.washington.edu> wrote:
>>>
>>>
>>>
>>>>leigh pascoe wrote:
>>>>
>>>>
>>>>>Dear Mathgroup,
>>>>>
>>>>>I would like to construct a function to transform a list of integers
>>>>>into a list of lists of related members. Suppose we have the finite
>>>>>list
>>>>>
>>>>>{x1,x2,x3,..........xN}
>>>>>
>>>>>and a test function ( relatedQ[x_,y_,n_] ) that decides whether two
>>>>>elements are related. The actual test may vary, but returns True or
>>>>>False for any two elements as a function of n (n is an independent
>>>>>variable not the number of elements in the list}.
>>>>>
>>>>>I would like to process this list to aggregate those members that are
>>>>>related either directly to each other or to a common element. For
>>>>>example if x1 is unrelated to any other element in the list, x2 is
>>>>>related to x5 but not x7 and x5 is related to x7 etc. would produce the
>>>>>list of lists
>>>>>
>>>>>{{x1},{x2,x5,x7),.......}
>>>>>
>>>>>To take a specific example, assume we have the list
>>>>>
>>>>>list={1,2,6,7,8,9,11,16}
>>>>>
>>>>>and that application of the relatedQ, test to all possible pairs yields
>>>>>the list of related pairs
>>>>>
>>>>>prlist={{1,2},{1,6},{2,6},{2,7},{6,7},{6,11},{7,8},{7,11},{11,16}}
>>>>>
>>>>>Using the list and the list of related pairs we can deduce the desired
>>>>>list of lists as
>>>>>
>>>>>{{1,2,6,7,8,11,16},{9}}
>>>>>
>>>>
>>>>Here's one suboptimal method
>>>>[...]
>>>>Carl Woll
>>>>Wolfram Research
>
>
>>>[From Maxim Rytin:]
>>>If I understand the problem correctly, this amounts to finding connected
>>>components in a graph -- there was a similar question asked on MathGroup
>>>a
>>>while ago.
>>>
>>>In[1]:=
>>><<discretemath`
>>>agg[L_, $Lpair_] := Module[
>>> {Lpair = $Lpair, LLind},
>>> Lpair = Lpair /. Thread[L -> Range@ Length@ L];
>>> LLind = ConnectedComponents[
>>> FromUnorderedPairs[Lpair, CircularEmbedding[Length@ L]]];
>>> Extract[L, List /@ #]& /@ LLind
>>>]
>>>
>>>In[3]:=
>>>agg[{1, 2, 6, 7, 8, 9, 11, 16},
>>> {{1, 2}, {1, 6}, {2, 6}, {2, 7}, {6, 7},
>>> {6, 11}, {7, 8}, {7, 11}, {11, 16}}]
>>>
>>>Out[3]=
>>>{{1, 2, 6, 7, 8, 11, 16}, {9}}
>>>
>>>In[4]:=
>>>agg2[L_, Lpair_] := Module[
>>> {n = Max@ L, sp1, sp2, t},
>>> sp1 = SparseArray[Thread[Lpair -> 1], {n, n}];
>>> sp2 = SparseArray[{#, #} -> 1& /@ L, {n, n}];
>>> t = sp1 + sp2 + Transpose@ sp1;
>>> Flatten@ Position[#, 1]& /@ Union@ Normal@
>>> FixedPoint[Sign[#.#]&, t, SameTest -> Equal] //
>>> DeleteCases[#, {}]&
>>>]
>>>
>>>In[5]:=
>>>n = 500;
>>>pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
>>>(ans = agg[Range@ n, pairs];) // Timing
>>>(ans2 = agg2[Range@ n, pairs];) // Timing
>>>Sort@ ans === Sort@ ans2
>>>
>>>Out[7]= {0.062 Second, Null}
>>>
>>>Out[8]= {2.469 Second, Null}
>>>
>>>Out[9]= True
>>>[...]
>>>Maxim Rytin
>>>m.r at inbox.ru
>
>
>>[From Carl Woll:]
>>After I proposed my solution, I realized that for input which creates
>>large components my solution was very inefficient, as you state above. A
>>simple remedy is to work with a single element, and find each component
>>by using matrix vector multiplication. In fact, I believe something like
>>this is exactly what ConnectedComponents from Combinatorica does.
>>
>>It still seems to me that using SparseArray objects ought to be quicker.
>>Here is your function, modified to assume that the n elements are the
>>integers from 1 to n, as suggested by Danny Lichtblau:
>>
>>agg0[n_, Lpair_] := Module[{LLind},
>> LLind = ConnectedComponents[
>> FromUnorderedPairs[Lpair, CircularEmbedding[n]]];
>> Sort[Extract[Range[n], List /@ #] & /@ LLind]]
>>
>>Here is my original idea, modified to use SameTest->Equal as you mention
>>above:
>>
>>agg1[n_, pairs_] := Module[{sp, t},
>> sp = SparseArray[Thread[pairs -> 1], {n, n}];
>> t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>> Union[nonzeros /@ FixedPoint[Sign[#.#]& , t, SameTest -> Equal]]
>> ]
>>
>>I use the following helper function:
>>
>>nonzeros[a_SparseArray] := a /. SparseArray[_, _, _, x_] :>
>> Flatten[x[[2, 2]]]
>>
>>Finally, here is a function which finds each component one at a time to
>>avoid SparseArray objects which aren't really sparse. Basically, the
>>algorithm of ConnectedComponents using SparseArray objects:
>>
>>agg2[n_, pairs_] := Module[{sp, t, candidates, nextcomp},
>> sp = SparseArray[Thread[pairs -> 1], {n, n}];
>> t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>> candidates = Range[n];
>> Sort[
>> Reap[
>> While[
>> candidates != {},
>> v = SparseArray[candidates[[1]] -> 1, n];
>> nextcomp = nonzeros[FixedPoint[Sign[t.#1]&,v,SameTest->Equal]];
>> candidates = Complement[candidates, Sow[nextcomp]];
>> ];
>> ][[2, 1]]
>> ]
>> ]
>>
>>Let's do some comparisons:
>>
>>In[56]:=
>>n=100;
>>p=Table[Random[Integer,{1,n}],{n},{2}];
>>
>>In[62]:=
>>r1=agg0[n,p];//Timing
>>r2=agg1[n,p];//Timing
>>r3=agg2[n,p];//Timing
>>r1===r2===r3
>>
>>Out[62]=
>>{0. Second,Null}
>>
>>Out[63]=
>>{0.031 Second,Null}
>>
>>Out[64]=
>>{0. Second,Null}
>>
>>Out[65]=
>>True
>>
>>Good they all agree. Now, for a larger sample set:
>>
>>In[69]:=
>>n=3000;
>>p=Table[Random[Integer,{1,n}],{n},{2}];
>>
>>In[71]:=
>>r1=agg0[n,p];//Timing
>>(*r2=agg1[n,p];//Timing*)
>>r3=agg2[n,p];//Timing
>>r1===r3
>>
>>Out[71]=
>>{1.015 Second,Null}
>>
>>Out[72]=
>>{0.282 Second,Null}
>>
>>Out[73]=
>>True
>>
>>We see that agg2 is about 4 times faster than agg0 here. I don't bother
>>to try agg1 because it probably will never finish. Some statistics on
>>the sample set:
>>
>>In[76]:=
>>{Max[Length/@r1],Length[r1]}
>>
>>Out[76]=
>>{2410,473}
>>[...]
>>[...]
>>Carl Woll
>>Wolfram Research
>
>
> If I admit Floyd-Warshall was overkill and far too slow for this problem,
> can I play again?
>
> For set of m pairs giving connections between n elements, the code below
> has complexty O(n+m*log(m)). One can get rid of the log factor as the
> sorting is not strictly needed, but in practice it is not the bottleneck
> and actually tends to improve speed.
>
> aggregate[n_, pairs_] := Module[
> {hh, aggs, kk, ll, mm, spairs = Sort[Map[Sort, pairs]], fm},
> aggs = Map[hh, Range[n]];
> Do[
> {kk, mm} = spairs[[j]];
> ll = First[hh[kk]];
> fm = First[hh[mm]];
> If[fm === mm,
> hh[mm] = hh[ll],
> If[ll < fm,
> hh[mm] = hh[ll];
> hh[First[hh[fm]]] = hh[ll],
> hh[ll] = hh[mm]; ll = fm]];
> , {j, Length[spairs]}];
> Last[Reap[Do[ll = hh[j]; Sow[j, ll], {j, n}]]]
> ]
>
> The idea is to use a "marker function" (hh) to record the current smallest
> element related to the one in question. Note that if I set hh[7] tp hh[3],
> and later learn that hh[3] should become hh[1] because element 3 is found
> to be related to element 1, then the assignment hh[3]=hh[1] suffices to
> change it everywhere hh[3] may be found. So there is no backtracing
> needed. We loop once over the pairs to create associations, and once over
> the list elements to put associated elements into buckets.
>
> n = 1000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
>
> In[65]:= Sort[agg0[n,pairs]]===Sort[agg2[n,pairs]]===
> Sort[aggregate[n,pairs]]
> Out[65]= True
>
> We check the speed:
>
> n = 10000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
>
> In[59]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];],
> Timing[aggregate[n,pairs];]}
> Out[59]= {{4.593 Second,Null},{1.297 Second,Null},
> {0.297 Second,Null}}
>
> Now double the size:
>
> n = 20000;
> pairs = Table[Random[Integer, {1, n}], {n}, {2}];
>
> In[62]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];],
> Timing[aggregate[n,pairs];]}
> Out[62]= {{28.609 Second,Null},{5.657 Second,Null},
> {0.656 Second,Null}}
>
>
> Daniel Lichtblau
> Wolfram Research
>
Daniel Lichtblau's algorithm is very nice, and the scaling is close to
optimal. However, I still believe that a SparseArray based approach
might be quicker. The algorithms used to manipulate SparseArray objects
are highly optimized, so tapping into these extremely quick algorithms
can be very useful.
At any rate, I revised my old algorithm as follows:
1. Construct a transition matrix t.
2. While the matrix t is not empty:
a. Purge any components which already have more than 15 elements. That
is, if any row has more than 15 elements, find all elements related to
this element by finding a fixed point using matrix vector dot products.
b. Square t
c. Purge any components which are complete. That is, if any rows were
unchanged after squaring, delete them.
d. Repeat
Here is the code:
agg[n_, pairs_] :=
Module[{sp, t, rowcounts, oldrowcounts, component, complete, tmp},
sp = SparseArray[Thread[pairs -> 1], {n, n}];
t = Sign[sp + SparseArray[{i_,i_}->1, {n, n}] + Transpose[sp]];
rowcounts = countrows[t];
Reap[
While[
Total[rowcounts] > 0 (*not empty*),
While[Max[rowcounts] > 15 (*purge long components*),
component = FixedPoint[
Sign[t . #1] & ,
t[[Ordering[rowcounts, -1][[1]]]]
];
Sow[nonzeros[component]];
t = sparsediagonal[1 - component] . t;
rowcounts = countrows[t]
];
oldrowcounts = rowcounts;
t = Sign[t . t]; (* square *)
rowcounts = countrows[t];
complete = Sign[rowcounts] + Sign[oldrowcounts - rowcounts];
If[Total[complete] > 0, (* purge complete rows *)
tmp = (t . sparsediagonal[Range[n]])[[nonzeros[complete]]];
Sow /@ Union[
List @@ tmp /. SparseArray[_, _, _, {__, x_}] :> x
];
t = sparsediagonal[1 - Sign[SparseArray[complete] . t]] . t;
rowcounts = countrows[t];
]
]
][[2, 1]]
]
countrows[s_] := s /. SparseArray[_, _, _, {_, {x_, _}, _}] :>
Differences[x]
sparsediagonal[v_] := SparseArray[Table[{i, i}, {i, Length[v]}] ->
Normal[v]]
nonzeros[a_] :=
SparseArray[a] /. SparseArray[_, _, _, {_, {_, x_}, _}] :> Flatten[x]
The helper function countrows counts the number of nonzero elements in
each row of a sparse matrix, sparsediagonal converts a vector into a
sparse matrix with the vector as diagonal elements, and nonzeros returns
the positions of the nonzero elements in a vector. As an example where
tapping into the sparse algorithms is helpful consider the following
timings:
v = Table[Random[Integer], {10^6}];
In[53]:=
Timing[r1 = Flatten[Position[v, Except[0], 1, Heads -> False]]; ]
Timing[r2 = nonzeros[v]; ]
r1 === r2
Out[53]=
{1.609 Second,Null}
Out[54]=
{0.047 Second,Null}
Out[55]=
True
We see that nonzeros is over 30 times faster than using Position.
Returning to the connected components problem, here is a test set, and
comparisons of agg with agg3 (maxim) and aggregate (dan lichtblau):
In[56]:=
SeedRandom[1];
n =40000;
pairs = Table[RandomInteger[{1,n}],{n},{2}];
In[59]:=
r1=agg[n,pairs];//Timing
r2=agg3[n,pairs];//Timing
r3=aggregate[n,pairs];//Timing
Sort[Sort/@r1]===Sort[Sort/@r2]===Sort[Sort/@r3]
Out[59]=
{1.219 Second,Null}
Out[60]=
{2.469 Second,Null}
Out[61]=
{2.25 Second,Null}
Out[62]=
True
So, agg appears to be about twice as fast for this sort of data set.
Carl Woll
Wolfram Research