Re: Re: Re: Re: aggregation of related elements in a list

*To*: mathgroup at smc.vnet.net*Subject*: [mg61867] Re: [mg61793] Re: [mg61762] Re: [mg61730] Re: aggregation of related elements in a list*From*: "Carl K. Woll" <carl at woll2woll.com>*Date*: Wed, 2 Nov 2005 04:09:38 -0500 (EST)*References*: <djcgcs$6du$1@smc.vnet.net> <djfnjr$b1i$1@smc.vnet.net> <200510270902.FAA19490@smc.vnet.net> <200510280725.DAA08769@smc.vnet.net> <200510300443.AAA09785@smc.vnet.net>*Sender*: owner-wri-mathgroup at wolfram.com

danl at wolfram.com wrote: >>[From Carl Woll:] >>I decided to compare my approach and another which I thought up >>previously with Maxim's approach in a bit more detail below. My >>conclusion was that it should be possible to do things quite a bit >>quicker using SparseArray objects instead of using the >>ConnectedComponents function from Combinatorica. >> >>Maxim wrote: >> >>>On Sun, 23 Oct 2005 10:11:07 +0000 (UTC), Carl K. Woll >>><carlw at u.washington.edu> wrote: >>> >>> >>> >>>>leigh pascoe wrote: >>>> >>>> >>>>>Dear Mathgroup, >>>>> >>>>>I would like to construct a function to transform a list of integers >>>>>into a list of lists of related members. Suppose we have the finite >>>>>list >>>>> >>>>>{x1,x2,x3,..........xN} >>>>> >>>>>and a test function ( relatedQ[x_,y_,n_] ) that decides whether two >>>>>elements are related. The actual test may vary, but returns True or >>>>>False for any two elements as a function of n (n is an independent >>>>>variable not the number of elements in the list}. >>>>> >>>>>I would like to process this list to aggregate those members that are >>>>>related either directly to each other or to a common element. For >>>>>example if x1 is unrelated to any other element in the list, x2 is >>>>>related to x5 but not x7 and x5 is related to x7 etc. would produce the >>>>>list of lists >>>>> >>>>>{{x1},{x2,x5,x7),.......} >>>>> >>>>>To take a specific example, assume we have the list >>>>> >>>>>list={1,2,6,7,8,9,11,16} >>>>> >>>>>and that application of the relatedQ, test to all possible pairs yields >>>>>the list of related pairs >>>>> >>>>>prlist={{1,2},{1,6},{2,6},{2,7},{6,7},{6,11},{7,8},{7,11},{11,16}} >>>>> >>>>>Using the list and the list of related pairs we can deduce the desired >>>>>list of lists as >>>>> >>>>>{{1,2,6,7,8,11,16},{9}} >>>>> >>>> >>>>Here's one suboptimal method >>>>[...] >>>>Carl Woll >>>>Wolfram Research > > >>>[From Maxim Rytin:] >>>If I understand the problem correctly, this amounts to finding connected >>>components in a graph -- there was a similar question asked on MathGroup >>>a >>>while ago. >>> >>>In[1]:= >>><<discretemath` >>>agg[L_, $Lpair_] := Module[ >>> {Lpair = $Lpair, LLind}, >>> Lpair = Lpair /. Thread[L -> Range@ Length@ L]; >>> LLind = ConnectedComponents[ >>> FromUnorderedPairs[Lpair, CircularEmbedding[Length@ L]]]; >>> Extract[L, List /@ #]& /@ LLind >>>] >>> >>>In[3]:= >>>agg[{1, 2, 6, 7, 8, 9, 11, 16}, >>> {{1, 2}, {1, 6}, {2, 6}, {2, 7}, {6, 7}, >>> {6, 11}, {7, 8}, {7, 11}, {11, 16}}] >>> >>>Out[3]= >>>{{1, 2, 6, 7, 8, 11, 16}, {9}} >>> >>>In[4]:= >>>agg2[L_, Lpair_] := Module[ >>> {n = Max@ L, sp1, sp2, t}, >>> sp1 = SparseArray[Thread[Lpair -> 1], {n, n}]; >>> sp2 = SparseArray[{#, #} -> 1& /@ L, {n, n}]; >>> t = sp1 + sp2 + Transpose@ sp1; >>> Flatten@ Position[#, 1]& /@ Union@ Normal@ >>> FixedPoint[Sign[#.#]&, t, SameTest -> Equal] // >>> DeleteCases[#, {}]& >>>] >>> >>>In[5]:= >>>n = 500; >>>pairs = Array[Random[Integer, {1, n}]&, {n, 2}]; >>>(ans = agg[Range@ n, pairs];) // Timing >>>(ans2 = agg2[Range@ n, pairs];) // Timing >>>Sort@ ans === Sort@ ans2 >>> >>>Out[7]= {0.062 Second, Null} >>> >>>Out[8]= {2.469 Second, Null} >>> >>>Out[9]= True >>>[...] >>>Maxim Rytin >>>m.r at inbox.ru > > >>[From Carl Woll:] >>After I proposed my solution, I realized that for input which creates >>large components my solution was very inefficient, as you state above. A >>simple remedy is to work with a single element, and find each component >>by using matrix vector multiplication. In fact, I believe something like >>this is exactly what ConnectedComponents from Combinatorica does. >> >>It still seems to me that using SparseArray objects ought to be quicker. >>Here is your function, modified to assume that the n elements are the >>integers from 1 to n, as suggested by Danny Lichtblau: >> >>agg0[n_, Lpair_] := Module[{LLind}, >> LLind = ConnectedComponents[ >> FromUnorderedPairs[Lpair, CircularEmbedding[n]]]; >> Sort[Extract[Range[n], List /@ #] & /@ LLind]] >> >>Here is my original idea, modified to use SameTest->Equal as you mention >>above: >> >>agg1[n_, pairs_] := Module[{sp, t}, >> sp = SparseArray[Thread[pairs -> 1], {n, n}]; >> t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]]; >> Union[nonzeros /@ FixedPoint[Sign[#.#]& , t, SameTest -> Equal]] >> ] >> >>I use the following helper function: >> >>nonzeros[a_SparseArray] := a /. SparseArray[_, _, _, x_] :> >> Flatten[x[[2, 2]]] >> >>Finally, here is a function which finds each component one at a time to >>avoid SparseArray objects which aren't really sparse. Basically, the >>algorithm of ConnectedComponents using SparseArray objects: >> >>agg2[n_, pairs_] := Module[{sp, t, candidates, nextcomp}, >> sp = SparseArray[Thread[pairs -> 1], {n, n}]; >> t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]]; >> candidates = Range[n]; >> Sort[ >> Reap[ >> While[ >> candidates != {}, >> v = SparseArray[candidates[[1]] -> 1, n]; >> nextcomp = nonzeros[FixedPoint[Sign[t.#1]&,v,SameTest->Equal]]; >> candidates = Complement[candidates, Sow[nextcomp]]; >> ]; >> ][[2, 1]] >> ] >> ] >> >>Let's do some comparisons: >> >>In[56]:= >>n=100; >>p=Table[Random[Integer,{1,n}],{n},{2}]; >> >>In[62]:= >>r1=agg0[n,p];//Timing >>r2=agg1[n,p];//Timing >>r3=agg2[n,p];//Timing >>r1===r2===r3 >> >>Out[62]= >>{0. Second,Null} >> >>Out[63]= >>{0.031 Second,Null} >> >>Out[64]= >>{0. Second,Null} >> >>Out[65]= >>True >> >>Good they all agree. Now, for a larger sample set: >> >>In[69]:= >>n=3000; >>p=Table[Random[Integer,{1,n}],{n},{2}]; >> >>In[71]:= >>r1=agg0[n,p];//Timing >>(*r2=agg1[n,p];//Timing*) >>r3=agg2[n,p];//Timing >>r1===r3 >> >>Out[71]= >>{1.015 Second,Null} >> >>Out[72]= >>{0.282 Second,Null} >> >>Out[73]= >>True >> >>We see that agg2 is about 4 times faster than agg0 here. I don't bother >>to try agg1 because it probably will never finish. Some statistics on >>the sample set: >> >>In[76]:= >>{Max[Length/@r1],Length[r1]} >> >>Out[76]= >>{2410,473} >>[...] >>[...] >>Carl Woll >>Wolfram Research > > > If I admit Floyd-Warshall was overkill and far too slow for this problem, > can I play again? > > For set of m pairs giving connections between n elements, the code below > has complexty O(n+m*log(m)). One can get rid of the log factor as the > sorting is not strictly needed, but in practice it is not the bottleneck > and actually tends to improve speed. > > aggregate[n_, pairs_] := Module[ > {hh, aggs, kk, ll, mm, spairs = Sort[Map[Sort, pairs]], fm}, > aggs = Map[hh, Range[n]]; > Do[ > {kk, mm} = spairs[[j]]; > ll = First[hh[kk]]; > fm = First[hh[mm]]; > If[fm === mm, > hh[mm] = hh[ll], > If[ll < fm, > hh[mm] = hh[ll]; > hh[First[hh[fm]]] = hh[ll], > hh[ll] = hh[mm]; ll = fm]]; > , {j, Length[spairs]}]; > Last[Reap[Do[ll = hh[j]; Sow[j, ll], {j, n}]]] > ] > > The idea is to use a "marker function" (hh) to record the current smallest > element related to the one in question. Note that if I set hh[7] tp hh[3], > and later learn that hh[3] should become hh[1] because element 3 is found > to be related to element 1, then the assignment hh[3]=hh[1] suffices to > change it everywhere hh[3] may be found. So there is no backtracing > needed. We loop once over the pairs to create associations, and once over > the list elements to put associated elements into buckets. > > n = 1000; > pairs = Table[Random[Integer, {1, n}], {n}, {2}]; > > In[65]:= Sort[agg0[n,pairs]]===Sort[agg2[n,pairs]]=== > Sort[aggregate[n,pairs]] > Out[65]= True > > We check the speed: > > n = 10000; > pairs = Table[Random[Integer, {1, n}], {n}, {2}]; > > In[59]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];], > Timing[aggregate[n,pairs];]} > Out[59]= {{4.593 Second,Null},{1.297 Second,Null}, > {0.297 Second,Null}} > > Now double the size: > > n = 20000; > pairs = Table[Random[Integer, {1, n}], {n}, {2}]; > > In[62]:= {Timing[agg0[n,pairs];],Timing[agg2[n,pairs];], > Timing[aggregate[n,pairs];]} > Out[62]= {{28.609 Second,Null},{5.657 Second,Null}, > {0.656 Second,Null}} > > > Daniel Lichtblau > Wolfram Research > Daniel Lichtblau's algorithm is very nice, and the scaling is close to optimal. However, I still believe that a SparseArray based approach might be quicker. The algorithms used to manipulate SparseArray objects are highly optimized, so tapping into these extremely quick algorithms can be very useful. At any rate, I revised my old algorithm as follows: 1. Construct a transition matrix t. 2. While the matrix t is not empty: a. Purge any components which already have more than 15 elements. That is, if any row has more than 15 elements, find all elements related to this element by finding a fixed point using matrix vector dot products. b. Square t c. Purge any components which are complete. That is, if any rows were unchanged after squaring, delete them. d. Repeat Here is the code: agg[n_, pairs_] := Module[{sp, t, rowcounts, oldrowcounts, component, complete, tmp}, sp = SparseArray[Thread[pairs -> 1], {n, n}]; t = Sign[sp + SparseArray[{i_,i_}->1, {n, n}] + Transpose[sp]]; rowcounts = countrows[t]; Reap[ While[ Total[rowcounts] > 0 (*not empty*), While[Max[rowcounts] > 15 (*purge long components*), component = FixedPoint[ Sign[t . #1] & , t[[Ordering[rowcounts, -1][[1]]]] ]; Sow[nonzeros[component]]; t = sparsediagonal[1 - component] . t; rowcounts = countrows[t] ]; oldrowcounts = rowcounts; t = Sign[t . t]; (* square *) rowcounts = countrows[t]; complete = Sign[rowcounts] + Sign[oldrowcounts - rowcounts]; If[Total[complete] > 0, (* purge complete rows *) tmp = (t . sparsediagonal[Range[n]])[[nonzeros[complete]]]; Sow /@ Union[ List @@ tmp /. SparseArray[_, _, _, {__, x_}] :> x ]; t = sparsediagonal[1 - Sign[SparseArray[complete] . t]] . t; rowcounts = countrows[t]; ] ] ][[2, 1]] ] countrows[s_] := s /. SparseArray[_, _, _, {_, {x_, _}, _}] :> Differences[x] sparsediagonal[v_] := SparseArray[Table[{i, i}, {i, Length[v]}] -> Normal[v]] nonzeros[a_] := SparseArray[a] /. SparseArray[_, _, _, {_, {_, x_}, _}] :> Flatten[x] The helper function countrows counts the number of nonzero elements in each row of a sparse matrix, sparsediagonal converts a vector into a sparse matrix with the vector as diagonal elements, and nonzeros returns the positions of the nonzero elements in a vector. As an example where tapping into the sparse algorithms is helpful consider the following timings: v = Table[Random[Integer], {10^6}]; In[53]:= Timing[r1 = Flatten[Position[v, Except[0], 1, Heads -> False]]; ] Timing[r2 = nonzeros[v]; ] r1 === r2 Out[53]= {1.609 Second,Null} Out[54]= {0.047 Second,Null} Out[55]= True We see that nonzeros is over 30 times faster than using Position. Returning to the connected components problem, here is a test set, and comparisons of agg with agg3 (maxim) and aggregate (dan lichtblau): In[56]:= SeedRandom[1]; n =40000; pairs = Table[RandomInteger[{1,n}],{n},{2}]; In[59]:= r1=agg[n,pairs];//Timing r2=agg3[n,pairs];//Timing r3=aggregate[n,pairs];//Timing Sort[Sort/@r1]===Sort[Sort/@r2]===Sort[Sort/@r3] Out[59]= {1.219 Second,Null} Out[60]= {2.469 Second,Null} Out[61]= {2.25 Second,Null} Out[62]= True So, agg appears to be about twice as fast for this sort of data set. Carl Woll Wolfram Research