MathGroup Archive: October 2005 [00936]

[Date Index] [Thread Index] [Author Index]
Re: Re: aggregation of related elements in a list
To: mathgroup at smc.vnet.net
Subject: [mg61799] Re: Re: aggregation of related elements in a list
From: Maxim <ab_def at prontomail.com>
Date: Sun, 30 Oct 2005 05:49:52 -0500 (EST)
References: <djcgcs$6du$1@smc.vnet.net> <djfnjr$b1i$1@smc.vnet.net> <200510270902.FAA19490@smc.vnet.net> <djsl5n$8qd$1@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com
On Fri, 28 Oct 2005 07:49:11 +0000 (UTC), Carl K. Woll <carlw at wolfram.com>  
wrote:

> I decided to compare my approach and another which I thought up
> previously with Maxim's approach in a bit more detail below. My
> conclusion was that it should be possible to do things quite a bit
> quicker using SparseArray objects instead of using the
> ConnectedComponents function from Combinatorica.
>
> Maxim wrote:
>> On Sun, 23 Oct 2005 10:11:07 +0000 (UTC), Carl K. Woll
>> <carlw at u.washington.edu> wrote:
>>
>>
>>> leigh pascoe wrote:
>>>
>>>> Dear Mathgroup,
>>>>
>>>> I would like to construct a function to transform a list of integers
>>>> into a list of lists of related members. Suppose we have the finite  
>>>> list
>>>>
>>>> {x1,x2,x3,..........xN}
>>>>
>>>> and a test function ( relatedQ[x_,y_,n_] ) that decides whether two
>>>> elements are related. The actual test may vary, but returns True or
>>>> False for any two elements as a function of n (n is an independent
>>>> variable not the number of elements in the list}.
>>>>
>>>> I would like to process this list to aggregate those members that are
>>>> related either directly to each other or to a common element. For
>>>> example if x1 is unrelated to any other element in the list, x2 is
>>>> related to x5 but not x7 and x5 is related to x7 etc. would produce  
>>>> the
>>>> list of lists
>>>>
>>>> {{x1},{x2,x5,x7),.......}
>>>>
>>>> To take a specific example, assume we have the list
>>>>
>>>> list={1,2,6,7,8,9,11,16}
>>>>
>>>> and that application of the relatedQ, test to all possible pairs  
>>>> yields
>>>> the list of related pairs
>>>>
>>>> prlist={{1,2},{1,6},{2,6},{2,7},{6,7},{6,11},{7,8},{7,11},{11,16}}
>>>>
>>>> Using the list and the list of related pairs we can deduce the desired
>>>> list of lists as
>>>>
>>>> {{1,2,6,7,8,11,16},{9}}
>>>>
>>>
>>> Here's one suboptimal method. I give steps tailored for your example.  
>>> It
>>> shouldn't be too difficult encapsulate these steps into a nice  
>>> function.
>>> Here are the steps. Construct a transition matrix:
>>>
>>> sp1 = SparseArray[Thread[prlist -> 1], {16, 16}];
>>> sp2 = SparseArray[{#, #} -> 1 & /@ list, {16, 16}];
>>>
>>> In[13]:=
>>> t = sp1 + sp2 + Transpose[sp1]
>>> Out[13]=
>>> SparseArray[<24>, {16, 16}]
>>>
>>> Multiply t by itself (and take it's sign) until the result doesn't
>>> change:
>>>
>>> In[14]:=
>>> FixedPoint[Sign[#.#]&,t]
>>> Out[14]=
>>> SparseArray[<50>, {16, 16}]
>>>
>>> The different rows of the fixed point are the different sets you are
>>> interested in:
>>>
>>> In[15]:=
>>> Union[Normal[%14]]
>>> Out[15]=
>>> {{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>>>
>>> {0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0},
>>>
>>> {1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1}}
>>>
>>> If you prefer a list of integer sets:
>>>
>>> In[16]:=
>>> Flatten[Position[#,1]]&/@%15
>>> Out[16]=
>>> {{}, {9}, {1, 2, 6, 7, 8, 11, 16}}
>>>
>>>
>>>> If 6 were not in the original list we would get instead
>>>>
>>>> {{1,2},{7,8,11,16},{9}}
>>>>
>>>
>>> Since your original pair list had the pair {2,7}, I think the result
>>> should
>>> still be {{1,2,7,8,11,16},{9}}. At any rate, to use the transition  
>>> matrix
>>> approach, you would need to delete all pairs that had elements not in
>>> your
>>> list, and then look for the fixed point. In your example with 6 deleted
>>> from
>>> list, you would also need to delete all 6s from prlist:
>>>
>>> newprlist = DeleteCases[prlist,{6,_}|{_,6}]
>>>
>>> and then proceed as before using newprlist instead of prlist. This  
>>> method
>>> does extra work based on how large your final integer sets are.  
>>> However,
>>> matrix multiplication of sparse matrices is extremely quick, so I think
>>> it
>>> ought to be at least competitive with other approaches.
>>>
>>> Carl Woll
>>> Wolfram Research
>>>
>>
>>
>> If I understand the problem correctly, this amounts to finding connected
>> components in a graph -- there was a similar question asked on  
>> MathGroup a
>> while ago.
>>
>> In[1]:=
>> <<discretemath`
>> agg[L_, $Lpair_] := Module[
>>    {Lpair = $Lpair, LLind},
>>    Lpair = Lpair /. Thread[L -> Range@ Length@ L];
>>    LLind = ConnectedComponents[
>>      FromUnorderedPairs[Lpair, CircularEmbedding[Length@ L]]];
>>    Extract[L, List /@ #]& /@ LLind
>> ]
>>
>> In[3]:=
>> agg[{1, 2, 6, 7, 8, 9, 11, 16},
>>      {{1, 2}, {1, 6}, {2, 6}, {2, 7}, {6, 7},
>>       {6, 11}, {7, 8}, {7, 11}, {11, 16}}]
>>
>> Out[3]=
>> {{1, 2, 6, 7, 8, 11, 16}, {9}}
>>
>> In[4]:=
>> agg2[L_, Lpair_] := Module[
>>    {n = Max@ L, sp1, sp2, t},
>>    sp1 = SparseArray[Thread[Lpair -> 1], {n, n}];
>>    sp2 = SparseArray[{#, #} -> 1& /@ L, {n, n}];
>>    t = sp1 + sp2 + Transpose@ sp1;
>>    Flatten@ Position[#, 1]& /@ Union@ Normal@
>>        FixedPoint[Sign[#.#]&, t, SameTest -> Equal] //
>>      DeleteCases[#, {}]&
>> ]
>>
>> In[5]:=
>> n = 500;
>> pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
>> (ans = agg[Range@ n, pairs];) // Timing
>> (ans2 = agg2[Range@ n, pairs];) // Timing
>> Sort@ ans === Sort@ ans2
>>
>> Out[7]= {0.062 Second, Null}
>>
>> Out[8]= {2.469 Second, Null}
>>
>> Out[9]= True
>>
>> The matrix multiplication method slows down significantly on the last
>> iterations of FixedPoint. For this choice of pairs there will typically  
>> be
>> one large connected subgraph with about 400 vertices, so the sparse  
>> array
>> will ultimately contain more than 400^2 non-zero elements and Dot won't  
>> be
>> as fast.
>>
>> A subtle point is that SameQ checks only the structural equivalence of
>> SparseArray expressions. A sparse array contains information about the
>> positions of the non-zero elements, but those positions can be stored  
>> out
>> of order -- it can be {{2}, {1}} as well as {{1}, {2}}, and then SameQ
>> will give False. Here's an example:
>>
>> agg2[{1, 2, 3, 4}, {{1, 4}, {4, 3}, {3, 1}}]
>>
>> If we use FixedPoint with SameTest -> Automatic, it will loop  
>> infinitely.
>>
>> Maxim Rytin
>> m.r at inbox.ru
>
> After I proposed my solution, I realized that for input which creates
> large components my solution was very inefficient, as you state above. A
> simple remedy is to work with a single element, and find each component
> by using matrix vector multiplication. In fact, I believe something like
> this is exactly what ConnectedComponents from Combinatorica does.
>
> It still seems to me that using SparseArray objects ought to be quicker.
> Here is your function, modified to assume that the n elements are the
> integers from 1 to n, as suggested by Danny Lichtblau:
>
> agg0[n_, Lpair_] := Module[{LLind},
>    LLind = ConnectedComponents[
>        FromUnorderedPairs[Lpair, CircularEmbedding[n]]];
>    Sort[Extract[Range[n], List /@ #] & /@ LLind]]
>
> Here is my original idea, modified to use SameTest->Equal as you mention
> above:
>
> agg1[n_, pairs_] := Module[{sp, t},
>    sp = SparseArray[Thread[pairs -> 1], {n, n}];
>    t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>    Union[nonzeros /@ FixedPoint[Sign[#.#]& , t, SameTest -> Equal]]
>    ]
>
> I use the following helper function:
>
> nonzeros[a_SparseArray] := a /. SparseArray[_, _, _, x_] :>
>    Flatten[x[[2, 2]]]
>
> Finally, here is a function which finds each component one at a time to
> avoid SparseArray objects which aren't really sparse. Basically, the
> algorithm of ConnectedComponents using SparseArray objects:
>
> agg2[n_, pairs_] := Module[{sp, t, candidates, nextcomp},
>    sp = SparseArray[Thread[pairs -> 1], {n, n}];
>    t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
>    candidates = Range[n];
>    Sort[
>      Reap[
>        While[
>          candidates != {},
>          v = SparseArray[candidates[[1]] -> 1, n];
>          nextcomp = nonzeros[FixedPoint[Sign[t.#1]&,v,SameTest->Equal]];
>          candidates = Complement[candidates, Sow[nextcomp]];
>          ];
>      ][[2, 1]]
>     ]
>    ]
>
> Let's do some comparisons:
>
> In[56]:=
> n=100;
> p=Table[Random[Integer,{1,n}],{n},{2}];
>
> In[62]:=
> r1=agg0[n,p];//Timing
> r2=agg1[n,p];//Timing
> r3=agg2[n,p];//Timing
> r1===r2===r3
>
> Out[62]=
> {0. Second,Null}
>
> Out[63]=
> {0.031 Second,Null}
>
> Out[64]=
> {0. Second,Null}
>
> Out[65]=
> True
>
> Good they all agree. Now, for a larger sample set:
>
> In[69]:=
> n=3000;
> p=Table[Random[Integer,{1,n}],{n},{2}];
>
> In[71]:=
> r1=agg0[n,p];//Timing
> (*r2=agg1[n,p];//Timing*)
> r3=agg2[n,p];//Timing
> r1===r3
>
> Out[71]=
> {1.015 Second,Null}
>
> Out[72]=
> {0.282 Second,Null}
>
> Out[73]=
> True
>
> We see that agg2 is about 4 times faster than agg0 here. I don't bother
> to try agg1 because it probably will never finish. Some statistics on
> the sample set:
>
> In[76]:=
> {Max[Length/@r1],Length[r1]}
>
> Out[76]=
> {2410,473}
>
> The largest component had 2410 elements (so agg1 will be horrible), but
> there were 473 components in total. Basically, one large component, and
> lots of little components. If we increase the density of the transition
> matrix:
>
> In[77]:=
> n=3000;
> p=Table[Random[Integer,{1,n}],{3n},{2}];
>
> In[79]:=
> r1=agg0[n,p];//Timing
> (*r2=agg1[n,p];//Timing*)
> r3=agg2[n,p];//Timing
> r1===r3
>
> Out[79]=
> {1.625 Second,Null}
>
> Out[80]=
> {0.094 Second,Null}
>
> Out[81]=
> True
>
> Now, agg2 is 16 times faster than agg0. Here the statistics are:
>
> In[82]:=
> {Max[Length/@r1],Length[r1]}
>
> Out[82]=
> {2991,10}
>
> So, we had one large component, and a few small components. Finally lets
>   decrease the density of the transition matrix:
>
> In[83]:=
> n=3000;
> p=Table[Random[Integer,{1,n}],{n/10},{2}];
>
> In[85]:=
> r1=agg0[n,p];//Timing
> r2=agg1[n,p];//Timing
> r3=agg2[n,p];//Timing
> r1===r2===r3
>
> Out[85]=
> {1.89 Second,Null}
>
> Out[86]=
> {0.094 Second,Null}
>
> Out[87]=
> {2.219 Second,Null}
>
> Out[88]=
> True
>
> This time we do try agg1, and agg1 is by far the fastest. Also, agg0 is
> now a bit faster than agg2. The statistics of the sample set:
>
> In[89]:=
> {Max[Length/@r1],Length[r1]}
>
> Out[89]=
> {6,2700}
>
> Since the largest component only had 6 elements, agg1 only needed to do
> 3 sparse multiplications ((t^2)^2)^2), and consequently it is much
> faster. We also notice that agg0 is faster than agg2, primarily because
> the overhead of setting up the SparseArrays 2700 times begins to take a
> toll.
>
> What this suggests is that it ought to be possible to combine the
> approach of agg1 and agg2 to do sparse matrix multiplications until the
> matrix is no longer sparse, and then switch over to matrix vector
> multiplication. Such a combined approach ought to be much quicker than
> one using ConnectedComponents. An algorithm to do so might be something
> like the following:
>
> 1. Evaluate (t^2)^2 etc. until the matrix is no longer 10% (or some
> other sparsity measure) sparse.
>
> 2. Find rows which did not change during the last iteration. These rows
> are the components which have already been discovered. Remember them,
> and remove them from the transition matrix.
>
> 3. Use sparse matrix vector multiplication as in agg2 to find all of the
> remaining large components.
>
> Steps 1 and 2 may be combined.
>
> Carl Woll
> Wolfram Research
>
>

ConnectedComponents has better asymptotics for sparse graphs. The  
connected components in a graph can be found in O(n + m) time, where n is  
the number of vertices and m is the number of edges. So for sparse graphs  
the time should grow linearly:

In[3]:=
agg3[n_, Lpair_] := Module[
   {dfs, LLadj, LvisitF, cc, cur, h, ans},
   dfs[v_] :=
     (LvisitF[[v]] = 1;
      cc = {cc, v};
      If[LvisitF[[#]] == 0, dfs[#]]& /@
        LLadj[[v]];);
   LLadj = Array[{}&, n];
   (LLadj[[#]] = {LLadj[[#]], #2};
    LLadj[[#2]] = {LLadj[[#2]], #}
   )& @@@ Lpair;
   LLadj = Flatten /@ LLadj;
   LvisitF = Array[0&, n];
   cur = 0;
   ans = h[];
   Block[{$RecursionLimit = Infinity},
     While[cur++ < n,
       If[LvisitF[[cur]] == 0,
         cc = {};
         dfs[cur];
         ans = h[ans, cc]
   ]]];
   List @@ Flatten /@ Flatten[ans, Infinity, h]
]

In[4]:=
n = 10000;
pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
t1 = {Timing[agg2[n, pairs];], Timing[agg3[n, pairs];]}[[All, 1]]
n = 20000;
pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
t2 = {Timing[agg2[n, pairs];], Timing[agg3[n, pairs];]}[[All, 1]]
t2/t1

Out[6]= {1.032*Second, 0.281*Second}

Out[9]= {4.578*Second, 0.547*Second}

Out[10]= {4.4360465, 1.9466192}

agg2 isn't up to par here, because it seems to perform as O(n^2) or worse.  
It does show linear behaviour when the number of edges is around 10*n  
(well, roughly linear, because Complement is already n*Log[n]). For dense  
graphs both methods are quadratic, but matrix multiplication is faster.

The ConnectedComponents routine from Combinatorica uses the same  
depth-first search algorithm, so it should have the same O(n + m)  
complexity, but it doesn't (partially due to abundant use of AppendTo).

A surprising discovery is that if we take agg3 and modify LvisitF to be a  
vector of False/True instead of 0/1, agg3 won't be linear time anymore. A  
list of booleans cannot be a packed array, but it is still an ordinary  
Mathematica list, which is basically a fixed size array of pointers, so  
theoretically it ought to have constant access time. The problem is that  
list elements may have upvalues associated with them, and in principle  
after an element of the list has changed the whole structure must be  
reevaluated, in case an upvalue associated with another element takes  
effect now. Mathematica uses various optimizations to avoid the  
reevaluation whenever possible, but it seems that for lists of symbols  
(including True/False) there's still some overhead.

Maxim Rytin
m.r at inbox.ru
References:
- Re: aggregation of related elements in a list
  - From: Maxim <ab_def@prontomail.com>
Prev by Date: Re: SymbolName question
Next by Date: Exporting notebooks to xml (cnxml)
Previous by thread: Re: Re: Re: Re: aggregation of related elements in a list
Next by thread: Apply and up/down value questions