MathGroup Archive: October 2005 [00894]

[Date Index] [Thread Index] [Author Index]
Re: Re: aggregation of related elements in a list
To: mathgroup at smc.vnet.net
Subject: [mg61762] Re: [mg61730] Re: aggregation of related elements in a list
From: "Carl K. Woll" <carlw at wolfram.com>
Date: Fri, 28 Oct 2005 03:25:55 -0400 (EDT)
References: <djcgcs$6du$1@smc.vnet.net> <djfnjr$b1i$1@smc.vnet.net> <200510270902.FAA19490@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com
I decided to compare my approach and another which I thought up
previously with Maxim's approach in a bit more detail below. My
conclusion was that it should be possible to do things quite a bit
quicker using SparseArray objects instead of using the
ConnectedComponents function from Combinatorica.

Maxim wrote:
> On Sun, 23 Oct 2005 10:11:07 +0000 (UTC), Carl K. Woll  
> <carlw at u.washington.edu> wrote:
> 
> 
>>leigh pascoe wrote:
>>
>>>Dear Mathgroup,
>>>
>>>I would like to construct a function to transform a list of integers
>>>into a list of lists of related members. Suppose we have the finite list
>>>
>>>{x1,x2,x3,..........xN}
>>>
>>>and a test function ( relatedQ[x_,y_,n_] ) that decides whether two
>>>elements are related. The actual test may vary, but returns True or
>>>False for any two elements as a function of n (n is an independent
>>>variable not the number of elements in the list}.
>>>
>>>I would like to process this list to aggregate those members that are
>>>related either directly to each other or to a common element. For
>>>example if x1 is unrelated to any other element in the list, x2 is
>>>related to x5 but not x7 and x5 is related to x7 etc. would produce the
>>>list of lists
>>>
>>>{{x1},{x2,x5,x7),.......}
>>>
>>>To take a specific example, assume we have the list
>>>
>>>list={1,2,6,7,8,9,11,16}
>>>
>>>and that application of the relatedQ, test to all possible pairs yields
>>>the list of related pairs
>>>
>>>prlist={{1,2},{1,6},{2,6},{2,7},{6,7},{6,11},{7,8},{7,11},{11,16}}
>>>
>>>Using the list and the list of related pairs we can deduce the desired
>>>list of lists as
>>>
>>>{{1,2,6,7,8,11,16},{9}}
>>>
>>
>>Here's one suboptimal method. I give steps tailored for your example. It
>>shouldn't be too difficult encapsulate these steps into a nice function.
>>Here are the steps. Construct a transition matrix:
>>
>>sp1 = SparseArray[Thread[prlist -> 1], {16, 16}];
>>sp2 = SparseArray[{#, #} -> 1 & /@ list, {16, 16}];
>>
>>In[13]:=
>>t = sp1 + sp2 + Transpose[sp1]
>>Out[13]=
>>SparseArray[<24>, {16, 16}]
>>
>>Multiply t by itself (and take it's sign) until the result doesn't  
>>change:
>>
>>In[14]:=
>>FixedPoint[Sign[#.#]&,t]
>>Out[14]=
>>SparseArray[<50>, {16, 16}]
>>
>>The different rows of the fixed point are the different sets you are
>>interested in:
>>
>>In[15]:=
>>Union[Normal[%14]]
>>Out[15]=
>>{{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>>
>>{0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0},
>>
>>{1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1}}
>>
>>If you prefer a list of integer sets:
>>
>>In[16]:=
>>Flatten[Position[#,1]]&/@%15
>>Out[16]=
>>{{}, {9}, {1, 2, 6, 7, 8, 11, 16}}
>>
>>
>>>If 6 were not in the original list we would get instead
>>>
>>>{{1,2},{7,8,11,16},{9}}
>>>
>>
>>Since your original pair list had the pair {2,7}, I think the result  
>>should
>>still be {{1,2,7,8,11,16},{9}}. At any rate, to use the transition matrix
>>approach, you would need to delete all pairs that had elements not in  
>>your
>>list, and then look for the fixed point. In your example with 6 deleted  
>>from
>>list, you would also need to delete all 6s from prlist:
>>
>>newprlist = DeleteCases[prlist,{6,_}|{_,6}]
>>
>>and then proceed as before using newprlist instead of prlist. This method
>>does extra work based on how large your final integer sets are. However,
>>matrix multiplication of sparse matrices is extremely quick, so I think  
>>it
>>ought to be at least competitive with other approaches.
>>
>>Carl Woll
>>Wolfram Research
>>
> 
> 
> If I understand the problem correctly, this amounts to finding connected  
> components in a graph -- there was a similar question asked on MathGroup a  
> while ago.
> 
> In[1]:=
> <<discretemath`
> agg[L_, $Lpair_] := Module[
>    {Lpair = $Lpair, LLind},
>    Lpair = Lpair /. Thread[L -> Range@ Length@ L];
>    LLind = ConnectedComponents[
>      FromUnorderedPairs[Lpair, CircularEmbedding[Length@ L]]];
>    Extract[L, List /@ #]& /@ LLind
> ]
> 
> In[3]:=
> agg[{1, 2, 6, 7, 8, 9, 11, 16},
>      {{1, 2}, {1, 6}, {2, 6}, {2, 7}, {6, 7},
>       {6, 11}, {7, 8}, {7, 11}, {11, 16}}]
> 
> Out[3]=
> {{1, 2, 6, 7, 8, 11, 16}, {9}}
> 
> In[4]:=
> agg2[L_, Lpair_] := Module[
>    {n = Max@ L, sp1, sp2, t},
>    sp1 = SparseArray[Thread[Lpair -> 1], {n, n}];
>    sp2 = SparseArray[{#, #} -> 1& /@ L, {n, n}];
>    t = sp1 + sp2 + Transpose@ sp1;
>    Flatten@ Position[#, 1]& /@ Union@ Normal@
>        FixedPoint[Sign[#.#]&, t, SameTest -> Equal] //
>      DeleteCases[#, {}]&
> ]
> 
> In[5]:=
> n = 500;
> pairs = Array[Random[Integer, {1, n}]&, {n, 2}];
> (ans = agg[Range@ n, pairs];) // Timing
> (ans2 = agg2[Range@ n, pairs];) // Timing
> Sort@ ans === Sort@ ans2
> 
> Out[7]= {0.062 Second, Null}
> 
> Out[8]= {2.469 Second, Null}
> 
> Out[9]= True
> 
> The matrix multiplication method slows down significantly on the last  
> iterations of FixedPoint. For this choice of pairs there will typically be  
> one large connected subgraph with about 400 vertices, so the sparse array  
> will ultimately contain more than 400^2 non-zero elements and Dot won't be  
> as fast.
> 
> A subtle point is that SameQ checks only the structural equivalence of  
> SparseArray expressions. A sparse array contains information about the  
> positions of the non-zero elements, but those positions can be stored out  
> of order -- it can be {{2}, {1}} as well as {{1}, {2}}, and then SameQ  
> will give False. Here's an example:
> 
> agg2[{1, 2, 3, 4}, {{1, 4}, {4, 3}, {3, 1}}]
> 
> If we use FixedPoint with SameTest -> Automatic, it will loop infinitely.
> 
> Maxim Rytin
> m.r at inbox.ru

After I proposed my solution, I realized that for input which creates
large components my solution was very inefficient, as you state above. A
simple remedy is to work with a single element, and find each component
by using matrix vector multiplication. In fact, I believe something like
this is exactly what ConnectedComponents from Combinatorica does.

It still seems to me that using SparseArray objects ought to be quicker.
Here is your function, modified to assume that the n elements are the
integers from 1 to n, as suggested by Danny Lichtblau:

agg0[n_, Lpair_] := Module[{LLind},
   LLind = ConnectedComponents[
       FromUnorderedPairs[Lpair, CircularEmbedding[n]]];
   Sort[Extract[Range[n], List /@ #] & /@ LLind]]

Here is my original idea, modified to use SameTest->Equal as you mention
above:

agg1[n_, pairs_] := Module[{sp, t},
   sp = SparseArray[Thread[pairs -> 1], {n, n}];
   t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
   Union[nonzeros /@ FixedPoint[Sign[#.#]& , t, SameTest -> Equal]]
   ]

I use the following helper function:

nonzeros[a_SparseArray] := a /. SparseArray[_, _, _, x_] :>
   Flatten[x[[2, 2]]]

Finally, here is a function which finds each component one at a time to
avoid SparseArray objects which aren't really sparse. Basically, the
algorithm of ConnectedComponents using SparseArray objects:

agg2[n_, pairs_] := Module[{sp, t, candidates, nextcomp},
   sp = SparseArray[Thread[pairs -> 1], {n, n}];
   t = Sign[sp + SparseArray[{i_, i_} -> 1, {n, n}] + Transpose[sp]];
   candidates = Range[n];
   Sort[
     Reap[
       While[
         candidates != {},
         v = SparseArray[candidates[[1]] -> 1, n];
         nextcomp = nonzeros[FixedPoint[Sign[t.#1]&,v,SameTest->Equal]];
         candidates = Complement[candidates, Sow[nextcomp]];
         ];
     ][[2, 1]]
    ]
   ]

Let's do some comparisons:

In[56]:=
n=100;
p=Table[Random[Integer,{1,n}],{n},{2}];

In[62]:=
r1=agg0[n,p];//Timing
r2=agg1[n,p];//Timing
r3=agg2[n,p];//Timing
r1===r2===r3

Out[62]=
{0. Second,Null}

Out[63]=
{0.031 Second,Null}

Out[64]=
{0. Second,Null}

Out[65]=
True

Good they all agree. Now, for a larger sample set:

In[69]:=
n=3000;
p=Table[Random[Integer,{1,n}],{n},{2}];

In[71]:=
r1=agg0[n,p];//Timing
(*r2=agg1[n,p];//Timing*)
r3=agg2[n,p];//Timing
r1===r3

Out[71]=
{1.015 Second,Null}

Out[72]=
{0.282 Second,Null}

Out[73]=
True

We see that agg2 is about 4 times faster than agg0 here. I don't bother
to try agg1 because it probably will never finish. Some statistics on
the sample set:

In[76]:=
{Max[Length/@r1],Length[r1]}

Out[76]=
{2410,473}

The largest component had 2410 elements (so agg1 will be horrible), but
there were 473 components in total. Basically, one large component, and
lots of little components. If we increase the density of the transition
matrix:

In[77]:=
n=3000;
p=Table[Random[Integer,{1,n}],{3n},{2}];

In[79]:=
r1=agg0[n,p];//Timing
(*r2=agg1[n,p];//Timing*)
r3=agg2[n,p];//Timing
r1===r3

Out[79]=
{1.625 Second,Null}

Out[80]=
{0.094 Second,Null}

Out[81]=
True

Now, agg2 is 16 times faster than agg0. Here the statistics are:

In[82]:=
{Max[Length/@r1],Length[r1]}

Out[82]=
{2991,10}

So, we had one large component, and a few small components. Finally lets
  decrease the density of the transition matrix:

In[83]:=
n=3000;
p=Table[Random[Integer,{1,n}],{n/10},{2}];

In[85]:=
r1=agg0[n,p];//Timing
r2=agg1[n,p];//Timing
r3=agg2[n,p];//Timing
r1===r2===r3

Out[85]=
{1.89 Second,Null}

Out[86]=
{0.094 Second,Null}

Out[87]=
{2.219 Second,Null}

Out[88]=
True

This time we do try agg1, and agg1 is by far the fastest. Also, agg0 is
now a bit faster than agg2. The statistics of the sample set:

In[89]:=
{Max[Length/@r1],Length[r1]}

Out[89]=
{6,2700}

Since the largest component only had 6 elements, agg1 only needed to do
3 sparse multiplications ((t^2)^2)^2), and consequently it is much
faster. We also notice that agg0 is faster than agg2, primarily because
the overhead of setting up the SparseArrays 2700 times begins to take a
toll.

What this suggests is that it ought to be possible to combine the
approach of agg1 and agg2 to do sparse matrix multiplications until the
matrix is no longer sparse, and then switch over to matrix vector
multiplication. Such a combined approach ought to be much quicker than
one using ConnectedComponents. An algorithm to do so might be something
like the following:

1. Evaluate (t^2)^2 etc. until the matrix is no longer 10% (or some
other sparsity measure) sparse.

2. Find rows which did not change during the last iteration. These rows
are the components which have already been discovered. Remember them,
and remove them from the transition matrix.

3. Use sparse matrix vector multiplication as in agg2 to find all of the
remaining large components.

Steps 1 and 2 may be combined.

Carl Woll
Wolfram Research
Follow-Ups:
- Re: Re: Re: aggregation of related elements in a list
  - From: danl@wolfram.com
References:
- Re: aggregation of related elements in a list
  - From: Maxim <ab_def@prontomail.com>
Prev by Date: Re: Integrate vs Nintegrate for impulsive functions
Next by Date: graphing x^2+4 on x, y, and i
Previous by thread: Re: aggregation of related elements in a list
Next by thread: Re: Re: Re: aggregation of related elements in a list