MathGroup Archive 2002

[Date Index] [Thread Index] [Author Index]

Search the Archive

RE: Efficient Sorting Algorithm ++ SortSplit1 and Merge

  • To: mathgroup at smc.vnet.net
  • Subject: [mg38307] RE: Efficient Sorting Algorithm ++ SortSplit1 and Merge
  • From: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>
  • Date: Thu, 12 Dec 2002 01:31:51 -0500 (EST)
  • Sender: owner-wri-mathgroup at wolfram.com

Dear Allan,

assuming you have posted this to MathGroup, I reply in public.

>-----Original Message-----
>From: Allan Hayes [mailto:hay at haystack.demon.co.uk]
To: mathgroup at smc.vnet.net
>Sent: Friday, December 06, 2002 9:49 PM
>To: Wolf, Hartmut; Brian Higgins
>Subject: [mg38307] Re: Efficient Sorting Algorithm
>
>
>
>I give below some further speed-ups.
>The improvement, at least on  the data I used, is due to using
>Split[Sort[_]] right at the start to parcel the data - it probably depends
>on there being a lot of repetition of the first term of the entries in the
>data.

Yes, the idea is, to exploit all work already done, and
Composition[Split,Sort]
appears to be a fundamental design pattern for the language!


>The code for Matchings7 below is faster than Daniel Lichblau's recently
>posted code - though he suggests that this might be speeded up by
>substituting numbers for strings and compiling.
>
>Data
>
>    s1 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]],
>            Random[Integer, {1, 2000}]}, {6000}];
>    s2 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]],
>            Random[Integer, {1, 2000}]}, {12000}];
>
>Get the members of s1 with the same first entry as some member of s2.
>
>Previous code (preserves order)
>
>    Matched1[s1_,s2_]:=
>      Cases[ s1,{Alternatives@@Union[s2[[All,1]]],_}]
>
>    Matched1[s1,s2];//Timing
>
>        {3.4 Second,Null}
>
>New code (does not preserve order) - note the use of  
>s1[[Ordering[s1[[All, 1]]]]] to save on ordering with respect to the second
entries
>
>    Matched4[s1_,s2_]:=
>          Cases[Split[s1[[Ordering[s1[[All,1]]]]], #1[[1]]===#2[[1]]&],
>                {{Alternatives@@Union[s2[[All,1]]],_},___}]
>
>    Matched4[s1,s2];//Timing
>
>    {1.6 Second,Null}
>

This is very pleasing, and a general pattern if I understand
Union[s2[[All,1]]] as the sorted list of keys searched. 

A slight improvement is still possible though. We suffer from a performance
penalty imposed by the test function for Split. I had already published how
to overcome this, here now recalled in a procedure:

Attributes[SortSplit1] = {HoldFirst};

SortSplit1[s_] := Module[{ord = Ordering[s[[All, 1]]], secs},
    secs = FoldList[Plus, 0, Length /@ Split[s[[ord, 1]] ]]; 
    s[[Take[ord, #]]] & /@ Transpose[{Drop[secs + 1, -1], Drop[secs, 1]}]]

ssb = SortSplit1[s1]; // Timing
{0.271 Second, Null}

whereas 

ssa = Split[s1[[Ordering[s1[[All, 1]]]]], #1[[1]] === #2[[1]] &]; // Timing
{0.571 Second, Null}

In[19]:= ssa == ssb
Out[19]= True


The Hold-Attribute for SortSplit1 is essential for its performance! If you
don't want to give that, you alternatively may make a local copy of the
argument as in

SortSplit1x[s1_] := 
  Module[{s = s1}, 
    Module[{ord = Ordering[s[[All, 1]]]}, 
      s[[Take[ord, #]]] & /@ 
        With[{secs = FoldList[Plus, 0, Length /@ Split[s[[ord, 1]] ]]}, 
          Transpose[{Drop[secs + 1, -1], Drop[secs, 1]}]]]]

ssx = SortSplit1x[s1]; // Timing
{0.301 Second, Null}

In[56]:= ssx == ssa
Out[56]= True


I observe this, yet my understanding of Mathematica is not deep enough to
explain it, let alone use this to deduce, so to speek, improved coding. Such
I have to store it in my brain as an ad-hoc rule of surmise. I would be very
pleased to have that clarified.


With this we now may tweak Matched4 a bit further:

Attributes[Matched4bis] = {HoldAll};

Matched4bis[s1_, s2_] := 
  Cases[SortSplit1[s1], {{Alternatives @@ Union[s2[[All, 1]]], _}, ___}]

(The need to Hold propagates! Not for SortSplit1x of course)


Matched4x[s1_, s2_] := 
  Cases[SortSplit1x[s1], {{Alternatives @@ Union[s2[[All, 1]]], _}, ___}]


In[58]:= Matched4[s1, s2]; // Timing
Out[58]= {1.061 Second, Null}

In[59]:= Matched4bis[s1, s2]; // Timing
Out[59]= {0.761 Second, Null}

In[60]:= Matched4x[s1, s2]; // Timing
Out[60]= {0.781 Second, Null}


>
>Get the full matchings (this code will work on {s1,s2,....,sn}, as well as
>just {s1, s2})
>
>   Matchings7[s_]:=
>    Module[{st,pt,sp},
>      st = #[[Ordering[#[[All,1]]]]]&/@s;
>      sp=Split[#,#1[[1]]===#2[[1]]&]&/@st;
>      pt= (Alternatives@@(Intersection@@ st[[All,All,1]]));
>      Transpose[Cases[#,{{pt,_},___}]&/@sp]
>      ];
>
>    (ms7=Matchings7[{s1,s2}]);//Timing
>
>        {3.46 Second,Null}
>

I didn't manage to introduce SortSplit1 into Matchings7 (the Hold mechanisms
seems to break down when Mapping SortSplit1 over the argument), however it
works with SortSplit1x:

Matchings7x[s_] := Module[{pt, sp},
      sp = SortSplit1x /@ s;
      pt = (Alternatives @@ (Intersection @@ s[[All, All, 1]]));
      Transpose[Cases[#, {{pt, _}, ___}] & /@ sp]];

In[173]:= ms7x = Matchings7x[{s1, s2}]; // Timing
Out[173]= {1.652 Second, Null}

In[174]:= ms7 = Matchings7[{s1, s2}]; // Timing
Out[174]= {2.364 Second, Null}

In[175]:= ms7ax == ms7
Out[175]= True


Well, I just made it (writing calls upon thinking):

Attributes[Matchings7a] = {HoldAll};

Matchings7a[s_] := Module[{pt, sp},
      sp = SortSplit1 /@ Unevaluated[s];
      pt = (Alternatives @@ (Intersection @@ s[[All, All, 1]]));
      Transpose[Cases[#, {{pt, _}, ___}] & /@ sp]];

In[227]:= ms7a = Matchings7a[{s1, s2}]; // Timing
Out[227]= {1.572 Second, Null}


>
>Daniel Lichtblau (gives essentially the same information as Matchings7)
>
>    myTest[l1_, l2_] :=
>    Module[{s1, s2, m = Length[l1], n = Length[l2], j, k, res = {}, ord },
>        s1 = Sort[l1]; s2 = Sort[l2];
>        For[j = 1; k = 1, j <= m && k <= n, Null,
>             ord = Order[s1[[j,1]], s2[[k,1]]];
>             If[ord == 1, j++; Continue[]];
>             If[ord == -1, k++; Continue[]];
>             res = {res, {s1[[j]], s2[[k]]}};
>             j++; k++;
>        ];
>        Partition[Partition[Flatten[res], 2], 2]
>    ]
>
>    myTest[s1,s2];//Timing
>
>        {7.47 Second,Null}
>

I do not agree, as to this containing the same information, see:

In[239]:= my = myTest[s1, s2]; // Timing
Out[239]= {5.578 Second, Null}

In[242]:= Length /@ {my, ms7}
Out[242]= {5944, 625}

In[243]:= Composition[Length, Union, Flatten] /@ {my, ms7}
Out[243]= {2592, 2624}

In[244]:= Length /@ {my, Flatten[ms7, 1]}
Out[244]= {5944, 1250}


Or compare

In[248]:= ms7[[1]]
Out[248]=
{{{"AAAA", 747}, {"AAAA", 1580}, {"AAAA", 1929}, {"AAAA", 1277}}, {{"AAAA", 
      1241}, {"AAAA", 658}, {"AAAA", 141}, {"AAAA", 1567}, {"AAAA", 
      371}, {"AAAA", 861}, {"AAAA", 1963}, {"AAAA", 1607}, {"AAAA", 
      316}, {"AAAA", 1674}, {"AAAA", 1501}, {"AAAA", 1749}, {"AAAA", 
      1777}, {"AAAA", 172}, {"AAAA", 568}, {"AAAA", 1752}}}

In[250]:= Take[my, 10]
Out[250]=
{{{"AAAA", 747}, {"AAAA", 141}}, {{"AAAA", 1277}, {"AAAA", 172}}, {{"AAAA", 
      1580}, {"AAAA", 316}}, {{"AAAA", 1929}, {"AAAA", 371}}, {{"AAAB", 
      417}, {"AAAB", 18}}, {{"AAAB", 1521}, {"AAAB", 332}}, {{"AAAB", 
      1526}, {"AAAB", 396}}, {{"AAAB", 1693}, {"AAAB", 525}}, {{"AAAB", 
      1961}, {"AAAB", 664}}, {{"AAAB", 1976}, {"AAAB", 827}}}

Quite a lot is missing here! This is caused by the fact that at match, when
Order[...] == 0, both pointers j, k are incremented, such elements from the
longer sequence at this (matching) key will disappear.


I had already communicated a procedural solution, here adapted to your form
of output:

myMatch4[s1_, s2_] := 
  Module[{ss1 = Sort[s1], ss2 = Sort[s2], keys, rr1, rr2},
    keys = Intersection[ss1[[All, 1]], ss2[[All, 1]]];
    rr1 = getLabelled[ss1, keys];
    rr2 = getLabelled[ss2, keys];
    Transpose[{rr1, rr2}]]

In[266]:= my4 = myMatch4[s1, s2]; // Timing
Out[266]= {5.748 Second, Null}

In[267]:= Map[Sort, ms7, {2}] === my4
Out[267]= True

One might obtain Brian's form just by Apply'ing Outer[List,##,1]& at level
{1} to get all matching pairs. 


>
>--
>Allan
>
>---------------------
>Allan Hayes
>Mathematica Training and Consulting
>Leicester UK
>www.haystack.demon.co.uk
>hay at haystack.demon.co.uk
>Voice: +44 (0)116 271 4198
>Fax: +44 (0)870 164 0565
>
>
>

Here now a better procedural solution, combining Daniel's ansatz with
observations made so far:

myMatch5[s1_, s2_] := 
  Module[{ss1 = SortSplit1[s1], ss2 = SortSplit1[s2], acc},
    Module[{u1 = ss1[[All, 1, 1]], u2 = ss2[[All, 1, 1]], j = 1, 
        jx = Length[ss1], k = 1, kx = Length[ss2], res = acc[]},
      While[j <= jx && k <= kx,
        Switch[Order[u1[[j]], u2[[k]]],
          1, ++j,
          -1, ++k,
          0, (res = acc[res, {ss1[[j]], ss2[[k]]}];
            ++j; ++k)] ];
      List @@ Flatten[res, Infinity, acc]]]

In[277]:= my5 = myMatch5[s1, s2]; // Timing
Out[277]= {2.313 Second, Null}

In[278]:= my5 === ms7
Out[278]= True

Effectively this is a prototype of a Sort-Merge procedure, pairing
subsequences of equal keys. Although Mathematica has a most performant Sort
procedure, there is nothing comparable for the merge step. Having this one
(or something similar) in the kernel, certainly would add value to the
product. SortSplit1 may be considered too.


Hartmut



  • Prev by Date: Re: Efficient Sorting Algorithm
  • Next by Date: Re: Question on factor group calculations
  • Previous by thread: Re: Pasting tables into Excel
  • Next by thread: Printing Graphics Problem