RE: Efficient Sorting Algorithm ++ SortSplit1 and Merge
- To: mathgroup at smc.vnet.net
- Subject: [mg38307] RE: Efficient Sorting Algorithm ++ SortSplit1 and Merge
- From: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>
- Date: Thu, 12 Dec 2002 01:31:51 -0500 (EST)
- Sender: owner-wri-mathgroup at wolfram.com
Dear Allan, assuming you have posted this to MathGroup, I reply in public. >-----Original Message----- >From: Allan Hayes [mailto:hay at haystack.demon.co.uk] To: mathgroup at smc.vnet.net >Sent: Friday, December 06, 2002 9:49 PM >To: Wolf, Hartmut; Brian Higgins >Subject: [mg38307] Re: Efficient Sorting Algorithm > > > >I give below some further speed-ups. >The improvement, at least on the data I used, is due to using >Split[Sort[_]] right at the start to parcel the data - it probably depends >on there being a lot of repetition of the first term of the entries in the >data. Yes, the idea is, to exploit all work already done, and Composition[Split,Sort] appears to be a fundamental design pattern for the language! >The code for Matchings7 below is faster than Daniel Lichblau's recently >posted code - though he suggests that this might be speeded up by >substituting numbers for strings and compiling. > >Data > > s1 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]], > Random[Integer, {1, 2000}]}, {6000}]; > s2 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]], > Random[Integer, {1, 2000}]}, {12000}]; > >Get the members of s1 with the same first entry as some member of s2. > >Previous code (preserves order) > > Matched1[s1_,s2_]:= > Cases[ s1,{Alternatives@@Union[s2[[All,1]]],_}] > > Matched1[s1,s2];//Timing > > {3.4 Second,Null} > >New code (does not preserve order) - note the use of >s1[[Ordering[s1[[All, 1]]]]] to save on ordering with respect to the second entries > > Matched4[s1_,s2_]:= > Cases[Split[s1[[Ordering[s1[[All,1]]]]], #1[[1]]===#2[[1]]&], > {{Alternatives@@Union[s2[[All,1]]],_},___}] > > Matched4[s1,s2];//Timing > > {1.6 Second,Null} > This is very pleasing, and a general pattern if I understand Union[s2[[All,1]]] as the sorted list of keys searched. A slight improvement is still possible though. We suffer from a performance penalty imposed by the test function for Split. I had already published how to overcome this, here now recalled in a procedure: Attributes[SortSplit1] = {HoldFirst}; SortSplit1[s_] := Module[{ord = Ordering[s[[All, 1]]], secs}, secs = FoldList[Plus, 0, Length /@ Split[s[[ord, 1]] ]]; s[[Take[ord, #]]] & /@ Transpose[{Drop[secs + 1, -1], Drop[secs, 1]}]] ssb = SortSplit1[s1]; // Timing {0.271 Second, Null} whereas ssa = Split[s1[[Ordering[s1[[All, 1]]]]], #1[[1]] === #2[[1]] &]; // Timing {0.571 Second, Null} In[19]:= ssa == ssb Out[19]= True The Hold-Attribute for SortSplit1 is essential for its performance! If you don't want to give that, you alternatively may make a local copy of the argument as in SortSplit1x[s1_] := Module[{s = s1}, Module[{ord = Ordering[s[[All, 1]]]}, s[[Take[ord, #]]] & /@ With[{secs = FoldList[Plus, 0, Length /@ Split[s[[ord, 1]] ]]}, Transpose[{Drop[secs + 1, -1], Drop[secs, 1]}]]]] ssx = SortSplit1x[s1]; // Timing {0.301 Second, Null} In[56]:= ssx == ssa Out[56]= True I observe this, yet my understanding of Mathematica is not deep enough to explain it, let alone use this to deduce, so to speek, improved coding. Such I have to store it in my brain as an ad-hoc rule of surmise. I would be very pleased to have that clarified. With this we now may tweak Matched4 a bit further: Attributes[Matched4bis] = {HoldAll}; Matched4bis[s1_, s2_] := Cases[SortSplit1[s1], {{Alternatives @@ Union[s2[[All, 1]]], _}, ___}] (The need to Hold propagates! Not for SortSplit1x of course) Matched4x[s1_, s2_] := Cases[SortSplit1x[s1], {{Alternatives @@ Union[s2[[All, 1]]], _}, ___}] In[58]:= Matched4[s1, s2]; // Timing Out[58]= {1.061 Second, Null} In[59]:= Matched4bis[s1, s2]; // Timing Out[59]= {0.761 Second, Null} In[60]:= Matched4x[s1, s2]; // Timing Out[60]= {0.781 Second, Null} > >Get the full matchings (this code will work on {s1,s2,....,sn}, as well as >just {s1, s2}) > > Matchings7[s_]:= > Module[{st,pt,sp}, > st = #[[Ordering[#[[All,1]]]]]&/@s; > sp=Split[#,#1[[1]]===#2[[1]]&]&/@st; > pt= (Alternatives@@(Intersection@@ st[[All,All,1]])); > Transpose[Cases[#,{{pt,_},___}]&/@sp] > ]; > > (ms7=Matchings7[{s1,s2}]);//Timing > > {3.46 Second,Null} > I didn't manage to introduce SortSplit1 into Matchings7 (the Hold mechanisms seems to break down when Mapping SortSplit1 over the argument), however it works with SortSplit1x: Matchings7x[s_] := Module[{pt, sp}, sp = SortSplit1x /@ s; pt = (Alternatives @@ (Intersection @@ s[[All, All, 1]])); Transpose[Cases[#, {{pt, _}, ___}] & /@ sp]]; In[173]:= ms7x = Matchings7x[{s1, s2}]; // Timing Out[173]= {1.652 Second, Null} In[174]:= ms7 = Matchings7[{s1, s2}]; // Timing Out[174]= {2.364 Second, Null} In[175]:= ms7ax == ms7 Out[175]= True Well, I just made it (writing calls upon thinking): Attributes[Matchings7a] = {HoldAll}; Matchings7a[s_] := Module[{pt, sp}, sp = SortSplit1 /@ Unevaluated[s]; pt = (Alternatives @@ (Intersection @@ s[[All, All, 1]])); Transpose[Cases[#, {{pt, _}, ___}] & /@ sp]]; In[227]:= ms7a = Matchings7a[{s1, s2}]; // Timing Out[227]= {1.572 Second, Null} > >Daniel Lichtblau (gives essentially the same information as Matchings7) > > myTest[l1_, l2_] := > Module[{s1, s2, m = Length[l1], n = Length[l2], j, k, res = {}, ord }, > s1 = Sort[l1]; s2 = Sort[l2]; > For[j = 1; k = 1, j <= m && k <= n, Null, > ord = Order[s1[[j,1]], s2[[k,1]]]; > If[ord == 1, j++; Continue[]]; > If[ord == -1, k++; Continue[]]; > res = {res, {s1[[j]], s2[[k]]}}; > j++; k++; > ]; > Partition[Partition[Flatten[res], 2], 2] > ] > > myTest[s1,s2];//Timing > > {7.47 Second,Null} > I do not agree, as to this containing the same information, see: In[239]:= my = myTest[s1, s2]; // Timing Out[239]= {5.578 Second, Null} In[242]:= Length /@ {my, ms7} Out[242]= {5944, 625} In[243]:= Composition[Length, Union, Flatten] /@ {my, ms7} Out[243]= {2592, 2624} In[244]:= Length /@ {my, Flatten[ms7, 1]} Out[244]= {5944, 1250} Or compare In[248]:= ms7[[1]] Out[248]= {{{"AAAA", 747}, {"AAAA", 1580}, {"AAAA", 1929}, {"AAAA", 1277}}, {{"AAAA", 1241}, {"AAAA", 658}, {"AAAA", 141}, {"AAAA", 1567}, {"AAAA", 371}, {"AAAA", 861}, {"AAAA", 1963}, {"AAAA", 1607}, {"AAAA", 316}, {"AAAA", 1674}, {"AAAA", 1501}, {"AAAA", 1749}, {"AAAA", 1777}, {"AAAA", 172}, {"AAAA", 568}, {"AAAA", 1752}}} In[250]:= Take[my, 10] Out[250]= {{{"AAAA", 747}, {"AAAA", 141}}, {{"AAAA", 1277}, {"AAAA", 172}}, {{"AAAA", 1580}, {"AAAA", 316}}, {{"AAAA", 1929}, {"AAAA", 371}}, {{"AAAB", 417}, {"AAAB", 18}}, {{"AAAB", 1521}, {"AAAB", 332}}, {{"AAAB", 1526}, {"AAAB", 396}}, {{"AAAB", 1693}, {"AAAB", 525}}, {{"AAAB", 1961}, {"AAAB", 664}}, {{"AAAB", 1976}, {"AAAB", 827}}} Quite a lot is missing here! This is caused by the fact that at match, when Order[...] == 0, both pointers j, k are incremented, such elements from the longer sequence at this (matching) key will disappear. I had already communicated a procedural solution, here adapted to your form of output: myMatch4[s1_, s2_] := Module[{ss1 = Sort[s1], ss2 = Sort[s2], keys, rr1, rr2}, keys = Intersection[ss1[[All, 1]], ss2[[All, 1]]]; rr1 = getLabelled[ss1, keys]; rr2 = getLabelled[ss2, keys]; Transpose[{rr1, rr2}]] In[266]:= my4 = myMatch4[s1, s2]; // Timing Out[266]= {5.748 Second, Null} In[267]:= Map[Sort, ms7, {2}] === my4 Out[267]= True One might obtain Brian's form just by Apply'ing Outer[List,##,1]& at level {1} to get all matching pairs. > >-- >Allan > >--------------------- >Allan Hayes >Mathematica Training and Consulting >Leicester UK >www.haystack.demon.co.uk >hay at haystack.demon.co.uk >Voice: +44 (0)116 271 4198 >Fax: +44 (0)870 164 0565 > > > Here now a better procedural solution, combining Daniel's ansatz with observations made so far: myMatch5[s1_, s2_] := Module[{ss1 = SortSplit1[s1], ss2 = SortSplit1[s2], acc}, Module[{u1 = ss1[[All, 1, 1]], u2 = ss2[[All, 1, 1]], j = 1, jx = Length[ss1], k = 1, kx = Length[ss2], res = acc[]}, While[j <= jx && k <= kx, Switch[Order[u1[[j]], u2[[k]]], 1, ++j, -1, ++k, 0, (res = acc[res, {ss1[[j]], ss2[[k]]}]; ++j; ++k)] ]; List @@ Flatten[res, Infinity, acc]]] In[277]:= my5 = myMatch5[s1, s2]; // Timing Out[277]= {2.313 Second, Null} In[278]:= my5 === ms7 Out[278]= True Effectively this is a prototype of a Sort-Merge procedure, pairing subsequences of equal keys. Although Mathematica has a most performant Sort procedure, there is nothing comparable for the merge step. Having this one (or something similar) in the kernel, certainly would add value to the product. SortSplit1 may be considered too. Hartmut