Re: Union slowdown when SameTest is specified

*To*: mathgroup at smc.vnet.net*Subject*: [mg101389] Re: [mg101378] Union slowdown when SameTest is specified*From*: Leonid Shifrin <lshifr at gmail.com>*Date*: Sat, 4 Jul 2009 06:43:54 -0400 (EDT)*References*: <200907030940.FAA19348@smc.vnet.net>

Hi, This problem has been noticed and discussed before. http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html, probably also on this group. I also discussed it in my book: http://www.mathprogramming-intro.org/book/node290.html This is how I understand this issue (I may be wrong): in the case of Union, the main reason for the huge performance difference is as follows: having a sameness function is a weaker requirement than having a comparison function. For native sameness function SameQ, there exists a corresponding comparison function OrderedQ (or its internal equivalent), therefore O(NlogN) algorithm can be used. For a generic comparison function, if one wants to stay completely general, all one can do is to compare elements pairwise, which leads to O(N^2) complexity. On top of this, there is another factor (perhaps, less important in the case of Union): when SameTest is not specified, the code of Union (or Sort etc) is entirely internal, written in C and very fast. When we specify SameTest, the Union code has to constantly interface with the high - level Mathematica code (to call SameTest comparison function), and that slows it down a lot. This is a general situation with Mathematica - there is on the average an order of magnitude or so gap between built - in and best-written user-defined functions in Mathematica, if the functionality is not "too close" to built-ins so that several of them can be "glued" together with a minimal performance hit. One reason for this is that built-ins are written in a much lower-level compiled language, another (related) is that they avoid the main evaluation loop and symbolic rewritings involved in general Mathematica computations.This shows up to a various degree in many problems. Returning to Union, I think it would be nice if Union had an option to provide a comparison function if it exists, and then use the O(NlogN) algorithm. The function below represents an attempt in this direction: Clear[unionBy]; unionBy[list_List, hashF : (_Symbol | _Function)] := unionBy[list, hashF /@ list]; unionBy[list_List, hashlist_List] /; Length[list] == Length[hashlist] := With[{ord = Ordering[list]}, With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]}, Extract[sorted, Sort[Union[sortedHashlist] /. Dispatch[MapIndexed[Rule, sortedHashlist]]]]]]; As it follows from its name, this will work if there exists some kind of "hash" function that maps your data to another list on which we can use the native sort/union. The implementation hinges on Sort / Union algorithms being stable (that means, they don't displace elements which are already "equal"). Initial sorting of a list and list of hash values is needed so that the produced result is exactly the same as for standard Union - here we also use the fact that Sort, Ordering and Union used without SameTest use the same underlying sorting algorithm with the same native (canonical) sorting function. If, instead of the "hash-function", you already have a list of "hashcodes", then <unionBy> will be even faster. Examples: 1. Union of integer intervals, sametest - their length: In[1] = test = RandomInteger[{1, 100}, {500, 2}]; In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] // Short // Timing Out[2] = {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} In[3] = unionBy[test, Subtract @@ # &] // Short // Timing Out[3] = {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} In[4] = unionBy[test, Subtract @@ # &] === Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] Out[4] = True 2. Your example (smaller sample): In[5] = largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] // Short // Timing Out[6] = {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} In[7] =unionBy[largetest, First] // Short // Timing Out[7] = {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing Out[8] = {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] === unionBy[largetest, First] Out[9] = True The main idea behind unionBy is described in my book at: http://www.mathprogramming-intro.org/book/node466.html where the code is analyzed line by line. Hope this helps. Regards, Leonid On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote: > Hello group! I was a little surprised and puzzled by the following: > > In[1]:= $Version > Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)" > > In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}]; > Timing[Union[data];] > Out[3]= {0.008138, Null} > > So far so good; so far so speedy. But why does this happen: > > In[4]:= Timing[Union[data, SameTest -> Equal];] > Out[4]= {15.313, Null} > > Or more egregiously, > > In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];] > Out[5]= {86.293, Null} > > There's nothing special about the form of the data here, just it > happens to be similar in form to the data I was working on when I > experienced terrible slowdown with the SameTest that I've shown in the > third example - which doesn't seem like a demanding comparison. So > can anyone explain the orders-of-magnitude change in times here? Much > appreciated! > >

**References**:**Union slowdown when SameTest is specified***From:*J Siehler <jsiehler@gmail.com>