Re: Re: Union slowdown when SameTest is
- To: mathgroup at smc.vnet.net
- Subject: [mg101399] Re: [mg101389] Re: [mg101378] Union slowdown when SameTest is
- From: DrMajorBob <btreat1 at austin.rr.com>
- Date: Sun, 5 Jul 2009 04:46:42 -0400 (EDT)
- References: <200907030940.FAA19348@smc.vnet.net>
- Reply-to: drmajorbob at bigfoot.com
Here's a faster and (arguably) simpler method, using unsortedUnion (Tally): unsortedUnion[x_List] := Tally[x][[All, 1]] unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] := Tally[x, test][[All, 1]] unsortedUnion[x_List, test : (_Symbol | _Function)] := Tally[x, test][[All, 1]] test = RandomInteger[{1, 100}, {500, 2}]; Timing[one = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];] {0.087628, Null} Timing[two = unionBy[test, Subtract @@ # &];] {0.002282, Null} Timing[three = unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];] {0.00181, Null} one == two == three True largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];] {0.990508, Null} Timing[two = unionBy[largetest, First];] {0.006182, Null} Timing[three = unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];] {0.003405, Null} one == two == three True Bobby On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com> wrote: > Hi, > > This problem has been noticed and discussed before. > > http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html, > > probably also on this group. I also discussed it in my book: > > http://www.mathprogramming-intro.org/book/node290.html > > > This is how I understand this issue (I may be wrong): > in the case of Union, the main reason for the huge performance > difference > is as follows: having a sameness function is a weaker requirement than > having a comparison function. For native sameness function SameQ, there > exists a corresponding comparison function OrderedQ (or its internal > equivalent), therefore O(NlogN) algorithm can be used. For a generic > comparison function, if one wants to stay completely general, all one > can do > is to compare elements pairwise, which leads to O(N^2) complexity. > > On top of this, there is another factor (perhaps, less important in the > case of Union): when SameTest is not specified, the code of Union (or > Sort > etc) is entirely internal, written in C and very fast. When we specify > SameTest, the Union code has to constantly interface with the high - > level > Mathematica code (to call SameTest comparison function), and that slows > it > down a lot. This is a general situation with Mathematica - there is on > the > average an order of magnitude or so gap between built - in and > best-written > user-defined functions in Mathematica, if the functionality is not "too > close" to built-ins so that several of them can be "glued" together with > a > minimal performance hit. One reason for this is that built-ins are > written > in a much lower-level compiled language, another (related) is that they > avoid the main evaluation loop and symbolic rewritings involved in > general > Mathematica computations.This shows up to a various degree in many > problems. > > Returning to Union, I think it would be nice if Union had an option to > provide a comparison function if it exists, and then use the O(NlogN) > algorithm. The function below represents an attempt in this direction: > > Clear[unionBy]; > unionBy[list_List, hashF : (_Symbol | _Function)] := > unionBy[list, hashF /@ list]; > > unionBy[list_List, hashlist_List] /; > Length[list] == Length[hashlist] := > With[{ord = Ordering[list]}, > With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]}, > Extract[sorted, > Sort[Union[sortedHashlist] /. > Dispatch[MapIndexed[Rule, sortedHashlist]]]]]]; > > As it follows from its name, this will work if there exists some kind of > "hash" > function that maps your data to another list on which we can use the > native > sort/union. The implementation hinges on Sort / Union algorithms being > stable (that means, they don't displace elements which are already > "equal"). > Initial sorting of a list and list of hash values is needed so that the > produced > result is exactly the same as for standard Union - here we also use the > fact > > that Sort, Ordering and Union used without SameTest use the same > underlying > sorting algorithm with the same native (canonical) sorting function. If, > instead of the "hash-function", you already have a list of "hashcodes", > then <unionBy> will be even faster. > > Examples: > > 1. Union of integer intervals, sametest - their length: > > In[1] = > test = RandomInteger[{1, 100}, {500, 2}]; > > In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] // > Short // Timing > > Out[2] = > {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} > > In[3] = unionBy[test, Subtract @@ # &] // Short // Timing > > Out[3] = > {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} > > In[4] = unionBy[test, Subtract @@ # &] === > Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] > > Out[4] = True > > 2. Your example (smaller sample): > > In[5] = > largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; > > In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] // > Short // Timing > > Out[6] = > {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} > > In[7] =unionBy[largetest, First] // Short // Timing > > Out[7] = > {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} > > In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing > > Out[8] = > {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} > > In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] === > unionBy[largetest, First] > > Out[9] = True > > The main idea behind unionBy is described in my book at: > > http://www.mathprogramming-intro.org/book/node466.html > > where the code is analyzed line by line. > > Hope this helps. > > Regards, > Leonid > > > > > On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote: > >> Hello group! I was a little surprised and puzzled by the following: >> >> In[1]:= $Version >> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)" >> >> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}]; >> Timing[Union[data];] >> Out[3]= {0.008138, Null} >> >> So far so good; so far so speedy. But why does this happen: >> >> In[4]:= Timing[Union[data, SameTest -> Equal];] >> Out[4]= {15.313, Null} >> >> Or more egregiously, >> >> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];] >> Out[5]= {86.293, Null} >> >> There's nothing special about the form of the data here, just it >> happens to be similar in form to the data I was working on when I >> experienced terrible slowdown with the SameTest that I've shown in the >> third example - which doesn't seem like a demanding comparison. So >> can anyone explain the orders-of-magnitude change in times here? Much >> appreciated! >> >> > -- DrMajorBob at bigfoot.com
- References:
- Union slowdown when SameTest is specified
- From: J Siehler <jsiehler@gmail.com>
- Union slowdown when SameTest is specified