Re: Re: Union slowdown when SameTest is
- To: mathgroup at smc.vnet.net
- Subject: [mg101400] Re: [mg101389] Re: [mg101378] Union slowdown when SameTest is
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Sun, 5 Jul 2009 04:46:53 -0400 (EDT)
- References: <200907030940.FAA19348@smc.vnet.net>
Agreed. Nice to know, thanks. Mine will work on v.5 and even v.4.1 though (although this is probably irrelevant now), and also I think is a bit more pedagogical since it deals explicitly with the same problem that Tally handles internally and showcases the power of Dispatch. But for production, certainly yours is better. Regards, Leonid On Sat, Jul 4, 2009 at 1:24 PM, DrMajorBob <btreat1 at austin.rr.com> wrote: > Here's a faster and (arguably) simpler method, using unsortedUnion (Tally): > > unsortedUnion[x_List] := Tally[x][[All, 1]] > unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] := > Tally[x, test][[All, 1]] > unsortedUnion[x_List, test : (_Symbol | _Function)] := > Tally[x, test][[All, 1]] > > test = RandomInteger[{1, 100}, {500, 2}]; > > Timing[one = > Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];] > > {0.087628, Null} > > Timing[two = unionBy[test, Subtract @@ # &];] > > {0.002282, Null} > > Timing[three = > unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];] > > {0.00181, Null} > > one == two == three > > True > > largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; > > Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];] > > {0.990508, Null} > > Timing[two = unionBy[largetest, First];] > > {0.006182, Null} > > Timing[three = > unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];] > > {0.003405, Null} > > one == two == three > > True > > Bobby > > > On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com> > wrote: > > Hi, >> >> This problem has been noticed and discussed before. >> >> http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html, >> >> probably also on this group. I also discussed it in my book: >> >> http://www.mathprogramming-intro.org/book/node290.html >> >> >> This is how I understand this issue (I may be wrong): >> in the case of Union, the main reason for the huge performance difference >> is as follows: having a sameness function is a weaker requirement than >> having a comparison function. For native sameness function SameQ, there >> exists a corresponding comparison function OrderedQ (or its internal >> equivalent), therefore O(NlogN) algorithm can be used. For a generic >> comparison function, if one wants to stay completely general, all one can >> do >> is to compare elements pairwise, which leads to O(N^2) complexity. >> >> On top of this, there is another factor (perhaps, less important in the >> case of Union): when SameTest is not specified, the code of Union (or Sort >> etc) is entirely internal, written in C and very fast. When we specify >> SameTest, the Union code has to constantly interface with the high - level >> Mathematica code (to call SameTest comparison function), and that slows it >> down a lot. This is a general situation with Mathematica - there is on the >> average an order of magnitude or so gap between built - in and >> best-written >> user-defined functions in Mathematica, if the functionality is not "too >> close" to built-ins so that several of them can be "glued" together with a >> minimal performance hit. One reason for this is that built-ins are >> written >> in a much lower-level compiled language, another (related) is that they >> avoid the main evaluation loop and symbolic rewritings involved in general >> Mathematica computations.This shows up to a various degree in many >> problems. >> >> Returning to Union, I think it would be nice if Union had an option to >> provide a comparison function if it exists, and then use the O(NlogN) >> algorithm. The function below represents an attempt in this direction: >> >> Clear[unionBy]; >> unionBy[list_List, hashF : (_Symbol | _Function)] := >> unionBy[list, hashF /@ list]; >> >> unionBy[list_List, hashlist_List] /; >> Length[list] == Length[hashlist] := >> With[{ord = Ordering[list]}, >> With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]}, >> Extract[sorted, >> Sort[Union[sortedHashlist] /. >> Dispatch[MapIndexed[Rule, sortedHashlist]]]]]]; >> >> As it follows from its name, this will work if there exists some kind of >> "hash" >> function that maps your data to another list on which we can use the >> native >> sort/union. The implementation hinges on Sort / Union algorithms being >> stable (that means, they don't displace elements which are already >> "equal"). >> Initial sorting of a list and list of hash values is needed so that the >> produced >> result is exactly the same as for standard Union - here we also use the >> fact >> >> that Sort, Ordering and Union used without SameTest use the same >> underlying >> sorting algorithm with the same native (canonical) sorting function. If, >> instead of the "hash-function", you already have a list of "hashcodes", >> then <unionBy> will be even faster. >> >> Examples: >> >> 1. Union of integer intervals, sametest - their length: >> >> In[1] = >> test = RandomInteger[{1, 100}, {500, 2}]; >> >> In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] // >> Short // Timing >> >> Out[2] = >> >> {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} >> >> In[3] = unionBy[test, Subtract @@ # &] // Short // Timing >> >> Out[3] = >> >> {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} >> >> In[4] = unionBy[test, Subtract @@ # &] === >> Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] >> >> Out[4] = True >> >> 2. Your example (smaller sample): >> >> In[5] = >> largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; >> >> In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] // >> Short // Timing >> >> Out[6] = >> >> {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >> >> In[7] =unionBy[largetest, First] // Short // Timing >> >> Out[7] = >> >> {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >> >> In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing >> >> Out[8] = >> >> {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >> >> In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] === >> unionBy[largetest, First] >> >> Out[9] = True >> >> The main idea behind unionBy is described in my book at: >> >> http://www.mathprogramming-intro.org/book/node466.html >> >> where the code is analyzed line by line. >> >> Hope this helps. >> >> Regards, >> Leonid >> >> >> >> >> On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote: >> >> Hello group! I was a little surprised and puzzled by the following: >>> >>> In[1]:= $Version >>> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)" >>> >>> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}]; >>> Timing[Union[data];] >>> Out[3]= {0.008138, Null} >>> >>> So far so good; so far so speedy. But why does this happen: >>> >>> In[4]:= Timing[Union[data, SameTest -> Equal];] >>> Out[4]= {15.313, Null} >>> >>> Or more egregiously, >>> >>> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];] >>> Out[5]= {86.293, Null} >>> >>> There's nothing special about the form of the data here, just it >>> happens to be similar in form to the data I was working on when I >>> experienced terrible slowdown with the SameTest that I've shown in the >>> third example - which doesn't seem like a demanding comparison. So >>> can anyone explain the orders-of-magnitude change in times here? Much >>> appreciated! >>> >>> >>> >> > > > -- > DrMajorBob at bigfoot.com >
- References:
- Union slowdown when SameTest is specified
- From: J Siehler <jsiehler@gmail.com>
- Union slowdown when SameTest is specified