Re: Re: Re: Union slowdown when
- To: mathgroup at smc.vnet.net
- Subject: [mg101413] Re: [mg101400] Re: [mg101389] Re: [mg101378] Union slowdown when
- From: DrMajorBob <btreat1 at austin.rr.com>
- Date: Sun, 5 Jul 2009 23:16:46 -0400 (EDT)
- References: <200907030940.FAA19348@smc.vnet.net>
- Reply-to: drmajorbob at bigfoot.com
Apparently, Tally doesn't slow down for custom SameTest functions, in the way that Union does. I suppose that's because Union wants an ordering, while Tally needs only a test for equivalence. Bobby On Sun, 05 Jul 2009 03:46:53 -0500, Leonid Shifrin <lshifr at gmail.com> wrote: > Agreed. Nice to know, thanks. Mine will work on v.5 and even v.4.1 though > (although this is probably irrelevant now), and also I think is a bit > more > pedagogical since it deals explicitly with the same problem that Tally > handles > internally and showcases the power of Dispatch. But for production, > certainly yours is better. > > Regards, > Leonid > > > On Sat, Jul 4, 2009 at 1:24 PM, DrMajorBob <btreat1 at austin.rr.com> wrote: > >> Here's a faster and (arguably) simpler method, using unsortedUnion >> (Tally): >> >> unsortedUnion[x_List] := Tally[x][[All, 1]] >> unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] := >> Tally[x, test][[All, 1]] >> unsortedUnion[x_List, test : (_Symbol | _Function)] := >> Tally[x, test][[All, 1]] >> >> test = RandomInteger[{1, 100}, {500, 2}]; >> >> Timing[one = >> Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];] >> >> {0.087628, Null} >> >> Timing[two = unionBy[test, Subtract @@ # &];] >> >> {0.002282, Null} >> >> Timing[three = >> unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];] >> >> {0.00181, Null} >> >> one == two == three >> >> True >> >> largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; >> >> Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];] >> >> {0.990508, Null} >> >> Timing[two = unionBy[largetest, First];] >> >> {0.006182, Null} >> >> Timing[three = >> unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];] >> >> {0.003405, Null} >> >> one == two == three >> >> True >> >> Bobby >> >> >> On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com> >> wrote: >> >> Hi, >>> >>> This problem has been noticed and discussed before. >>> >>> http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html, >>> >>> probably also on this group. I also discussed it in my book: >>> >>> http://www.mathprogramming-intro.org/book/node290.html >>> >>> >>> This is how I understand this issue (I may be wrong): >>> in the case of Union, the main reason for the huge performance >>> difference >>> is as follows: having a sameness function is a weaker requirement than >>> having a comparison function. For native sameness function SameQ, >>> there >>> exists a corresponding comparison function OrderedQ (or its internal >>> equivalent), therefore O(NlogN) algorithm can be used. For a generic >>> comparison function, if one wants to stay completely general, all one >>> can >>> do >>> is to compare elements pairwise, which leads to O(N^2) complexity. >>> >>> On top of this, there is another factor (perhaps, less important in the >>> case of Union): when SameTest is not specified, the code of Union (or >>> Sort >>> etc) is entirely internal, written in C and very fast. When we specify >>> SameTest, the Union code has to constantly interface with the high - >>> level >>> Mathematica code (to call SameTest comparison function), and that >>> slows it >>> down a lot. This is a general situation with Mathematica - there is on >>> the >>> average an order of magnitude or so gap between built - in and >>> best-written >>> user-defined functions in Mathematica, if the functionality is not "too >>> close" to built-ins so that several of them can be "glued" together >>> with a >>> minimal performance hit. One reason for this is that built-ins are >>> written >>> in a much lower-level compiled language, another (related) is that they >>> avoid the main evaluation loop and symbolic rewritings involved in >>> general >>> Mathematica computations.This shows up to a various degree in many >>> problems. >>> >>> Returning to Union, I think it would be nice if Union had an option to >>> provide a comparison function if it exists, and then use the O(NlogN) >>> algorithm. The function below represents an attempt in this direction: >>> >>> Clear[unionBy]; >>> unionBy[list_List, hashF : (_Symbol | _Function)] := >>> unionBy[list, hashF /@ list]; >>> >>> unionBy[list_List, hashlist_List] /; >>> Length[list] == Length[hashlist] := >>> With[{ord = Ordering[list]}, >>> With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]}, >>> Extract[sorted, >>> Sort[Union[sortedHashlist] /. >>> Dispatch[MapIndexed[Rule, sortedHashlist]]]]]]; >>> >>> As it follows from its name, this will work if there exists some kind >>> of >>> "hash" >>> function that maps your data to another list on which we can use the >>> native >>> sort/union. The implementation hinges on Sort / Union algorithms being >>> stable (that means, they don't displace elements which are already >>> "equal"). >>> Initial sorting of a list and list of hash values is needed so that the >>> produced >>> result is exactly the same as for standard Union - here we also use the >>> fact >>> >>> that Sort, Ordering and Union used without SameTest use the same >>> underlying >>> sorting algorithm with the same native (canonical) sorting function. >>> If, >>> instead of the "hash-function", you already have a list of "hashcodes", >>> then <unionBy> will be even faster. >>> >>> Examples: >>> >>> 1. Union of integer intervals, sametest - their length: >>> >>> In[1] = >>> test = RandomInteger[{1, 100}, {500, 2}]; >>> >>> In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] >>> // >>> Short // Timing >>> >>> Out[2] = >>> >>> {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} >>> >>> In[3] = unionBy[test, Subtract @@ # &] // Short // Timing >>> >>> Out[3] = >>> >>> {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}} >>> >>> In[4] = unionBy[test, Subtract @@ # &] === >>> Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] >>> >>> Out[4] = True >>> >>> 2. Your example (smaller sample): >>> >>> In[5] = >>> largetest = RandomInteger[{1, 100}, {1000, 2, 12}]; >>> >>> In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] // >>> Short // Timing >>> >>> Out[6] = >>> >>> {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >>> >>> In[7] =unionBy[largetest, First] // Short // Timing >>> >>> Out[7] = >>> >>> {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >>> >>> In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing >>> >>> Out[8] = >>> >>> {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}} >>> >>> In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] === >>> unionBy[largetest, First] >>> >>> Out[9] = True >>> >>> The main idea behind unionBy is described in my book at: >>> >>> http://www.mathprogramming-intro.org/book/node466.html >>> >>> where the code is analyzed line by line. >>> >>> Hope this helps. >>> >>> Regards, >>> Leonid >>> >>> >>> >>> >>> On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote: >>> >>> Hello group! I was a little surprised and puzzled by the following: >>>> >>>> In[1]:= $Version >>>> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)" >>>> >>>> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}]; >>>> Timing[Union[data];] >>>> Out[3]= {0.008138, Null} >>>> >>>> So far so good; so far so speedy. But why does this happen: >>>> >>>> In[4]:= Timing[Union[data, SameTest -> Equal];] >>>> Out[4]= {15.313, Null} >>>> >>>> Or more egregiously, >>>> >>>> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];] >>>> Out[5]= {86.293, Null} >>>> >>>> There's nothing special about the form of the data here, just it >>>> happens to be similar in form to the data I was working on when I >>>> experienced terrible slowdown with the SameTest that I've shown in the >>>> third example - which doesn't seem like a demanding comparison. So >>>> can anyone explain the orders-of-magnitude change in times here? Much >>>> appreciated! >>>> >>>> >>>> >>> >> >> >> -- >> DrMajorBob at bigfoot.com >> > > -- DrMajorBob at bigfoot.com
- References:
- Union slowdown when SameTest is specified
- From: J Siehler <jsiehler@gmail.com>
- Union slowdown when SameTest is specified