MathGroup Archive 2009

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: Re: Union slowdown when SameTest is

  • To: mathgroup at smc.vnet.net
  • Subject: [mg101399] Re: [mg101389] Re: [mg101378] Union slowdown when SameTest is
  • From: DrMajorBob <btreat1 at austin.rr.com>
  • Date: Sun, 5 Jul 2009 04:46:42 -0400 (EDT)
  • References: <200907030940.FAA19348@smc.vnet.net>
  • Reply-to: drmajorbob at bigfoot.com

Here's a faster and (arguably) simpler method, using unsortedUnion (Tally):

unsortedUnion[x_List] := Tally[x][[All, 1]]
unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] :=
  Tally[x, test][[All, 1]]
unsortedUnion[x_List, test : (_Symbol | _Function)] :=
  Tally[x, test][[All, 1]]

test = RandomInteger[{1, 100}, {500, 2}];

Timing[one =
    Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.087628, Null}

Timing[two = unionBy[test, Subtract @@ # &];]

{0.002282, Null}

Timing[three =
    unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.00181, Null}

one == two == three

True

largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];]

{0.990508, Null}

Timing[two = unionBy[largetest, First];]

{0.006182, Null}

Timing[three =
    unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];]

{0.003405, Null}

one == two == three

True

Bobby

On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com>  
wrote:

> Hi,
>
> This problem has been noticed and discussed before.
>
> http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html,
>
> probably also on this group. I also discussed it in my book:
>
> http://www.mathprogramming-intro.org/book/node290.html
>
>
> This is how I understand this issue (I may be wrong):
> in the case of Union,  the main reason for the huge performance  
> difference
> is as follows: having a sameness function is a weaker requirement than
> having a  comparison function. For native sameness function SameQ, there
> exists  a corresponding comparison function OrderedQ (or its internal
> equivalent),  therefore O(NlogN) algorithm can be used. For a generic
> comparison function, if one wants to stay completely general, all one  
> can do
> is to compare elements pairwise, which leads to O(N^2) complexity.
>
> On top of this, there is another factor (perhaps, less important in the
> case of Union): when SameTest is not specified, the code of Union (or  
> Sort
> etc) is entirely internal, written in C and very fast. When we specify
> SameTest, the Union code has to constantly interface with the high -  
> level
> Mathematica code (to call SameTest comparison function), and that slows  
> it
> down a lot. This is a general situation with Mathematica - there is on  
> the
> average an order of magnitude or so gap between built - in and  
> best-written
> user-defined functions in Mathematica, if the functionality is not "too
> close" to built-ins so that several of them can be "glued" together with  
> a
> minimal performance hit.  One reason for this is that built-ins are  
> written
> in a much lower-level compiled language, another (related) is that they
> avoid the main evaluation loop and symbolic rewritings involved in  
> general
> Mathematica computations.This shows up to a various degree in many  
> problems.
>
> Returning to  Union, I think it would be nice if Union had an option to
> provide a comparison function if it exists, and then use the O(NlogN)
> algorithm. The function below represents  an attempt in this direction:
>
> Clear[unionBy];
> unionBy[list_List, hashF : (_Symbol | _Function)] :=
>   unionBy[list, hashF /@ list];
>
> unionBy[list_List, hashlist_List] /;
>    Length[list] == Length[hashlist] :=
>   With[{ord = Ordering[list]},
>    With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
>     Extract[sorted,
>      Sort[Union[sortedHashlist] /.
>        Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];
>
> As it follows from its name, this will work if  there exists some kind of
> "hash"
> function that maps your data to another list on which we can use the  
> native
> sort/union. The implementation hinges on Sort / Union algorithms being
> stable (that means, they don't displace elements which are already  
> "equal").
> Initial sorting of a list and list of hash values is needed so that the
> produced
> result is exactly the same as for standard Union - here we also use the  
> fact
>
> that Sort, Ordering and Union  used without SameTest use the same  
> underlying
> sorting algorithm with the same native (canonical) sorting function. If,
> instead of the "hash-function", you already have a list of "hashcodes",
> then <unionBy> will be even faster.
>
> Examples:
>
> 1. Union of integer intervals, sametest - their length:
>
> In[1] =
> test = RandomInteger[{1, 100}, {500, 2}];
>
> In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] //
>   Short // Timing
>
> Out[2] =
> {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>
> In[3] = unionBy[test, Subtract @@ # &] // Short // Timing
>
> Out[3] =
> {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>
> In[4] = unionBy[test, Subtract @@ # &] ===
>  Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]
>
> Out[4] = True
>
> 2. Your example (smaller sample):
>
> In[5] =
> largetest = RandomInteger[{1, 100}, {1000, 2, 12}];
>
> In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
>   Short // Timing
>
> Out[6]  =
> {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In[7] =unionBy[largetest, First] // Short // Timing
>
> Out[7] =
> {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing
>
> Out[8] =
> {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
>  unionBy[largetest, First]
>
> Out[9] = True
>
> The main idea behind unionBy is described in my book at:
>
> http://www.mathprogramming-intro.org/book/node466.html
>
> where the code is analyzed line by line.
>
> Hope this helps.
>
> Regards,
> Leonid
>
>
>
>
> On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote:
>
>> Hello group!  I was a little surprised and puzzled by the following:
>>
>> In[1]:= $Version
>> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>>
>> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
>> Timing[Union[data];]
>> Out[3]= {0.008138, Null}
>>
>> So far so good; so far so speedy.  But why does this happen:
>>
>> In[4]:= Timing[Union[data, SameTest -> Equal];]
>> Out[4]= {15.313, Null}
>>
>> Or more egregiously,
>>
>> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
>> Out[5]= {86.293, Null}
>>
>> There's nothing special about the form of the data here, just it
>> happens to be similar in form to the data I was working on when I
>> experienced terrible slowdown with the SameTest that I've shown in the
>> third example - which doesn't seem like a demanding comparison.    So
>> can anyone explain the orders-of-magnitude change in times here?  Much
>> appreciated!
>>
>>
>



-- 
DrMajorBob at bigfoot.com


  • Prev by Date: Re: Manipulate subtleties?
  • Next by Date: Re: Parametric cylinder plots round in one instance and elliptical in
  • Previous by thread: Re: Union slowdown when SameTest is specified
  • Next by thread: Re: Re: Union slowdown when SameTest is