MathGroup Archive: July 2009 [00057]

[Date Index] [Thread Index] [Author Index]

Re: Union slowdown when SameTest is specified

To: mathgroup at smc.vnet.net
Subject: [mg101389] Re: [mg101378] Union slowdown when SameTest is specified
From: Leonid Shifrin <lshifr at gmail.com>
Date: Sat, 4 Jul 2009 06:43:54 -0400 (EDT)
References: <200907030940.FAA19348@smc.vnet.net>

Hi,

This problem has been noticed and discussed before.

http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html,

probably also on this group. I also discussed it in my book:

http://www.mathprogramming-intro.org/book/node290.html

This is how I understand this issue (I may be wrong):
in the case of Union,  the main reason for the huge performance difference
is as follows: having a sameness function is a weaker requirement than
having a  comparison function. For native sameness function SameQ, there
exists  a corresponding comparison function OrderedQ (or its internal
equivalent),  therefore O(NlogN) algorithm can be used. For a generic
comparison function, if one wants to stay completely general, all one can do
is to compare elements pairwise, which leads to O(N^2) complexity.

On top of this, there is another factor (perhaps, less important in the
case of Union): when SameTest is not specified, the code of Union (or Sort
etc) is entirely internal, written in C and very fast. When we specify
SameTest, the Union code has to constantly interface with the high - level
Mathematica code (to call SameTest comparison function), and that slows it
down a lot. This is a general situation with Mathematica - there is on the
average an order of magnitude or so gap between built - in and best-written
user-defined functions in Mathematica, if the functionality is not "too
close" to built-ins so that several of them can be "glued" together with a
minimal performance hit.  One reason for this is that built-ins are written
in a much lower-level compiled language, another (related) is that they
avoid the main evaluation loop and symbolic rewritings involved in general
Mathematica computations.This shows up to a various degree in many problems.

Returning to  Union, I think it would be nice if Union had an option to
provide a comparison function if it exists, and then use the O(NlogN)
algorithm. The function below represents  an attempt in this direction:

Clear[unionBy];
unionBy[list_List, hashF : (_Symbol | _Function)] :=
  unionBy[list, hashF /@ list];

unionBy[list_List, hashlist_List] /;
   Length[list] == Length[hashlist] :=
  With[{ord = Ordering[list]},
   With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
    Extract[sorted,
     Sort[Union[sortedHashlist] /.
       Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];

As it follows from its name, this will work if  there exists some kind of
"hash"
function that maps your data to another list on which we can use the native
sort/union. The implementation hinges on Sort / Union algorithms being
stable (that means, they don't displace elements which are already "equal").
Initial sorting of a list and list of hash values is needed so that the
produced
result is exactly the same as for standard Union - here we also use the fact

that Sort, Ordering and Union  used without SameTest use the same underlying
sorting algorithm with the same native (canonical) sorting function. If,
instead of the "hash-function", you already have a list of "hashcodes",
then <unionBy> will be even faster.

Examples:

1. Union of integer intervals, sametest - their length:

In[1] =
test = RandomInteger[{1, 100}, {500, 2}];

In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] //
  Short // Timing

Out[2] =
{0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[3] = unionBy[test, Subtract @@ # &] // Short // Timing

Out[3] =
{0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[4] = unionBy[test, Subtract @@ # &] ===
 Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]

Out[4] = True

2. Your example (smaller sample):

In[5] =
largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
  Short // Timing

Out[6]  =
{6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[7] =unionBy[largetest, First] // Short // Timing

Out[7] =
{0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing

Out[8] =
{0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
 unionBy[largetest, First]

Out[9] = True

The main idea behind unionBy is described in my book at:

http://www.mathprogramming-intro.org/book/node466.html

where the code is analyzed line by line.

Hope this helps.

Regards,
Leonid

On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote:

> Hello group!  I was a little surprised and puzzled by the following:
>
> In[1]:= $Version
> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>
> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
> Timing[Union[data];]
> Out[3]= {0.008138, Null}
>
> So far so good; so far so speedy.  But why does this happen:
>
> In[4]:= Timing[Union[data, SameTest -> Equal];]
> Out[4]= {15.313, Null}
>
> Or more egregiously,
>
> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
> Out[5]= {86.293, Null}
>
> There's nothing special about the form of the data here, just it
> happens to be similar in form to the data I was working on when I
> experienced terrible slowdown with the SameTest that I've shown in the
> third example - which doesn't seem like a demanding comparison.    So
> can anyone explain the orders-of-magnitude change in times here?  Much
> appreciated!
>
>

References:
- Union slowdown when SameTest is specified
  - From: J Siehler <jsiehler@gmail.com>

Prev by Date: Re: Union slowdown when SameTest is specified

Next by Date: Re: Applying lists to FindRoot of a NIntegrate function

Previous by thread: Re: Union slowdown when SameTest is specified

Next by thread: Re: Re: Union slowdown when SameTest is