Re: Union slowdown when SameTest is specified

• To: mathgroup at smc.vnet.net
• Subject: [mg101389] Re: [mg101378] Union slowdown when SameTest is specified
• From: Leonid Shifrin <lshifr at gmail.com>
• Date: Sat, 4 Jul 2009 06:43:54 -0400 (EDT)
• References: <200907030940.FAA19348@smc.vnet.net>

```Hi,

This problem has been noticed and discussed before.

probably also on this group. I also discussed it in my book:

http://www.mathprogramming-intro.org/book/node290.html

This is how I understand this issue (I may be wrong):
in the case of Union,  the main reason for the huge performance difference
is as follows: having a sameness function is a weaker requirement than
having a  comparison function. For native sameness function SameQ, there
exists  a corresponding comparison function OrderedQ (or its internal
equivalent),  therefore O(NlogN) algorithm can be used. For a generic
comparison function, if one wants to stay completely general, all one can do
is to compare elements pairwise, which leads to O(N^2) complexity.

On top of this, there is another factor (perhaps, less important in the
case of Union): when SameTest is not specified, the code of Union (or Sort
etc) is entirely internal, written in C and very fast. When we specify
SameTest, the Union code has to constantly interface with the high - level
Mathematica code (to call SameTest comparison function), and that slows it
down a lot. This is a general situation with Mathematica - there is on the
average an order of magnitude or so gap between built - in and best-written
user-defined functions in Mathematica, if the functionality is not "too
close" to built-ins so that several of them can be "glued" together with a
minimal performance hit.  One reason for this is that built-ins are written
in a much lower-level compiled language, another (related) is that they
avoid the main evaluation loop and symbolic rewritings involved in general
Mathematica computations.This shows up to a various degree in many problems.

Returning to  Union, I think it would be nice if Union had an option to
provide a comparison function if it exists, and then use the O(NlogN)
algorithm. The function below represents  an attempt in this direction:

Clear[unionBy];
unionBy[list_List, hashF : (_Symbol | _Function)] :=
unionBy[list, hashF /@ list];

unionBy[list_List, hashlist_List] /;
Length[list] == Length[hashlist] :=
With[{ord = Ordering[list]},
With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
Extract[sorted,
Sort[Union[sortedHashlist] /.
Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];

As it follows from its name, this will work if  there exists some kind of
"hash"
function that maps your data to another list on which we can use the native
sort/union. The implementation hinges on Sort / Union algorithms being
stable (that means, they don't displace elements which are already "equal").
Initial sorting of a list and list of hash values is needed so that the
produced
result is exactly the same as for standard Union - here we also use the fact

that Sort, Ordering and Union  used without SameTest use the same underlying
sorting algorithm with the same native (canonical) sorting function. If,
then <unionBy> will be even faster.

Examples:

1. Union of integer intervals, sametest - their length:

In[1] =
test = RandomInteger[{1, 100}, {500, 2}];

In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] //
Short // Timing

Out[2] =
{0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[3] = unionBy[test, Subtract @@ # &] // Short // Timing

Out[3] =
{0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}

In[4] = unionBy[test, Subtract @@ # &] ===
Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]

Out[4] = True

In[5] =
largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
Short // Timing

Out[6]  =
{6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[7] =unionBy[largetest, First] // Short // Timing

Out[7] =
{0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing

Out[8] =
{0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}

In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
unionBy[largetest, First]

Out[9] = True

The main idea behind unionBy is described in my book at:

http://www.mathprogramming-intro.org/book/node466.html

where the code is analyzed line by line.

Hope this helps.

Regards,
Leonid

On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote:

> Hello group!  I was a little surprised and puzzled by the following:
>
> In[1]:= \$Version
> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>
> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
> Timing[Union[data];]
> Out[3]= {0.008138, Null}
>
> So far so good; so far so speedy.  But why does this happen:
>
> In[4]:= Timing[Union[data, SameTest -> Equal];]
> Out[4]= {15.313, Null}
>
> Or more egregiously,
>
> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
> Out[5]= {86.293, Null}
>
> There's nothing special about the form of the data here, just it
> happens to be similar in form to the data I was working on when I
> experienced terrible slowdown with the SameTest that I've shown in the
> third example - which doesn't seem like a demanding comparison.    So
> can anyone explain the orders-of-magnitude change in times here?  Much
> appreciated!
>
>

```

• Prev by Date: Re: Union slowdown when SameTest is specified
• Next by Date: Re: Applying lists to FindRoot of a NIntegrate function
• Previous by thread: Re: Union slowdown when SameTest is specified
• Next by thread: Re: Re: Union slowdown when SameTest is