       Re: Re: Union slowdown when SameTest is

• To: mathgroup at smc.vnet.net
• Subject: [mg101399] Re: [mg101389] Re: [mg101378] Union slowdown when SameTest is
• From: DrMajorBob <btreat1 at austin.rr.com>
• Date: Sun, 5 Jul 2009 04:46:42 -0400 (EDT)
• References: <200907030940.FAA19348@smc.vnet.net>

```Here's a faster and (arguably) simpler method, using unsortedUnion (Tally):

unsortedUnion[x_List] := Tally[x][[All, 1]]
unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] :=
Tally[x, test][[All, 1]]
unsortedUnion[x_List, test : (_Symbol | _Function)] :=
Tally[x, test][[All, 1]]

test = RandomInteger[{1, 100}, {500, 2}];

Timing[one =
Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.087628, Null}

Timing[two = unionBy[test, Subtract @@ # &];]

{0.002282, Null}

Timing[three =
unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];]

{0.00181, Null}

one == two == three

True

largetest = RandomInteger[{1, 100}, {1000, 2, 12}];

Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];]

{0.990508, Null}

Timing[two = unionBy[largetest, First];]

{0.006182, Null}

Timing[three =
unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];]

{0.003405, Null}

one == two == three

True

Bobby

On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com>
wrote:

> Hi,
>
> This problem has been noticed and discussed before.
>
>
> probably also on this group. I also discussed it in my book:
>
> http://www.mathprogramming-intro.org/book/node290.html
>
>
> This is how I understand this issue (I may be wrong):
> in the case of Union,  the main reason for the huge performance
> difference
> is as follows: having a sameness function is a weaker requirement than
> having a  comparison function. For native sameness function SameQ, there
> exists  a corresponding comparison function OrderedQ (or its internal
> equivalent),  therefore O(NlogN) algorithm can be used. For a generic
> comparison function, if one wants to stay completely general, all one
> can do
> is to compare elements pairwise, which leads to O(N^2) complexity.
>
> On top of this, there is another factor (perhaps, less important in the
> case of Union): when SameTest is not specified, the code of Union (or
> Sort
> etc) is entirely internal, written in C and very fast. When we specify
> SameTest, the Union code has to constantly interface with the high -
> level
> Mathematica code (to call SameTest comparison function), and that slows
> it
> down a lot. This is a general situation with Mathematica - there is on
> the
> average an order of magnitude or so gap between built - in and
> best-written
> user-defined functions in Mathematica, if the functionality is not "too
> close" to built-ins so that several of them can be "glued" together with
> a
> minimal performance hit.  One reason for this is that built-ins are
> written
> in a much lower-level compiled language, another (related) is that they
> avoid the main evaluation loop and symbolic rewritings involved in
> general
> Mathematica computations.This shows up to a various degree in many
> problems.
>
> Returning to  Union, I think it would be nice if Union had an option to
> provide a comparison function if it exists, and then use the O(NlogN)
> algorithm. The function below represents  an attempt in this direction:
>
> Clear[unionBy];
> unionBy[list_List, hashF : (_Symbol | _Function)] :=
>   unionBy[list, hashF /@ list];
>
> unionBy[list_List, hashlist_List] /;
>    Length[list] == Length[hashlist] :=
>   With[{ord = Ordering[list]},
>    With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
>     Extract[sorted,
>      Sort[Union[sortedHashlist] /.
>        Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];
>
> As it follows from its name, this will work if  there exists some kind of
> "hash"
> function that maps your data to another list on which we can use the
> native
> sort/union. The implementation hinges on Sort / Union algorithms being
> stable (that means, they don't displace elements which are already
> "equal").
> Initial sorting of a list and list of hash values is needed so that the
> produced
> result is exactly the same as for standard Union - here we also use the
> fact
>
> that Sort, Ordering and Union  used without SameTest use the same
> underlying
> sorting algorithm with the same native (canonical) sorting function. If,
> instead of the "hash-function", you already have a list of "hashcodes",
> then <unionBy> will be even faster.
>
> Examples:
>
> 1. Union of integer intervals, sametest - their length:
>
> In =
> test = RandomInteger[{1, 100}, {500, 2}];
>
> In = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)] //
>   Short // Timing
>
> Out =
> {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>
> In = unionBy[test, Subtract @@ # &] // Short // Timing
>
> Out =
> {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>
> In = unionBy[test, Subtract @@ # &] ===
>  Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]
>
> Out = True
>
> 2. Your example (smaller sample):
>
> In =
> largetest = RandomInteger[{1, 100}, {1000, 2, 12}];
>
> In = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
>   Short // Timing
>
> Out  =
> {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In =unionBy[largetest, First] // Short // Timing
>
> Out =
> {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In = unionBy[largetest, largetest[[All, 1]]] // Short // Timing
>
> Out =
> {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>
> In = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
>  unionBy[largetest, First]
>
> Out = True
>
> The main idea behind unionBy is described in my book at:
>
> http://www.mathprogramming-intro.org/book/node466.html
>
> where the code is analyzed line by line.
>
> Hope this helps.
>
> Regards,
> Leonid
>
>
>
>
> On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote:
>
>> Hello group!  I was a little surprised and puzzled by the following:
>>
>> In:= \$Version
>> Out= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>>
>> In:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
>> Timing[Union[data];]
>> Out= {0.008138, Null}
>>
>> So far so good; so far so speedy.  But why does this happen:
>>
>> In:= Timing[Union[data, SameTest -> Equal];]
>> Out= {15.313, Null}
>>
>> Or more egregiously,
>>
>> In:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
>> Out= {86.293, Null}
>>
>> There's nothing special about the form of the data here, just it
>> happens to be similar in form to the data I was working on when I
>> experienced terrible slowdown with the SameTest that I've shown in the
>> third example - which doesn't seem like a demanding comparison.    So
>> can anyone explain the orders-of-magnitude change in times here?  Much
>> appreciated!
>>
>>
>

--
DrMajorBob at bigfoot.com

```

• Prev by Date: Re: Manipulate subtleties?
• Next by Date: Re: Parametric cylinder plots round in one instance and elliptical in
• Previous by thread: Re: Union slowdown when SameTest is specified
• Next by thread: Re: Re: Union slowdown when SameTest is