MathGroup Archive: July 2009 [00080]

[Date Index] [Thread Index] [Author Index]
Re: Re: Re: Union slowdown when
To: mathgroup at smc.vnet.net
Subject: [mg101413] Re: [mg101400] Re: [mg101389] Re: [mg101378] Union slowdown when
From: DrMajorBob <btreat1 at austin.rr.com>
Date: Sun, 5 Jul 2009 23:16:46 -0400 (EDT)
References: <200907030940.FAA19348@smc.vnet.net>
Reply-to: drmajorbob at bigfoot.com
Apparently, Tally doesn't slow down for custom SameTest functions, in the  
way that Union does.

I suppose that's because Union wants an ordering, while Tally needs only a  
test for equivalence.

Bobby

On Sun, 05 Jul 2009 03:46:53 -0500, Leonid Shifrin <lshifr at gmail.com>  
wrote:

> Agreed. Nice to know, thanks. Mine will work on v.5 and even v.4.1 though
> (although this is probably irrelevant now), and also I think is a bit  
> more
> pedagogical since it deals explicitly with the same problem that Tally
> handles
> internally and showcases the power of Dispatch. But for production,
> certainly yours is better.
>
> Regards,
> Leonid
>
>
> On Sat, Jul 4, 2009 at 1:24 PM, DrMajorBob <btreat1 at austin.rr.com> wrote:
>
>> Here's a faster and (arguably) simpler method, using unsortedUnion  
>> (Tally):
>>
>> unsortedUnion[x_List] := Tally[x][[All, 1]]
>> unsortedUnion[x_List, SameTest -> test : (_Symbol | _Function)] :=
>>  Tally[x, test][[All, 1]]
>> unsortedUnion[x_List, test : (_Symbol | _Function)] :=
>>  Tally[x, test][[All, 1]]
>>
>> test = RandomInteger[{1, 100}, {500, 2}];
>>
>> Timing[one =
>>   Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)];]
>>
>> {0.087628, Null}
>>
>> Timing[two = unionBy[test, Subtract @@ # &];]
>>
>> {0.002282, Null}
>>
>> Timing[three =
>>   unsortedUnion[Sort@test, (Subtract @@ #1 == Subtract @@ #2 &)];]
>>
>> {0.00181, Null}
>>
>> one == two == three
>>
>> True
>>
>> largetest = RandomInteger[{1, 100}, {1000, 2, 12}];
>>
>> Timing[one = Union[largetest, SameTest -> (First@#1 == First@#2 &)];]
>>
>> {0.990508, Null}
>>
>> Timing[two = unionBy[largetest, First];]
>>
>> {0.006182, Null}
>>
>> Timing[three =
>>   unsortedUnion[Sort@largetest, (First@#1 == First@#2 &)];]
>>
>> {0.003405, Null}
>>
>> one == two == three
>>
>> True
>>
>> Bobby
>>
>>
>> On Sat, 04 Jul 2009 05:43:54 -0500, Leonid Shifrin <lshifr at gmail.com>
>> wrote:
>>
>>  Hi,
>>>
>>> This problem has been noticed and discussed before.
>>>
>>> http://www.verbeia.com/mathematica/tips/HTMLLinks/Tricks_Misc_11.html,
>>>
>>> probably also on this group. I also discussed it in my book:
>>>
>>> http://www.mathprogramming-intro.org/book/node290.html
>>>
>>>
>>> This is how I understand this issue (I may be wrong):
>>> in the case of Union,  the main reason for the huge performance  
>>> difference
>>> is as follows: having a sameness function is a weaker requirement than
>>> having a  comparison function. For native sameness function SameQ,  
>>> there
>>> exists  a corresponding comparison function OrderedQ (or its internal
>>> equivalent),  therefore O(NlogN) algorithm can be used. For a generic
>>> comparison function, if one wants to stay completely general, all one  
>>> can
>>> do
>>> is to compare elements pairwise, which leads to O(N^2) complexity.
>>>
>>> On top of this, there is another factor (perhaps, less important in the
>>> case of Union): when SameTest is not specified, the code of Union (or  
>>> Sort
>>> etc) is entirely internal, written in C and very fast. When we specify
>>> SameTest, the Union code has to constantly interface with the high -  
>>> level
>>> Mathematica code (to call SameTest comparison function), and that  
>>> slows it
>>> down a lot. This is a general situation with Mathematica - there is on  
>>> the
>>> average an order of magnitude or so gap between built - in and
>>> best-written
>>> user-defined functions in Mathematica, if the functionality is not "too
>>> close" to built-ins so that several of them can be "glued" together  
>>> with a
>>> minimal performance hit.  One reason for this is that built-ins are
>>> written
>>> in a much lower-level compiled language, another (related) is that they
>>> avoid the main evaluation loop and symbolic rewritings involved in  
>>> general
>>> Mathematica computations.This shows up to a various degree in many
>>> problems.
>>>
>>> Returning to  Union, I think it would be nice if Union had an option to
>>> provide a comparison function if it exists, and then use the O(NlogN)
>>> algorithm. The function below represents  an attempt in this direction:
>>>
>>> Clear[unionBy];
>>> unionBy[list_List, hashF : (_Symbol | _Function)] :=
>>>  unionBy[list, hashF /@ list];
>>>
>>> unionBy[list_List, hashlist_List] /;
>>>   Length[list] == Length[hashlist] :=
>>>  With[{ord = Ordering[list]},
>>>   With[{sorted = list[[ord]], sortedHashlist = hashlist[[ord]]},
>>>    Extract[sorted,
>>>     Sort[Union[sortedHashlist] /.
>>>       Dispatch[MapIndexed[Rule, sortedHashlist]]]]]];
>>>
>>> As it follows from its name, this will work if  there exists some kind  
>>> of
>>> "hash"
>>> function that maps your data to another list on which we can use the
>>> native
>>> sort/union. The implementation hinges on Sort / Union algorithms being
>>> stable (that means, they don't displace elements which are already
>>> "equal").
>>> Initial sorting of a list and list of hash values is needed so that the
>>> produced
>>> result is exactly the same as for standard Union - here we also use the
>>> fact
>>>
>>> that Sort, Ordering and Union  used without SameTest use the same
>>> underlying
>>> sorting algorithm with the same native (canonical) sorting function.  
>>> If,
>>> instead of the "hash-function", you already have a list of "hashcodes",
>>> then <unionBy> will be even faster.
>>>
>>> Examples:
>>>
>>> 1. Union of integer intervals, sametest - their length:
>>>
>>> In[1] =
>>> test = RandomInteger[{1, 100}, {500, 2}];
>>>
>>> In[2] = Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]  
>>> //
>>>  Short // Timing
>>>
>>> Out[2] =
>>>
>>> {0.551,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>>>
>>> In[3] = unionBy[test, Subtract @@ # &] // Short // Timing
>>>
>>> Out[3] =
>>>
>>> {0.01,{{1,1},{1,30},{1,57},{1,71},{1,72},<<156>>,{96,8},{98,21},{98,30},{98,36}}}
>>>
>>> In[4] = unionBy[test, Subtract @@ # &] ===
>>>  Union[test, SameTest -> (Subtract @@ #1 == Subtract @@ #2 &)]
>>>
>>> Out[4] = True
>>>
>>> 2. Your example (smaller sample):
>>>
>>> In[5] =
>>> largetest = RandomInteger[{1, 100}, {1000, 2, 12}];
>>>
>>> In[6] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] //
>>>  Short // Timing
>>>
>>> Out[6]  =
>>>
>>> {6.7,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>>>
>>> In[7] =unionBy[largetest, First] // Short // Timing
>>>
>>> Out[7] =
>>>
>>> {0.05,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>>>
>>> In[8] = unionBy[largetest, largetest[[All, 1]]] // Short // Timing
>>>
>>> Out[8] =
>>>
>>> {0.04,{{{1,2,30,69,19,67,70,65,56,79,77,72},{37,50,4,73,<<4>>,83,47,73,31}},<<999>>}}
>>>
>>> In[9] = Union[largetest, SameTest -> (First@#1 == First@#2 &)] ===
>>>  unionBy[largetest, First]
>>>
>>> Out[9] = True
>>>
>>> The main idea behind unionBy is described in my book at:
>>>
>>> http://www.mathprogramming-intro.org/book/node466.html
>>>
>>> where the code is analyzed line by line.
>>>
>>> Hope this helps.
>>>
>>> Regards,
>>> Leonid
>>>
>>>
>>>
>>>
>>> On Fri, Jul 3, 2009 at 2:40 AM, J Siehler <jsiehler at gmail.com> wrote:
>>>
>>>  Hello group!  I was a little surprised and puzzled by the following:
>>>>
>>>> In[1]:= $Version
>>>> Out[1]= "7.0 for Mac OS X PowerPC (32-bit) (November 10, 2008)"
>>>>
>>>> In[2]:= data = RandomInteger[{1, 10}, {5000, 2, 12}];
>>>> Timing[Union[data];]
>>>> Out[3]= {0.008138, Null}
>>>>
>>>> So far so good; so far so speedy.  But why does this happen:
>>>>
>>>> In[4]:= Timing[Union[data, SameTest -> Equal];]
>>>> Out[4]= {15.313, Null}
>>>>
>>>> Or more egregiously,
>>>>
>>>> In[5]:= Timing[Union[data, SameTest -> (First@#1 == First@#2 &)];]
>>>> Out[5]= {86.293, Null}
>>>>
>>>> There's nothing special about the form of the data here, just it
>>>> happens to be similar in form to the data I was working on when I
>>>> experienced terrible slowdown with the SameTest that I've shown in the
>>>> third example - which doesn't seem like a demanding comparison.    So
>>>> can anyone explain the orders-of-magnitude change in times here?  Much
>>>> appreciated!
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> DrMajorBob at bigfoot.com
>>
>
>



-- 
DrMajorBob at bigfoot.com
References:
- Union slowdown when SameTest is specified
  - From: J Siehler <jsiehler@gmail.com>
Prev by Date: Re: 3D plot with orthographic projection
Next by Date: piechart with percent...
Previous by thread: Re: Re: Union slowdown when SameTest is
Next by thread: Re: Union slowdown when SameTest is specified