Re: Re: DeleteDuplicates is too slow?

• To: mathgroup at smc.vnet.net
• Subject: [mg107288] Re: [mg107180] Re: DeleteDuplicates is too slow?
• From: Leonid Shifrin <lshifr at gmail.com>
• Date: Mon, 8 Feb 2010 03:34:38 -0500 (EST)
• References: <201002050819.DAA06508@smc.vnet.net>

```Hi Clint,

Sorry, I misinterpreted your original problem (I did not notice that you only
care about the first element when testing for equality). To compensate for
this: below is a solution that gives an order-of-magnitude speedup, *if* you
work on (positive) integer values (the case of possible negative values can
be incorporated fairly easily if needed):

Clear[sparseArrayElements];
(* Extract the list of nonzero values from a SparseArray's internal
   representation: the 4th argument is {1, {rowPointers, columnIndices},
   nonzeroValues}, so part [[4, 3]] is the list of nonzero values *)
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicates];
Options[deleteDuplicates] = {Ordered -> True, Threshold -> 1000000};
deleteDuplicates[data_List, opts___?OptionQ] :=
Module[{fdata = data[[All, 1]], parr,
rlen = Range[Length[data], 1, -1],
preserveOrder =
Ordered /. Flatten[{opts}] /. Options[deleteDuplicates],
threshold =
Threshold /. Flatten[{opts}] /. Options[deleteDuplicates], dim },
dim = Max[fdata];
(* dense auxiliary array for small key ranges, SparseArray otherwise *)
parr =
If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
(* assign positions in reverse order, so the first occurrence of each
   key overwrites the later ones *)
parr[[fdata[[rlen]]]] = rlen;
parr =
sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
data[[If[preserveOrder, Sort@parr, parr]]]];
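
(* A usage sketch on toy data (the input here is illustrative, not from the
   original post). Only the first element of each pair is compared, and the
   first occurrence is kept: *)
deleteDuplicates[{{1, 10}, {2, 20}, {1, 30}, {3, 40}}]
(* -> {{1, 10}, {2, 20}, {3, 40}} *)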

This exploits the technique proposed recently by Norbert Pozar. If you don't
care about your items staying in their original order, you can set the option
Ordered -> False and get some additional speed-up. The Threshold parameter
defines the maximal data value (which serves as the length of the auxiliary
array) up to which a "normal" dense array is used; if the largest first
element of the data points reaches the threshold, the SparseArray
representation is used instead.
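
(* A minimal sketch of the position-array technique in isolation (toy data;
   the variable names here are illustrative, not part of the code above):
   write each position into an auxiliary array indexed by key value, walking
   the data in reverse so that the FIRST occurrence of each key wins. *)
small = {{3, "a"}, {1, "b"}, {3, "c"}, {2, "d"}};
keys = small[[All, 1]];
positions = Table[0, {Max[keys]}];
Do[positions[[keys[[i]]]] = i, {i, Length[keys], 1, -1}];
small[[Sort[DeleteCases[positions, 0]]]]
(* -> {{3, "a"}, {1, "b"}, {2, "d"}} *)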

Benchmarks:

In[6]:= data = RandomInteger[{1, 100000}, {30000, 2}];

In[7]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[7]= 0.19

In[8]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[8]= 0.511

In[9]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[9]= 0.04

In[10]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[10]= 0.01

In[11]:= t1 == t2 == t3

Out[11]= True

In[12]:= data = RandomInteger[{1, 100000}, {500000, 2}];

In[13]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[13]= 2.844

In[14]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[14]= 1.983

In[15]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[15]= 0.241

In[16]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[16]= 0.18

In[17]:= t3 == t2 == t1

Out[17]= True

In[18]:= data = RandomInteger[{1, 10000000}, {100000, 2}];

In[19]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]

Out[19]= 1.242

In[20]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]

Out[20]= 2.593

In[21]:= Timing[t3 = deleteDuplicates[data];][[1]]

Out[21]= 0.281

In[22]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]

Out[22]= 0.2

In[23]:= t3 == t2 == t1

Out[23]= True

Regards,
Leonid

On Fri, Feb 5, 2010 at 12:19 AM, Clint Zeringue <info at scienceskillz.com> wrote:

> Thanks for all the great responses.
>
> I have summarized them below.
>
> Use Tally or, even better, GatherBy, to obtain very substantial reductions
> in time:
>
> In[1]:= data=RandomInteger[{1,99},{100000,2}];
>
> In[2]:=
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
> Out[4]= {7.987,Null}
>
> In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
> Out[5]= 0.063
>
> In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
> Out[6]= 0.016
>
> In[7]:= t0===t1===t2
> Out[7]= True
>
> Tomas
> Tomas Garza [tgarza10 at msn.com]
>
> n = 100000;
>
> data = RandomInteger[{0, 9}, {n, 2}] // N;
>
> Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing
>
> {1.39164,10}
>
> Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing
>
> {0.289784,10}
>
>
> Bob Hanlon
> Bob Hanlon [hanlonr at cox.net]
>
>
>
> Take care not to use N as a variable as it already has a built-in meaning.
>
> If it is not necessary to keep the elements of the list in the same order,
> then a different, lower complexity algorithm can be used:
>
> SplitBy[SortBy[data, First], First][[All, 1]]
>
> This will be much faster, but will not remove exactly the same elements as
> DeleteDuplicates because the second element of the pairs is always ignored.
>  DeleteDuplicates will always keep the very first occurrence of equivalent
> elements.  Is this important for your calculation?
>
> Szabolcs Horvát [szhorvat at gmail.com]
>
> The fastest solution was Tomas Garza's:
>
> Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]

```
