Re: Re: DeleteDuplicates is too slow?
- To: mathgroup at smc.vnet.net
- Subject: [mg107288] Re: [mg107180] Re: DeleteDuplicates is too slow?
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Mon, 8 Feb 2010 03:34:38 -0500 (EST)
- References: <201002050819.DAA06508@smc.vnet.net>
Hi Clint,
Sorry, I misinterpreted your original problem (I did not notice that you only
care about the first element when testing for equality). To compensate for
this, below is a solution which gives an order-of-magnitude speedup, *if* you
work on (positive) integer values (the case of possible negative values can
be incorporated fairly easily if needed; see the sketch after the benchmarks
below):
Clear[sparseArrayElements];
(* extract the list of explicitly stored (non-zero) values of a SparseArray *)
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicates];
Options[deleteDuplicates] = {Ordered -> True, Threshold -> 1000000};

deleteDuplicates[data_List, opts___?OptionQ] :=
  Module[{fdata = data[[All, 1]], parr,
     rlen = Range[Length[data], 1, -1],
     preserveOrder =
       Ordered /. Flatten[{opts}] /. Options[deleteDuplicates],
     threshold =
       Threshold /. Flatten[{opts}] /. Options[deleteDuplicates], dim},
    dim = Max[fdata];
    (* auxiliary array indexed by the first elements: dense below the
       threshold, sparse above it *)
    parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
    (* assign positions in reverse, so the first occurrence of each key wins *)
    parr[[fdata[[rlen]]]] = rlen;
    parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
    data[[If[preserveOrder, Sort@parr, parr]]]];
This exploits the technique proposed recently by Norbert Pozar. If you don't
care about your items staying in their original order, you may set the option
Ordered -> False and get some additional speed-up. The Threshold parameter
defines the maximal data value (which serves as the length of the auxiliary
array) up to which the "normal" dense form of the array is used. If the first
element of any data point exceeds the threshold, the SparseArray
representation is used instead.
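For illustration, here is a small toy example (made up, not taken from your
data); only the first element of each pair is used for the duplicate test,
and the first occurrence of each key is kept:

smallData = {{3, "a"}, {1, "b"}, {3, "c"}, {2, "d"}, {1, "e"}};

deleteDuplicates[smallData]

{{3, "a"}, {1, "b"}, {2, "d"}}

deleteDuplicates[smallData, Ordered -> False] (* same rows, possibly reordered *)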
Benchmarks:
In[6]:= data = RandomInteger[{1, 100000}, {30000, 2}];
In[7]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[7]= 0.19
In[8]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[8]= 0.511
In[9]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[9]= 0.04
In[10]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[10]= 0.01
In[11]:= t1 == t2 == t3
Out[11]= True
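The unordered result t4 contains the same rows in a different order, so it is
not included in the check above; to verify it, one can sort both sides first
(not run here, but this should give True):

Sort[t4] == Sort[t3]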
In[12]:= data = RandomInteger[{1, 100000}, {500000, 2}];
In[13]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[13]= 2.844
In[14]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[14]= 1.983
In[15]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[15]= 0.241
In[16]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[16]= 0.18
In[17]:= t3 == t2 == t1
Out[17]= True
In[18]:= data = RandomInteger[{1, 10000000}, {100000, 2}];
In[19]:= Timing[
t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[19]= 1.242
In[20]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[20]= 2.593
In[21]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[21]= 0.281
In[22]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[22]= 0.2
In[23]:= t3 == t2 == t1
Out[23]= True
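Regarding the negative (or zero) values mentioned at the top: one simple way
to accommodate them (an untested sketch; deleteDuplicatesShifted is just an
illustrative name) is to shift the keys into the positive integer range, run
deleteDuplicates on the shifted keys paired with the original positions, and
map the result back to the original rows:

deleteDuplicatesShifted[data_List, opts___?OptionQ] :=
  Module[{shift = 1 - Min[data[[All, 1]]], shifted},
    (* pair each shifted (now positive) key with its original position *)
    shifted = Transpose[{data[[All, 1]] + shift, Range[Length[data]]}];
    (* reuse deleteDuplicates on the shifted keys, then pick the
       corresponding rows of the original data *)
    data[[deleteDuplicates[shifted, opts][[All, 2]]]]];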
Regards,
Leonid
On Fri, Feb 5, 2010 at 12:19 AM, Clint Zeringue <info at scienceskillz.com> wrote:
> Thanks for all the great responses.
>
> I have summarized them below.
>
> Use Tally or, even better, GatherBy, to obtain very substantial reductions
> in time:
>
> In[1]:= data=RandomInteger[{1,99},{100000,2}];
>
> In[2]:=
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
> Out[4]= {7.987,Null}
>
> In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
> Out[5]= 0.063
>
> In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
> Out[6]= 0.016
>
> In[7]:= t0===t1===t2
> Out[7]= True
>
> Tomas
> Tomas Garza [tgarza10 at msn.com]
>
> n = 100000;
>
> data = RandomInteger[{0, 9}, {n, 2}] // N;
>
> Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing
>
> {1.39164,10}
>
> Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing
>
> {0.289784,10}
>
>
> Bob Hanlon
> Bob Hanlon [hanlonr at cox.net]
>
>
>
> Take care not to use N as a variable as it already has a built-in meaning.
>
> If it is not necessary to keep the elements of the list in the same order,
> then a different, lower complexity algorithm can be used:
>
> SplitBy[SortBy[data, First], First][[All, 1]]
>
> This will be much faster, but will not remove exactly the same elements as
> DeleteDuplicates because the second element of the pairs is always ignored.
> DeleteDuplicates will always keep the very first occurrence of equivalent
> elements. Is this important for your calculation?
>
> Szabolcs Horvát [szhorvat at gmail.com]
>
> The fastest solution was Tomas Garza's:
>
> Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
>
>