Re: Re: DeleteDuplicates is too slow?
- To: mathgroup at smc.vnet.net
- Subject: [mg107288] Re: [mg107180] Re: DeleteDuplicates is too slow?
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Mon, 8 Feb 2010 03:34:38 -0500 (EST)
- References: <201002050819.DAA06508@smc.vnet.net>
Hi Clint,

Sorry, I misinterpreted your original problem (I did not notice that you only care about the first element when testing for equality). To compensate for this, below is a solution which will give an order-of-magnitude speedup *if* you work on positive integer values (the case of possible negative values can be incorporated fairly easily if needed):

Clear[sparseArrayElements];
sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]

Clear[deleteDuplicates];
Options[deleteDuplicates] = {Ordered -> True, Threshold -> 1000000};
deleteDuplicates[data_List, opts___?OptionQ] :=
  Module[{fdata = data[[All, 1]], parr,
     rlen = Range[Length[data], 1, -1],
     preserveOrder = Ordered /. Flatten[{opts}] /. Options[deleteDuplicates],
     threshold = Threshold /. Flatten[{opts}] /. Options[deleteDuplicates],
     dim},
    dim = Max[fdata];
    parr = If[dim < threshold, Table[0, {dim}], SparseArray[{}, dim, 0]];
    parr[[fdata[[rlen]]]] = rlen;
    parr = sparseArrayElements@If[dim < threshold, SparseArray@parr, parr];
    data[[If[preserveOrder, Sort@parr, parr]]]];

This exploits the technique proposed recently by Norbert Pozar. If you don't care about your items being ordered, you may set the option Ordered -> False and get some additional speed-up. The Threshold parameter defines the maximal data value (which serves as the length of the auxiliary array) up to which the "normal" (dense) form of the array is used. If the first element of any data point exceeds the threshold, the SparseArray representation is used instead.
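To see why the reverse-order assignment keeps the *first* occurrence of each value, here is a minimal sketch of the core trick on a toy dataset (the variable names mirror those in the function above, but the toy data and the use of DeleteCases in place of the faster sparseArrayElements extraction are illustrative only):

```mathematica
(* Toy data: the first elements {3, 1, 3, 2, 1} contain duplicates *)
data  = {{3, a}, {1, b}, {3, c}, {2, d}, {1, e}};
fdata = data[[All, 1]];
rlen  = Range[Length[data], 1, -1];   (* positions in reverse: {5,4,3,2,1} *)

(* Auxiliary array indexed by value; slot v will hold a position where v occurs *)
parr = Table[0, {Max[fdata]}];
parr[[fdata[[rlen]]]] = rlen;
(* Positions are written in reverse order, so for each repeated value the
   earliest position is written last and overwrites the later ones *)

data[[Sort[DeleteCases[parr, 0]]]]
(* keeps the first occurrence of each value, in original order:
   {{3, a}, {1, b}, {2, d}} *)
```

The vectorized part assignment with a repeated index list is what makes this fast: one pass writes all positions, and the ordering of writes resolves the duplicates.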
Benchmarks:

In[6]:= data = RandomInteger[{1, 100000}, {30000, 2}];

In[7]:= Timing[t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[7]= 0.19

In[8]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[8]= 0.511

In[9]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[9]= 0.04

In[10]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[10]= 0.01

In[11]:= t1 == t2 == t3
Out[11]= True

In[12]:= data = RandomInteger[{1, 100000}, {500000, 2}];

In[13]:= Timing[t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[13]= 2.844

In[14]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[14]= 1.983

In[15]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[15]= 0.241

In[16]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[16]= 0.18

In[17]:= t3 == t2 == t1
Out[17]= True

In[18]:= data = RandomInteger[{1, 10000000}, {100000, 2}];

In[19]:= Timing[t1 = #[[1]] & /@ Tally[data, #1[[1]] == #2[[1]] &];][[1]]
Out[19]= 1.242

In[20]:= Timing[t2 = #[[1]] & /@ GatherBy[data, First];][[1]]
Out[20]= 2.593

In[21]:= Timing[t3 = deleteDuplicates[data];][[1]]
Out[21]= 0.281

In[22]:= Timing[t4 = deleteDuplicates[data, Ordered -> False];][[1]]
Out[22]= 0.2

In[23]:= t3 == t2 == t1
Out[23]= True

Regards,
Leonid

On Fri, Feb 5, 2010 at 12:19 AM, Clint Zeringue <info at scienceskillz.com> wrote:

> Thanks for all the great responses.
>
> I have summarized them below.
>
> Use Tally or, even better, GatherBy, to obtain very substantial reductions
> in time:
>
> In[1]:= data=RandomInteger[{1,99},{100000,2}];
>
> In[2]:=
> sameQ[_,_]=False;
> sameQ[{x_,y_},{x_,z_}]=True;
>
> In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
> Out[4]= {7.987,Null}
>
> In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
> Out[5]= 0.063
>
> In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
> Out[6]= 0.016
>
> In[7]:= t0===t1===t2
> Out[7]= True
>
> Tomas
> Tomas Garza [tgarza10 at msn.com]
>
> n = 100000;
> data = RandomInteger[{0, 9}, {n, 2}] // N;
>
> Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing
> {1.39164,10}
>
> Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing
> {0.289784,10}
>
> Bob Hanlon
> Bob Hanlon [hanlonr at cox.net]
>
> Take care not to use N as a variable as it already has a built-in meaning.
>
> If it is not necessary to keep the elements of the list in the same order,
> then a different, lower-complexity algorithm can be used:
>
> SplitBy[SortBy[data, First], First][[All, 1]]
>
> This will be much faster, but will not remove exactly the same elements as
> DeleteDuplicates because the second element of the pairs is always ignored.
> DeleteDuplicates will always keep the very first occurrence of equivalent
> elements. Is this important for your calculation?
>
> Szabolcs Horvát [szhorvat at gmail.com]
>
> The fastest solution was Tomas Garza's:
>
> Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[
>
- References:
- Re: DeleteDuplicates is too slow?
- From: Clint Zeringue <info@scienceskillz.com>
- Re: DeleteDuplicates is too slow?