MathGroup Archive: February 2010 [00181]

[Date Index] [Thread Index] [Author Index]

Re: DeleteDuplicates is too slow?

To: mathgroup at smc.vnet.net
Subject: [mg107180] Re: DeleteDuplicates is too slow?
From: Clint Zeringue <info at scienceskillz.com>
Date: Fri, 5 Feb 2010 03:19:48 -0500 (EST)

Thanks for all the great responses. 

I have summarized them below.

Use Tally or, even better, GatherBy, to obtain very substantial reductions in time:

In[1]:= data=RandomInteger[{1,99},{100000,2}];

In[2]:= 
sameQ[_,_]=False;
sameQ[{x_,y_},{x_,z_}]=True;

In[4]:= Timing[t0=DeleteDuplicates[data,sameQ];]
Out[4]= {7.987,Null}

In[5]:= Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[1]]
Out[5]= 0.063

In[6]:= Timing[t2=#[[1]]&/@GatherBy[data,First];][[1]]
Out[6]= 0.016

In[7]:= t0===t1===t2
Out[7]= True

Tomas
Tomas Garza [tgarza10 at msn.com]

n = 100000;

data = RandomInteger[{0, 9}, {n, 2}] // N;

Length[DeleteDuplicates[data, #1[[1]] === #2[[1]] &]] // Timing

{1.39164,10}

Length[Union[data, SameTest -> (#1[[1]] === #2[[1]] &)]] // Timing

{0.289784,10}


Bob Hanlon
Bob Hanlon [hanlonr at cox.net]



Take care not to use N as a variable as it already has a built-in meaning.

If it is not necessary to keep the elements of the list in the same order, then a different, lower complexity algorithm can be used:

SplitBy[SortBy[data, First], First][[All, 1]]

This will be much faster, but will not remove exactly the same elements as DeleteDuplicates because the second element of the pairs is always ignored.  DeleteDuplicates will always keep the very first occurrence of equivalent elements.  Is this important for your calculation?

Szabolcs HorvÃ¡t [szhorvat at gmail.com]

The fastest solution was Tomas Garza's :

Timing[t1=#[[1]]&/@Tally[data,#1[[1]]==#2[[1]]&];][[

Follow-Ups:
- Re: Re: DeleteDuplicates is too slow?
  - From: Leonid Shifrin <lshifr@gmail.com>

Prev by Date: Re: Diagonalizing large matrices

Next by Date: Re: Bug? Analytical integration of cosines gets the sign wrong

Previous by thread: Re: Re: Re: DeleteDuplicates is too slow?

Next by thread: Re: Re: DeleteDuplicates is too slow?