MathGroup Archive: September 2012 [00335]

Re: Fast selection of lots of elements from a large list

To: mathgroup at smc.vnet.net
Subject: [mg128253] Re: Fast selection of lots of elements from a large list
From: Sseziwa Mukasa <mukasa at gmail.com>
Date: Sat, 29 Sep 2012 02:58:54 -0400 (EDT)
References: <20120928024827.F221D6848@smc.vnet.net>

You can extract the rows a little more quickly if you sort them first and take advantage of the fact that they are unique.  My timings are an order of magnitude faster because I'm using integers instead of strings for row IDs.  If you could map your strings to integers you may see a similar performance gain.
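
As a rough sketch of the string-to-integer idea, the ID-to-position rules can be built once and then reused for every cluster. The toy IDs and the name idToIndex below are made up for illustration; Dispatch, Thread and ReplaceAll are standard built-ins. This assumes the row IDs are unique.

```mathematica
(* Toy data: short strings standing in for the 12-character row identifiers *)
rowIDs = {"id-c", "id-a", "id-d", "id-b"};
q = {"id-d", "id-a"};

(* Build the ID -> position rules once; Dispatch hashes them for fast lookup *)
idToIndex = Dispatch[Thread[rowIDs -> Range[Length[rowIDs]]]];

(* Replace every ID in q by its integer position in rowIDs *)
positions = q /. idToIndex
(* {3, 2} *)
```

The resulting integer list can be used directly with Part (for example data[[positions]], where data is a hypothetical name for the other data set), so no per-ID Position search is needed.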

Anyway here's my example code:

rowIds = Range[600000];
q = RandomSample[rowIds, 1000];

(* Baseline: one Position search over rowIds per query ID *)
Timing[Map[Position[rowIds, #] &, q]][[1]]

(* Single pass over the (already sorted) rowIds against the sorted query IDs *)
Timing[Block[{result = {}, qSorted = Sort[q], index = 1},
   Do[If[rowIds[[i]] == qSorted[[index]], result = {result, rowIds[[i]]};
     index++]; If[index > Length[q], Break[]], {i, Length[rowIds]}];
   Flatten[result]]][[1]]

Out[21]= 5.71901
Out[22]= 3.44422

Again, note that this is not an apples-to-apples comparison.  The second expression extracts the actual row, not just its position.
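
For a like-for-like comparison, the single-pass scan can be adapted to return positions instead of values. This is only a sketch under stated assumptions: mergePositions is a hypothetical helper name, the IDs are assumed unique, and every element of q is assumed to occur in ids. Using Ordering means ids itself does not have to be sorted.

```mathematica
(* Hypothetical helper: positions in ids of the IDs listed in q.
   Assumes unique IDs and that every element of q occurs in ids. *)
mergePositions[ids_List, q_List] :=
 If[q === {}, {},
  Module[{ord = Ordering[ids], sorted, qSorted = Sort[q],
    index = 1, result = {}},
   sorted = ids[[ord]];   (* ids in canonical order *)
   Do[
    If[sorted[[i]] === qSorted[[index]],
     result = {result, ord[[i]]};   (* record the original position *)
     index++];
    If[index > Length[qSorted], Break[]],
    {i, Length[sorted]}];
   Flatten[result]]]

mergePositions[{"c", "a", "b", "d"}, {"d", "a"}]
(* {2, 4} *)
```

The positions come back in the sorted order of the IDs, not in the order in which they appear in q.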

Regards,
	Sseziwa

On Sep 27, 2012, at 10:48 PM, Mark Coleman wrote:

> Greetings,
>
> I've been using Mathematica to perform cluster analysis on a data set with about 600,000 rows and 60 columns. I've had FindClusters return a unique row identifier (a 12-character string) rather than the clustered data because I want to "join" these results to another data set for further analysis. To accomplish this I've been using the Position function to identify the element numbers in each cluster.
>
> To give a specific example, my cluster analysis identifies twelve clusters in my original data set. The first of these clusters contains about 15,000 row identifiers. To extract the corresponding data from other data sets, I find the position of each identifier in my original data set using the simple code
>
> q=clusterResults[[1]]; (* row IDs for first cluster *)
> p=Map[Position[rowIDs,#]&,q];
>
> where "rowIDs" is the first column of the other data set, which contains the same string identifiers (rowIDs has about 600,000 sublists). I then Extract these elements ("rows") from the data set and continue my analysis.
>
> Unfortunately this is quite slow. Doing this on a sample of 1000 elements requires 340 seconds on my desktop computer. Some of my clusters have many tens of thousands of elements. I'm hoping someone can suggest a faster way of doing this.
>
> Thanks,
>
> Mark
>
>
