Re: Problem with BinLists for large lists
- To: mathgroup at smc.vnet.net
- Subject: [mg101307] Re: [mg101293] Problem with BinLists for large lists
- From: Darren Glosemeyer <darreng at wolfram.com>
- Date: Wed, 1 Jul 2009 06:32:03 -0400 (EDT)
- References: <200906301034.GAA10625@smc.vnet.net>
pfalloon wrote:
> Hi,
> I've been trying to use BinLists to bucket large (approx. a million
> elements) lists of 3 dimensional vectors. I've noticed that BinLists
> runs pretty slowly when working with large lists in general, and for
> lists of this size and dimensionality actually runs out of memory.
>
> I've worked up an alternative implementation which runs pretty
> quickly, and works by partitioning component-wise, then finding
> intersections of each bucket. I thought it might be worth posting to
> see if anyone has encountered this issue and knows more about why it
> happens...
>
> Example implementation:
>
> (* generate some large lists of reals *)
> {x,y,z} = RandomReal[{0,1}, {3,10^6}];
> xyz = Transpose[{x, y, z}];
>
> (* define bin intervals *)
> int = Range[0,1,0.1];
>
> (* calculate buckets for each variable *)
> {xBins, yBins, zBins} = Table[Flatten@Position[#, _?(int[[i]] <= # <
> int[[i+1]] &)], {i, Length[int]-1}] & /@ {x, y, z};
>
> (* get 3x3 grid of bins by taking intersections of the single-variable
> bins *)
> xyzBins = Table[Intersection[xBins[[i]], yBins[[j]], zBins[[k]]],
> {i,Length@xBins}, {j,Length@yBins}, {k,Length@zBins}];
>
> (* look at the length of each bins *)
> Map[Length, xyzBins, {3}]
>
> (* get the actual points in each bin *)
> binContents = Map[xyz[[#]] &, xyzBins, {3}];
>
> (* look at points from a particular bin *)
> ListPlot[Transpose[xyz[[xyzBins[[1,2,3]]]]]]
>
> (* the equivalent call to BinLists runs out of memory on my machine
> (32-bit windows) *)
> BinLists[xyz, {int}, {int}, {int}];
>
>
> I'm not sure why BinLists doesn't use an approach like this. My
> suspicion is that it may be using something like (using above
> definitions):
>
> xyzBins = Outer[Intersection, xBins, yBins, zBins];
>
> the problem being that the Outer function will effectively cause
> multiple copies of each list to be generated and thus runs out of
> memory.
>
>
Thanks for the suggestion. We will need to take a closer look at this
for a future version.
As an important side note, when bins have regular spacing, it will
always be faster to use the {min,max,dx} spec than the {{c1,c2,...}}
cutoff spec because the dx spec can take advantage of the knowledge that
the bins are regularly spaced. In this particular example,
BinLists[xyz, {0, 1, 0.1}, {0, 1, 0.1}, {0, 1, 0.1}]
is about 11 times faster than the suggested code.
Darren Glosemeyer
Wolfram Research