Re: Problem with BinLists for large lists

• To: mathgroup at smc.vnet.net
• Subject: [mg101309] Re: Problem with BinLists for large lists
• From: Vince <blueschi at gmail.com>
• Date: Wed, 1 Jul 2009 06:32:25 -0400 (EDT)
• References: <h2cpq4\$af5\$1@smc.vnet.net>

On Jun 30, 6:35 am, pfalloon <pfall... at gmail.com> wrote:
> Hi,
> I've been trying to use BinLists to bucket large (approx. a million
> elements) lists of 3 dimensional vectors. I've noticed that BinLists
> runs pretty slowly when working with large lists in general, and for
> lists of this size and dimensionality actually runs out of memory.
>
> I've worked up an alternative implementation which runs pretty
> quickly, and works by partitioning component-wise, then finding
> intersections of each bucket. I thought it might be worth posting to
> see if anyone has encountered this issue and knows more about why it
> happens...
>
> Example implementation:
>
> (* generate some large lists of reals *)
> {x,y,z} = RandomReal[{0,1}, {3,10^6}];
> xyz = Transpose[{x, y, z}];
>
> (* define bin intervals *)
> int = Range[0,1,0.1];
>
> (* calculate buckets for each variable *)
> {xBins, yBins, zBins} = Table[Flatten@Position[#, _?(int[[i]] <= # <
> int[[i+1]] &)], {i, Length[int]-1}] & /@ {x, y, z};
>
> (* get 3x3 grid of bins by taking intersections of the single-variable
> bins *)
> xyzBins = Table[Intersection[xBins[[i]], yBins[[j]], zBins[[k]]],
> {i,Length@xBins}, {j,Length@yBins}, {k,Length@zBins}];
>
> (* look at the length of each bins *)
> Map[Length, xyzBins, {3}]
>
> (* get the actual points in each bin *)
> binContents = Map[xyz[[#]] &, xyzBins, {3}];
>
> (* look at points from a particular bin *)
> ListPlot[Transpose[xyz[[xyzBins[[1,2,3]]]]]]
>
> (* the equivalent call to BinLists runs out of memory on my machine
> (32-bit windows) *)
> BinLists[xyz, {int}, {int}, {int}];
>
> I'm not sure why BinLists doesn't use an approach like this. My
> suspicion is that it may be using something like (using above
> definitions):
>
> xyzBins = Outer[Intersection, xBins, yBins, zBins];
>
> the problem being that the Outer function will effectively cause
> multiple copies of each list to be generated and thus runs out of
> memory.

After encoutering similar issues, I had to roll my own. Effectively,
it's an efficient GatherBy with "replacement" (to allow for
overlapping bins) and an optional reduction immediately after each bin
production. It incrementally consumes a single local copy of the
original data, to avoid duplicating that original more than once. I
can get similar results with careful use of Nearest (thanks to WRI
technical support), but find that uselessly slow and inefficient for
my actual data volumes.

I hope WRI offers something similar in the future (a hyper-efficient
GatherBy with replacement).

Vince Virgilio

• Prev by Date: Re: Problem with BinLists for large lists
• Next by Date: Re: Re: Is Orange translucent? - What Methods
• Previous by thread: Re: Problem with BinLists for large lists
• Next by thread: Re: Re: Is Orange translucent? - What Methods