Re: Problem with BinLists for large lists
- To: mathgroup at smc.vnet.net
- Subject: [mg101309] Re: Problem with BinLists for large lists
- From: Vince <blueschi at gmail.com>
- Date: Wed, 1 Jul 2009 06:32:25 -0400 (EDT)
- References: <h2cpq4$af5$1@smc.vnet.net>
On Jun 30, 6:35 am, pfalloon <pfall... at gmail.com> wrote: > Hi, > I've been trying to use BinLists to bucket large (approx. a million > elements) lists of 3 dimensional vectors. I've noticed that BinLists > runs pretty slowly when working with large lists in general, and for > lists of this size and dimensionality actually runs out of memory. > > I've worked up an alternative implementation which runs pretty > quickly, and works by partitioning component-wise, then finding > intersections of each bucket. I thought it might be worth posting to > see if anyone has encountered this issue and knows more about why it > happens... > > Example implementation: > > (* generate some large lists of reals *) > {x,y,z} = RandomReal[{0,1}, {3,10^6}]; > xyz = Transpose[{x, y, z}]; > > (* define bin intervals *) > int = Range[0,1,0.1]; > > (* calculate buckets for each variable *) > {xBins, yBins, zBins} = Table[Flatten@Position[#, _?(int[[i]] <= # < > int[[i+1]] &)], {i, Length[int]-1}] & /@ {x, y, z}; > > (* get 3x3 grid of bins by taking intersections of the single-variable > bins *) > xyzBins = Table[Intersection[xBins[[i]], yBins[[j]], zBins[[k]]], > {i,Length@xBins}, {j,Length@yBins}, {k,Length@zBins}]; > > (* look at the length of each bins *) > Map[Length, xyzBins, {3}] > > (* get the actual points in each bin *) > binContents = Map[xyz[[#]] &, xyzBins, {3}]; > > (* look at points from a particular bin *) > ListPlot[Transpose[xyz[[xyzBins[[1,2,3]]]]]] > > (* the equivalent call to BinLists runs out of memory on my machine > (32-bit windows) *) > BinLists[xyz, {int}, {int}, {int}]; > > I'm not sure why BinLists doesn't use an approach like this. My > suspicion is that it may be using something like (using above > definitions): > > xyzBins = Outer[Intersection, xBins, yBins, zBins]; > > the problem being that the Outer function will effectively cause > multiple copies of each list to be generated and thus runs out of > memory. After encoutering similar issues, I had to roll my own. Effectively, it's an efficient GatherBy with "replacement" (to allow for overlapping bins) and an optional reduction immediately after each bin production. It incrementally consumes a single local copy of the original data, to avoid duplicating that original more than once. I can get similar results with careful use of Nearest (thanks to WRI technical support), but find that uselessly slow and inefficient for my actual data volumes. I hope WRI offers something similar in the future (a hyper-efficient GatherBy with replacement). Vince Virgilio