Re: Problem with BinLists for large lists
- To: mathgroup at smc.vnet.net
- Subject: [mg101307] Re: [mg101293] Problem with BinLists for large lists
- From: Darren Glosemeyer <darreng at wolfram.com>
- Date: Wed, 1 Jul 2009 06:32:03 -0400 (EDT)
- References: <200906301034.GAA10625@smc.vnet.net>
pfalloon wrote: > Hi, > I've been trying to use BinLists to bucket large (approx. a million > elements) lists of 3 dimensional vectors. I've noticed that BinLists > runs pretty slowly when working with large lists in general, and for > lists of this size and dimensionality actually runs out of memory. > > I've worked up an alternative implementation which runs pretty > quickly, and works by partitioning component-wise, then finding > intersections of each bucket. I thought it might be worth posting to > see if anyone has encountered this issue and knows more about why it > happens... > > Example implementation: > > (* generate some large lists of reals *) > {x,y,z} = RandomReal[{0,1}, {3,10^6}]; > xyz = Transpose[{x, y, z}]; > > (* define bin intervals *) > int = Range[0,1,0.1]; > > (* calculate buckets for each variable *) > {xBins, yBins, zBins} = Table[Flatten@Position[#, _?(int[[i]] <= # < > int[[i+1]] &)], {i, Length[int]-1}] & /@ {x, y, z}; > > (* get 3x3 grid of bins by taking intersections of the single-variable > bins *) > xyzBins = Table[Intersection[xBins[[i]], yBins[[j]], zBins[[k]]], > {i,Length@xBins}, {j,Length@yBins}, {k,Length@zBins}]; > > (* look at the length of each bins *) > Map[Length, xyzBins, {3}] > > (* get the actual points in each bin *) > binContents = Map[xyz[[#]] &, xyzBins, {3}]; > > (* look at points from a particular bin *) > ListPlot[Transpose[xyz[[xyzBins[[1,2,3]]]]]] > > (* the equivalent call to BinLists runs out of memory on my machine > (32-bit windows) *) > BinLists[xyz, {int}, {int}, {int}]; > > > I'm not sure why BinLists doesn't use an approach like this. My > suspicion is that it may be using something like (using above > definitions): > > xyzBins = Outer[Intersection, xBins, yBins, zBins]; > > the problem being that the Outer function will effectively cause > multiple copies of each list to be generated and thus runs out of > memory. > > Thanks for the suggestion. We will need to take a closer look at this for a future version. As an important side note, when bins have regular spacing, it will always be faster to use the {min,max,dx} spec than the {{c1,c2,...}} cutoff spec because the dx spec can take advantage of the knowledge that the bins are regularly spaced. In this particular example, BinLists[xyz, {0, 1, 0.1}, {0, 1, 0.1}, {0, 1, 0.1}] is about 11 times faster than the suggested code. Darren Glosemeyer Wolfram Research