MathGroup Archive: September 2009 [00492]

[Date Index] [Thread Index] [Author Index]

Re: Speed up calculating the pair correlation function

To: mathgroup at smc.vnet.net
Subject: [mg103367] Re: [mg103335] Speed up calculating the pair correlation function
From: Daniel Lichtblau <danl at wolfram.com>
Date: Thu, 17 Sep 2009 06:22:17 -0400 (EDT)
References: <200909160946.FAA12977@smc.vnet.net>

Szabolcs Horvát wrote:
> Hello,
> 
> I would like to calculate the pair correlation function normalized to 1 
> for some 2D point data.  I.e. I need to find the mean density of points 
> at distance r from any point, normalized to 1.
> 
> I am looking for advice on speeding this up.
> 
> This is the current implementation I have:
> 
> The pcfOne function calculates the mean density of 'points' at distance 
> r from one single point ('origin'), up to 'rmax' in steps of 'dr'. 
> 'density' is the average density of all points over the complete region 
> (since the shape of the region is unknown to the function, this quantity 
> is passed separately):
> 
> pcfOne[points_, origin_, dr_, rmax_, density_] :=
>   Module[{hist},
>    hist = BinCounts[
>      With[{v = # - origin}, Sqrt[v.v]] & /@ points,
>      {0, rmax, dr}];
>    Transpose[
>     {Range[0, rmax - dr, dr] + dr/2,
>      hist/(Pi (dr^2 + 2 dr Range[0, rmax - dr, dr]) density)}
>    ]
>   ]
> 
> Now we can select a subset of the points, calculate this function for 
> all of them and average the results.  For randomly distributed points 
> the result will be a constant function of value 1 (at least until we get 
> too close to the edge of the region):
> 
> data = RandomReal[1, {50000, 2}];
> 
> ListPlot[
>    Mean[
>     pcfOne[data, #, 0.05, 0.5, Length[data]] & /@
>       Nearest[data, {.5, .5}, 1000]
>    ],
> 
>    PlotRange -> {0, 2}, Axes -> False, Frame -> True
> ]
> 
> This runs in 80 seconds on my machine.  I would like to use this 
> function on datasets of up to 300,000 points and average over more than 
> just 1000 points near the middle, say 10000.  That would take 60 times 
> as long, ~80 minutes, which is way too much.
> 
> Is it possible to speed this up significantly?

If you can settle for an approximation, under some conditions the method 
below can be faster.

(1) Consider your domain as a grid. Place points into a set of discrete 
boxes. Keep track of how many are in each box.

(2) For each origin, iterate over boxes, find distance from box center 
to origin, add box's point count to appropriate bin.

The idea is that with low dimension (e.g. 2 or perhaps 3), and 
sufficiently many points, and not too fine a binning delta, this will 
both give a reasonable approximation and be faster to compute.

Code can be done as below, though there may be still faster approaches.

boxCountsC = Compile[{{pts,_Real,2}, {gran,_Real,1},
   {xrng,_Real,1}, {yrng,_Real,1}}, Module[
     {res, pt, x, y},
     res = Table[0, {j,xrng[[1]],xrng[[2]]-gran[[1]],gran[[1]]},
       {k,yrng[[1]],yrng[[2]]-gran[[2]],gran[[2]]}];
       Do [pt = pts[[j]];
         x = Ceiling[(pt[[1]]-xrng[[1]])/gran];
         y = Ceiling[(pt[[2]]-yrng[[1]])/gran];
         res[[x,y]]++
         , {j,Length[pts]}];
     res
     ]];

pcfOne2C = Compile[{{bins,_Integer,2}, {orig,_Real,1}, {dr,_Real},
   {rmax,_Real}, {density,_Real}, {gran,_Real,1},
   {xrng,_Real,1}, {yrng,_Real,1}}, Module[
     {res, rng, d},
     res = Table[0.,{Ceiling[rmax/dr]}];
     Do [
       d = Ceiling[(Norm[orig -
         {xrng[[1]]+(j-.5)*gran[[1]],yrng[[1]]+(k-.5)*gran[[2]]}])/dr];
         If [d<=Length[res], res[[d]] += bins[[j,k]]];
         , {j,Length[bins]}, {k,Length[bins[[j]]]}
       ];
     rng = Range[0, rmax-dr, dr];
     Transpose[{rng + dr/2., res/(3.14*(dr^2 + 2*dr*rng)*density)}]
     ]];

Let's try your example. I'll use boxes of length .01 for the gridding.

data = RandomReal[1, {50000, 2}];
xrng = {0.,1.};
yrng = {0.,1.};
gran = {.01,.01};

Your code:

Timing[res2 = Mean[
     pcfOne[data, #, 0.05, 0.5, Length[data]] & /@
       Nearest[data, {.5, .5}, 1000]];]
Out[126]= {23.4904, Null}

Above version:

Timing[bc = boxCountsC[data, gran, xrng, yrng];]
Out[128]= {0.122981, Null}

nlist = Nearest[data, {.5, .5}, 1000];

Timing[res = Mean[pcfOne2C[bc, #, 0.05, 0.5, N[Length[data]],
   gran, xrng, yrng] & /@ nlist];]
Out[130]= {7.30289, Null}

The differences are modest.

In[131]:= res-res2
Out[131]= {{0., 0.000729011}, {0., -0.000365749}, {0., 0.000407047},
   {0., 0.00064568}, {0., 0.0013471}, {0., 0.000550493},
   {0., 0.000781529},
  {0., 0.000437688}, {0., 0.000663624}, {0., 0.00048734}}

If I use ten times as many points, the initial gridding takes ten times 
longer, but the pcfOne2C iteration is just as fast as before (it is 
affected by the granularity, not the number of points).

With many origins, it might make sense to grid those as well.

Daniel Lichtblau
Wolfram Research

References:
- Speed up calculating the pair correlation function for 2D point data
  - From: Szabolcs Horvát <szhorvat@gmail.com>

Prev by Date: Re: Replace in operators once again

Next by Date: Re: Wolfram workbench on MacOsX

Previous by thread: Re: Speed up calculating the pair correlation function for

Next by thread: Replace in operators once again