MathGroup Archive: November 2009 [00708]

[Date Index] [Thread Index] [Author Index]
Re: More Efficient Method
To: mathgroup at smc.vnet.net
Subject: [mg105138] Re: [mg105076] More Efficient Method
From: Leonid Shifrin <lshifr at gmail.com>
Date: Sun, 22 Nov 2009 06:09:04 -0500 (EST)
References: <200911201138.GAA03384@smc.vnet.net>
Hi Brian,

This is a follow-up to my previous post. Since you mentioned performance on
large datasets as a key objective, I further optimized the code of my
previous post. It now gives 3-4 times the speed of the solutions based on
Select/IntervalMemberQ,
at least in the region of the problem's parameter space that I was able to
test.  Here is the code:

-------------------------------------------------------------------------------

Clear[deleteRegionsNew];
deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] :=
  With[{snum  =
     If[splitNumber === Automatic,
      50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]},
   Join @@
    Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]];

Clear[partitionWithTail];
partitionWithTail[l_List, size_Integer] :=
  If[IntegerQ[Length[l]/size],
     #,
     Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@
   Partition[l, size];

Clear[deleteRegionsAux];
deleteRegionsAux[x_, regs_] :=
  Module[{sorted, ord, sortedRegs, ones , xl, minn = Min[x] - 2,
    maxx = Max[x] + 2},
   xl = Join[x, {minn, maxx}];
   sortedRegs =
    Sort@DeleteCases[
      Clip[regs, {minn + 1, maxx - 1}], {minn + 1,
        minn + 1} | {maxx - 1, maxx - 1}];
   ones = ConstantArray[1, {Length[sortedRegs]}];
   sorted = xl[[ord = Ordering[xl]]];
   Drop[ xl[[Sort[ord[[#]]]]], -2] &@
    Complement[Range[Length[sorted]],
     Apply[Join,
      Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones],
           bsearchMinMassive[sorted, #2, ones]} & @@
         Transpose[sortedRegs]]]]];

Clear[bsearchMinMassive];
bsearchMinMassive =
  Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
   Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
     m = ones, diff = ones, un1 = ones, un2 = ones},
    n1 = Length[list]*n0;
    While[Sign[n0 - n1] != ones,
     m = Floor[(n0 + n1)/2];
     un1 = Floor@UnitStep[list[[m]] - elems];
     un2 = ones - un1;
     n1 = n1*un2 + (m - 1)*un1;
     n0 = n0*un1 + (m + 1)*un2;
     ];
    Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]];

Clear[bsearchMaxMassive];
bsearchMaxMassive =
  Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
   Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
     m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones},
    n1 = Length[list]*n0;
    While[Sign[n0 - n1] != ones,
     m = Floor[(n0 + n1)/2];
     un1 = Floor@UnitStep[elems - list[[m]]];
     un2 = ones - un1;
     n1 = n1*un1 + (m - 1)*un2;
     n0 = n0*un2 + (m + 1)*un1;
     ];
    Floor[m + UnitStep[elems - list[[m]]]]]];


(* Daniel's solution for comparison*)

Clear[subsets2];
subsets2[data_?VectorQ, tarList_?ListQ] :=
 Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]},
  Select[data, ! IntervalMemberQ[intv, #] &]]

----------------------------------------------------------------------------------

Benchmarking on a large dataset:

In[1]:=
datalist=RandomInteger[1100000,6500000];
targetlist=Table[{n,n+20},{n,100,6500000,50}];

In[2]:= Length[targetlist]

Out[2]= 129999

In[3]:=

Timing[resultdata=subsets2[datalist,targetlist];]

Out[3]= {88.797,Null}

In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];]

Out[4]= {26.629,Null}

In[5]:= resultdata===resultdata2

Out[5]= True

In[6]:=
datalist = RandomInteger[1100000, 6500000];
targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}];

In[7]:= Length[targetlist]

Out[7]= 1300

In[8]:=
Timing[resultdata = subsets2[datalist, targetlist];]

Out[8]= {46.927, Null}

In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];]

Out[9]= {10.625, Null}

In[10]:= resultdata === resultdata2

Out[10]= True

My code uses one tuning parameter - the third optional parameter which
determines the size of the data chunk to split the original data list into.
I use some heuristics to set it up when it is not explicitly set.

The main idea of the implementation is to sort the initial dataset, use
binary search to determine the starting and ending positions of elements in
sorted set which get into the holes, and then extract only those elements
which do not. I can expand this description upon request.


Regards,
Leonid



On Fri, Nov 20, 2009 at 3:38 AM, blamm64 <blamm64 at charter.net> wrote:

> I have a couple of functions designed to poke a single hole, and to
> poke multiple holes, in a one-level list:
>
> We define a function which, given the imported pressure data, finds
> the subset of that pressure data excluding the pressure data points
> between "targetL " and "targetU".
>
> In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?
> NumericQ] := Select[data,(#<=targetL || #>=targetU &)]
>
> This function will pluck out multiple holes in the data list.
>
> In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
> tmp=data;
> Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
> {i,Dimensions[tarList][[1]]}];
> tmp
> ]
>
> The following works fine (big holes chosen not to give large result):
>
> In[7]:= datalist=Range[11,3411,10];
>
> In[12]:= targetlist={{40, 1500},{1600,3300}};
>
> In[13]:= resultdata=subsets[datalist,targetlist]
>
> Out[13]=
>
> {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411}
>
> But if "datalist" happens to be very large, surely there is a (much)
> more efficient method?
>
> I tried unsuccessfully to use pure functions with Select, but have a
> somewhat nebulous feeling there's a pure function way of doing this
> effectively much more efficiently.
>
> I know, I know: the above have no consistency checking.  I also know
> "subsets" could be used in place of "findsubset" just by replacing the
> call of "findsubset" with the code of "findsubset" in "subsets".
>
> >From what I've seen on this forum there are some really experienced
> people who might provide an efficient way of implementing the above.
>
> -Brian L.
>
>
References:
- More Efficient Method
  - From: blamm64 <blamm64@charter.net>
Prev by Date: Re: I broke the sum into pieces
Next by Date: Not all points plot on my graph...
Previous by thread: Re: More Efficient Method
Next by thread: Re: More Efficient Method