Re: More Efficient Method
- To: mathgroup at smc.vnet.net
- Subject: [mg105159] Re: More Efficient Method
- From: Ray Koopman <koopman at sfu.ca>
- Date: Mon, 23 Nov 2009 06:49:54 -0500 (EST)
- References: <200911201138.GAA03384@smc.vnet.net> <heb6hm$aja$1@smc.vnet.net>
If you want something less complex than Leonid's code, here are two
simple speedups for the IntervalMemberQ approach.

First, it seems to run a little faster with integer limits than with
real limits.

datalist = Table[Random[Integer,11*^5],{m = 1*^6}];
targetlist = Table[{n,n+20},{n,100,m,50}];

Block[{realreject = Interval@@({.5,-.5}+#&)/@targetlist},
 Select[datalist,!IntervalMemberQ[realreject,#]&]] //Length //Timing

{9.75 Second, 654464}

Block[{integereject = Interval@@({1,-1}+#&)/@targetlist},
 Select[datalist,!IntervalMemberQ[integereject,#]&]] //Length //Timing

{8.69 Second, 654464}

Second, if you expect to keep more items than you drop, then it's
probably faster to look for the donuts than the holes.

Block[{keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1},
    Flatten@targetlist,{Max[datalist,targetlist]+1}], 2]},
 Select[datalist,IntervalMemberQ[keep,#]&]] //Length //Timing

{7.47 Second, 654464}
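To see what that keep interval looks like, here is a minimal check on
Brian's small example from his original post (the construction assumes
the target regions are sorted and non-overlapping, as they are in all
the examples in this thread):

datalist = Range[11,3411,10];
targetlist = {{40,1500},{1600,3300}};

keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1},
   Flatten@targetlist,{Max[datalist,targetlist]+1}], 2]

Interval[{10,40},{1500,1600},{3300,3412}]

Select[datalist,IntervalMemberQ[keep,#]&] reproduces Brian's Out[13]:
each hole becomes a gap between closed keep intervals, and the hole
endpoints stay in the kept set, matching the original findsubset
semantics (# <= targetL || # >= targetU).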
On Nov 22, 3:16 am, Leonid Shifrin <lsh... at gmail.com> wrote:
> Hi Brian,
>
> This is a follow-up to my previous post. Since you mentioned performance
> on large datasets as a key objective, I further optimized the code of my
> previous post. It now gives 3-4 times the speed of the solutions based
> on Select/IntervalMemberQ, at least in the region of the problem's
> parameter space that I was able to test. Here is the code:
>
> ----------------------------------------------------------------------
>
> Clear[deleteRegionsNew];
> deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] :=
>   With[{snum =
>      If[splitNumber === Automatic,
>       50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]},
>    Join @@
>     Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]];
>
> Clear[partitionWithTail];
> partitionWithTail[l_List, size_Integer] :=
>   If[IntegerQ[Length[l]/size],
>     #,
>     Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@
>    Partition[l, size];
>
> Clear[deleteRegionsAux];
> deleteRegionsAux[x_, regs_] :=
>   Module[{sorted, ord, sortedRegs, ones, xl, minn = Min[x] - 2,
>     maxx = Max[x] + 2},
>    xl = Join[x, {minn, maxx}];
>    sortedRegs =
>     Sort@DeleteCases[
>       Clip[regs, {minn + 1, maxx - 1}], {minn + 1,
>        minn + 1} | {maxx - 1, maxx - 1}];
>    ones = ConstantArray[1, {Length[sortedRegs]}];
>    sorted = xl[[ord = Ordering[xl]]];
>    Drop[xl[[Sort[ord[[#]]]]], -2] &@
>     Complement[Range[Length[sorted]],
>      Apply[Join,
>       Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones],
>          bsearchMinMassive[sorted, #2, ones]} & @@
>         Transpose[sortedRegs]]]]];
>
> Clear[bsearchMinMassive];
> bsearchMinMassive =
>   Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
>    Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
>      m = ones, diff = ones, un1 = ones, un2 = ones},
>     n1 = Length[list]*n0;
>     While[Sign[n0 - n1] != ones,
>      m = Floor[(n0 + n1)/2];
>      un1 = Floor@UnitStep[list[[m]] - elems];
>      un2 = ones - un1;
>      n1 = n1*un2 + (m - 1)*un1;
>      n0 = n0*un1 + (m + 1)*un2;
>      ];
>     Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]];
>
> Clear[bsearchMaxMassive];
> bsearchMaxMassive =
>   Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
>    Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
>      m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones},
>     n1 = Length[list]*n0;
>     While[Sign[n0 - n1] != ones,
>      m = Floor[(n0 + n1)/2];
>      un1 = Floor@UnitStep[elems - list[[m]]];
>      un2 = ones - un1;
>      n1 = n1*un1 + (m - 1)*un2;
>      n0 = n0*un2 + (m + 1)*un1;
>      ];
>     Floor[m + UnitStep[elems - list[[m]]]]]];
>
> (* Daniel's solution for comparison *)
>
> Clear[subsets2];
> subsets2[data_?VectorQ, tarList_?ListQ] :=
>   Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]},
>    Select[data, ! IntervalMemberQ[intv, #] &]];
>
> ----------------------------------------------------------------------
>
> Benchmarking on a large dataset:
>
> In[1]:= datalist=RandomInteger[1100000,6500000];
>         targetlist=Table[{n,n+20},{n,100,6500000,50}];
>
> In[2]:= Length[targetlist]
> Out[2]= 129999
>
> In[3]:= Timing[resultdata=subsets2[datalist,targetlist];]
> Out[3]= {88.797,Null}
>
> In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];]
> Out[4]= {26.629,Null}
>
> In[5]:= resultdata===resultdata2
> Out[5]= True
>
> In[6]:= datalist = RandomInteger[1100000, 6500000];
>         targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}];
>
> In[7]:= Length[targetlist]
> Out[7]= 1300
>
> In[8]:= Timing[resultdata = subsets2[datalist, targetlist];]
> Out[8]= {46.927, Null}
>
> In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];]
> Out[9]= {10.625, Null}
>
> In[10]:= resultdata === resultdata2
> Out[10]= True
>
> My code uses one tuning parameter: the third optional argument, which
> determines the size of the chunks that the original data list is split
> into. I use a heuristic to set it when it is not given explicitly.
>
> The main idea of the implementation is to sort the initial dataset,
> use binary search to find the starting and ending positions, in the
> sorted set, of the elements that fall into the holes, and then extract
> only the elements that do not. I can expand on this description upon
> request.
>
> Regards,
> Leonid
>
> On Fri, Nov 20, 2009 at 3:38 AM, blamm64 <blam... at charter.net> wrote:
>
>> I have a couple of functions designed to poke a single hole, and to
>> poke multiple holes, in a one-level list.
>>
>> We define a function which, given the imported pressure data, finds
>> the subset of that pressure data excluding the pressure data points
>> between "targetL" and "targetU":
>>
>> In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?NumericQ] :=
>>   Select[data,(#<=targetL || #>=targetU &)]
>>
>> This function will pluck out multiple holes in the data list:
>>
>> In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
>>   tmp=data;
>>   Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
>>    {i,Dimensions[tarList][[1]]}];
>>   tmp
>>   ]
>>
>> The following works fine (big holes chosen so as not to give a large
>> result):
>>
>> In[7]:= datalist=Range[11,3411,10];
>>
>> In[12]:= targetlist={{40, 1500},{1600,3300}};
>>
>> In[13]:= resultdata=subsets[datalist,targetlist]
>>
>> Out[13]= {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,
>>   3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411}
>>
>> But if "datalist" happens to be very large, surely there is a (much)
>> more efficient method?
>>
>> I tried unsuccessfully to use pure functions with Select, but have a
>> somewhat nebulous feeling that there's a pure-function way of doing
>> this much more efficiently.
>>
>> I know, I know: the above have no consistency checking. I also know
>> "subsets" could be used in place of "findsubset" just by replacing the
>> call of "findsubset" with the code of "findsubset" in "subsets".
>>
>> From what I've seen on this forum there are some really experienced
>> people who might provide an efficient way of implementing the above.
>>
>> -Brian L.
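P.S. For anyone who wants the gist of Leonid's approach without the
Compile machinery, here is a simplified, uncompiled sketch of the same
sort-and-binary-search idea. The names and code below are mine, not
Leonid's, and the sketch omits his chunking, clipping, and vectorized
searches, so expect it to be much slower:

(* position of the first element of a sorted list that is > v,
   or Length[list]+1 if there is none *)
firstAbove[list_, v_] := Module[{lo = 1, hi = Length[list] + 1, m},
  While[lo < hi, m = Quotient[lo + hi, 2];
   If[list[[m]] > v, hi = m, lo = m + 1]];
  lo];

(* position of the first element of a sorted list that is >= v *)
firstAtLeast[list_, v_] := Module[{lo = 1, hi = Length[list] + 1, m},
  While[lo < hi, m = Quotient[lo + hi, 2];
   If[list[[m]] >= v, hi = m, lo = m + 1]];
  lo];

deleteRegionsSketch[x_?VectorQ, regs_?ListQ] :=
 Module[{ord = Ordering[x], sorted, dropPos, keepPos},
  sorted = x[[ord]];
  (* positions in the sorted data that fall strictly inside
     some hole {l, u}, i.e. l < element < u *)
  dropPos = Flatten[Range[firstAbove[sorted, #[[1]]],
      firstAtLeast[sorted, #[[2]]] - 1] & /@ regs];
  keepPos = Complement[Range[Length[x]], dropPos];
  (* Sort restores the original order, as in Leonid's code *)
  x[[Sort[ord[[keepPos]]]]]];

deleteRegionsSketch[Range[11,3411,10], {{40,1500},{1600,3300}}]
reproduces Brian's Out[13], and on integer data it should return the
same result as subsets2. Leonid's compiled bsearchMin/MaxMassive
functions do these searches for all the holes at once on packed
arrays, which appears to be where most of the speed comes from.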
- References:
  - More Efficient Method
    - From: blamm64 <blamm64@charter.net>