Re: More Efficient Method
- To: mathgroup at smc.vnet.net
- Subject: [mg105159] Re: More Efficient Method
- From: Ray Koopman <koopman at sfu.ca>
- Date: Mon, 23 Nov 2009 06:49:54 -0500 (EST)
- References: <200911201138.GAA03384@smc.vnet.net> <heb6hm$aja$1@smc.vnet.net>
If you want something less complex than Leonid's code, here are two
simple speedups for the IntervalMemberQ approach.

First, it seems to run a little faster with integer limits than with
real limits.

datalist = Table[Random[Integer,11*^5],{m = 1*^6}];
targetlist = Table[{n,n+20},{n,100,m,50}];

Block[{realreject = Interval@@({.5,-.5}+#&)/@targetlist},
 Select[datalist,!IntervalMemberQ[realreject,#]&]] //Length //Timing

{9.75 Second, 654464}

Block[{integereject = Interval@@({1,-1}+#&)/@targetlist},
 Select[datalist,!IntervalMemberQ[integereject,#]&]] //Length //Timing

{8.69 Second, 654464}

Second, if you expect to keep more items than you drop, then it's
probably faster to look for the donuts than the holes.

Block[{keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1},
    Flatten@targetlist,{Max[datalist,targetlist]+1}], 2]},
 Select[datalist,IntervalMemberQ[keep,#]&]] //Length //Timing

{7.47 Second, 654464}
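To see what that keep interval looks like, here is a minimal check on
Brian's small example from his original post (the construction assumes
the target regions are sorted and non-overlapping, as they are in all
the examples in this thread):

datalist = Range[11,3411,10];
targetlist = {{40,1500},{1600,3300}};

keep = Interval@@Partition[Join[{Min[datalist,targetlist]-1},
   Flatten@targetlist,{Max[datalist,targetlist]+1}], 2]

Interval[{10,40},{1500,1600},{3300,3412}]

Select[datalist,IntervalMemberQ[keep,#]&] reproduces Brian's Out[13]:
each hole becomes a gap between closed keep intervals, and the hole
endpoints stay in the kept set, matching the original findsubset
semantics (# <= targetL || # >= targetU).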
On Nov 22, 3:16 am, Leonid Shifrin <lsh... at gmail.com> wrote:
> Hi Brian,
>
> This is a follow-up to my previous post. Since you mentioned performance
> on large datasets as a key objective, I further optimized the code of my
> previous post. It now gives 3-4 times the speed of the solutions based
> on Select/IntervalMemberQ, at least in the region of the problem's
> parameter space that I was able to test. Here is the code:
>
> ----------------------------------------------------------------------
>
> Clear[deleteRegionsNew];
> deleteRegionsNew[x_?VectorQ, regs_?ListQ, splitNumber_: Automatic] :=
>   With[{snum =
>      If[splitNumber === Automatic,
>       50000*(1 + IntegerPart[Length[regs]/10000]), splitNumber]},
>    Join @@
>     Map[deleteRegionsAux[#, regs] &, partitionWithTail[x, snum]]];
>
> Clear[partitionWithTail];
> partitionWithTail[l_List, size_Integer] :=
>   If[IntegerQ[Length[l]/size],
>     #,
>     Append[#, Drop[l, Length[Flatten[#, 1]]]]] &@
>    Partition[l, size];
>
> Clear[deleteRegionsAux];
> deleteRegionsAux[x_, regs_] :=
>   Module[{sorted, ord, sortedRegs, ones, xl, minn = Min[x] - 2,
>     maxx = Max[x] + 2},
>    xl = Join[x, {minn, maxx}];
>    sortedRegs =
>     Sort@DeleteCases[
>       Clip[regs, {minn + 1, maxx - 1}], {minn + 1,
>        minn + 1} | {maxx - 1, maxx - 1}];
>    ones = ConstantArray[1, {Length[sortedRegs]}];
>    sorted = xl[[ord = Ordering[xl]]];
>    Drop[xl[[Sort[ord[[#]]]]], -2] &@
>     Complement[Range[Length[sorted]],
>      Apply[Join,
>       Range @@@ Transpose[{bsearchMaxMassive[sorted, #1, ones],
>          bsearchMinMassive[sorted, #2, ones]} & @@
>         Transpose[sortedRegs]]]]];
>
> Clear[bsearchMinMassive];
> bsearchMinMassive =
>   Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
>    Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
>      m = ones, diff = ones, un1 = ones, un2 = ones},
>     n1 = Length[list]*n0;
>     While[Sign[n0 - n1] != ones,
>      m = Floor[(n0 + n1)/2];
>      un1 = Floor@UnitStep[list[[m]] - elems];
>      un2 = ones - un1;
>      n1 = n1*un2 + (m - 1)*un1;
>      n0 = n0*un1 + (m + 1)*un2;
>      ];
>     Floor[m - UnitStep[list[[m*UnitStep[m]]] - elems]]]];
>
> Clear[bsearchMaxMassive];
> bsearchMaxMassive =
>   Compile[{{list, _Real, 1}, {elems, _Real, 1}, {ones, _Integer, 1}},
>    Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
>      m = ones, diff = ones, un1 = ones, un2 = ones, m1 = ones},
>     n1 = Length[list]*n0;
>     While[Sign[n0 - n1] != ones,
>      m = Floor[(n0 + n1)/2];
>      un1 = Floor@UnitStep[elems - list[[m]]];
>      un2 = ones - un1;
>      n1 = n1*un1 + (m - 1)*un2;
>      n0 = n0*un2 + (m + 1)*un1;
>      ];
>     Floor[m + UnitStep[elems - list[[m]]]]]];
>
> (* Daniel's solution for comparison *)
>
> Clear[subsets2];
> subsets2[data_?VectorQ, tarList_?ListQ] :=
>   Module[{intv = Apply[Interval, Map[# + {.5, -.5} &, tarList]]},
>    Select[data, ! IntervalMemberQ[intv, #] &]];
>
> ----------------------------------------------------------------------
>
> Benchmarking on a large dataset:
>
> In[1]:= datalist=RandomInteger[1100000,6500000];
>         targetlist=Table[{n,n+20},{n,100,6500000,50}];
>
> In[2]:= Length[targetlist]
> Out[2]= 129999
>
> In[3]:= Timing[resultdata=subsets2[datalist,targetlist];]
> Out[3]= {88.797,Null}
>
> In[4]:= Timing[resultdata2=deleteRegionsNew[datalist,targetlist];]
> Out[4]= {26.629,Null}
>
> In[5]:= resultdata===resultdata2
> Out[5]= True
>
> In[6]:= datalist = RandomInteger[1100000, 6500000];
>         targetlist = Table[{n, n + 20}, {n, 100, 6500000, 5000}];
>
> In[7]:= Length[targetlist]
> Out[7]= 1300
>
> In[8]:= Timing[resultdata = subsets2[datalist, targetlist];]
> Out[8]= {46.927, Null}
>
> In[9]:= Timing[resultdata2 = deleteRegionsNew[datalist, targetlist];]
> Out[9]= {10.625, Null}
>
> In[10]:= resultdata === resultdata2
> Out[10]= True
>
> My code uses one tuning parameter: the third optional argument, which
> determines the size of the chunks that the original data list is split
> into. I use a heuristic to set it when it is not given explicitly.
>
> The main idea of the implementation is to sort the initial dataset,
> use binary search to find the starting and ending positions, in the
> sorted set, of the elements that fall into the holes, and then extract
> only the elements that do not. I can expand on this description upon
> request.
>
> Regards,
> Leonid
>
> On Fri, Nov 20, 2009 at 3:38 AM, blamm64 <blam... at charter.net> wrote:
>
>> I have a couple of functions designed to poke a single hole, and to
>> poke multiple holes, in a one-level list.
>>
>> We define a function which, given the imported pressure data, finds
>> the subset of that pressure data excluding the pressure data points
>> between "targetL" and "targetU":
>>
>> In[5]:= findsubset[data_?VectorQ,targetL_?NumericQ,targetU_?NumericQ] :=
>>   Select[data,(#<=targetL || #>=targetU &)]
>>
>> This function will pluck out multiple holes in the data list:
>>
>> In[6]:= subsets[data_?VectorQ,tarList_?ListQ]:=Module[{tmp,tmp1},
>>   tmp=data;
>>   Do[tmp1=findsubset[tmp,tarList[[i,1]],tarList[[i,2]]];tmp=tmp1,
>>    {i,Dimensions[tarList][[1]]}];
>>   tmp
>>   ]
>>
>> The following works fine (big holes chosen so as not to give a large
>> result):
>>
>> In[7]:= datalist=Range[11,3411,10];
>>
>> In[12]:= targetlist={{40, 1500},{1600,3300}};
>>
>> In[13]:= resultdata=subsets[datalist,targetlist]
>>
>> Out[13]= {11,21,31,1501,1511,1521,1531,1541,1551,1561,1571,1581,1591,
>>   3301,3311,3321,3331,3341,3351,3361,3371,3381,3391,3401,3411}
>>
>> But if "datalist" happens to be very large, surely there is a (much)
>> more efficient method?
>>
>> I tried unsuccessfully to use pure functions with Select, but have a
>> somewhat nebulous feeling that there's a pure-function way of doing
>> this much more efficiently.
>>
>> I know, I know: the above have no consistency checking. I also know
>> "subsets" could be used in place of "findsubset" just by replacing the
>> call of "findsubset" with the code of "findsubset" in "subsets".
>>
>> From what I've seen on this forum there are some really experienced
>> people who might provide an efficient way of implementing the above.
>>
>> -Brian L.
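P.S. For anyone who wants the gist of Leonid's approach without the
Compile machinery, here is a simplified, uncompiled sketch of the same
sort-and-binary-search idea. The names and code below are mine, not
Leonid's, and the sketch omits his chunking, clipping, and vectorized
searches, so expect it to be much slower:

(* position of the first element of a sorted list that is > v,
   or Length[list]+1 if there is none *)
firstAbove[list_, v_] := Module[{lo = 1, hi = Length[list] + 1, m},
  While[lo < hi, m = Quotient[lo + hi, 2];
   If[list[[m]] > v, hi = m, lo = m + 1]];
  lo];

(* position of the first element of a sorted list that is >= v *)
firstAtLeast[list_, v_] := Module[{lo = 1, hi = Length[list] + 1, m},
  While[lo < hi, m = Quotient[lo + hi, 2];
   If[list[[m]] >= v, hi = m, lo = m + 1]];
  lo];

deleteRegionsSketch[x_?VectorQ, regs_?ListQ] :=
 Module[{ord = Ordering[x], sorted, dropPos, keepPos},
  sorted = x[[ord]];
  (* positions in the sorted data that fall strictly inside
     some hole {l, u}, i.e. l < element < u *)
  dropPos = Flatten[Range[firstAbove[sorted, #[[1]]],
      firstAtLeast[sorted, #[[2]]] - 1] & /@ regs];
  keepPos = Complement[Range[Length[x]], dropPos];
  (* Sort restores the original order, as in Leonid's code *)
  x[[Sort[ord[[keepPos]]]]]];

deleteRegionsSketch[Range[11,3411,10], {{40,1500},{1600,3300}}]
reproduces Brian's Out[13], and on integer data it should return the
same result as subsets2. Leonid's compiled bsearchMin/MaxMassive
functions do these searches for all the holes at once on packed
arrays, which appears to be where most of the speed comes from.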
- References:
  - More Efficient Method
    - From: blamm64 <blamm64@charter.net>