Re: Extracting some elements, members of another list
- To: mathgroup at smc.vnet.net
- Subject: [mg112502] Re: Extracting some elements, members of another list
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Fri, 17 Sep 2010 06:43:51 -0400 (EDT)
Ray, I have no good explanation for this. I can confirm most of your observations, except that in my case the timing stabilizes around 8 seconds, which is about 2x of the first (fresh-kernel) result. I suspect that this has to do with some shared memory (references). In particular, clearing history like so: Unprotect[Out]; DownValues[Out] = {}; Protect[Out]; Seems to help a little. The other thing which comes to mind is that this could be due to some internals of Dispatch. When we use Dispatch, some hash-table is created internally. When we abandon a local variable which stored the Dispatched rules, presumably the hash table has to also be destroyed. Question is, when does this happen, and in case if it is not destroyed timely, would it slow down the operation of another Dispatch in some way. I did not observe such behavior before, but here Dispatch was used on a pretty large list, so may be this effect is only visible for large sets of data. OTOH, my code using Dispatch only: In[28]:= list21/.Dispatch[Thread[#[[All,1]]->#]&[list11]]//Short//Timing Out[28]= {3.312,{{2707844,data121541},{4171257,data402011},<<499996>>,{9613764,data217507},{6452511,data971822}}} Does not seem to suffer much from this problem, so it may be due to the entanglement of Dispatch and Module removing its local variable <tag> from the symbol table. Yet another reason is that the way I generate (model) data, I create 1000000 symbols in the Global` context. This seems to complicate the symbol lookup. If, for the sake of example, I modify the memberPositions function as: Clear[memberPositions]; memberPositions[x_List, y_List] := Pick[Range[Length[x]], Replace[x, Dispatch[Thread[Rule[Intersection[x, y], "tag"]]], {1}], "tag"] and also generate the data as strings rather than symbols: list11 = Transpose[{RandomInteger[{1000000, 9999999}, 1000000], Table["data" <> ToString[i], {i, 1, 1000000}]}]; list21 = RandomSample[list11[[All, 1]], 500000]; Then the problem is much less pronounced (you must re-start the kernel to see it). I suspect that with so many symbols in the Global`, some kind of re-hashing may happen as a result of running memberPositions (since it has a local variable <tag>, which temporarily enters the symbol table), and possibly these re-hashs can accumulate and lead to the overall slow-down. Anyways, I subscribe to your question and hope that someone will enlighten us. Regards, Leonid On Thu, Sep 16, 2010 at 2:01 PM, Ray Koopman <koopman at sfu.ca> wrote: > On Sep 15, 1:38 am, Leonid Shifrin <lsh... at gmail.com> wrote: > > Hi, > > > > The following function: > > > > Clear[memberPositions]; > > memberPositions[x_List, y_List] := > > Module[{tag}, > > Pick[Range[Length[x]], > > Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}], > > tag]] > > > > Is a fast way to find positions in list <x> which are also in <y>, for > > arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my > > machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do > it: > > > > Model your lists: > > > > In[16]:= list1 = Transpose[ > > {RandomInteger[{1000000,9999999},1000000], > > Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}]; > > > > In[17]:= list2 = RandomSample[list1[[All,1]],500000]; > > > > Get the result: > > > > In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing > > Out[18]= {5.203,{{9506759,data1},{3854419,data2},{1093656,data3}, > > <<526944>>,{6991722,data999997},{9278309,data1000000}}} > > This is very quick the first time thru, but each successive call > slows by an increasing amount. Here are 5 sequential sample times > for the same two input lists: {4.84, 5.34, 6.39, 7.21, 8.95}. > What's causing the slowdown, and how can it be fixed? > >