[Date Index]
[Thread Index]
[Author Index]
Re: Extracting some elements, members of another list
*To*: mathgroup at smc.vnet.net
*Subject*: [mg112421] Re: Extracting some elements, members of another list
*From*: Leonid Shifrin <lshifr at gmail.com>
*Date*: Wed, 15 Sep 2010 04:37:41 -0400 (EDT)
Hi,
The following function:
Clear[memberPositions];
memberPositions[x_List, y_List] :=
Module[{tag},
Pick[Range[Length[x]],
Replace[x, Dispatch[Thread[Rule[Intersection[x, y], tag]]], {1}],
tag]]
Is a fast way to find positions in list <x> which are also in <y>, for
arbitrary lists <x> and <y>. Using it, it worked in about 5 seconds on my
machine (M7.0 32bit WinXP AMD Phenom II - 2800 Ghz). Here is how you do it:
Model your lists:
In[16]:=
list1 =
Transpose[{RandomInteger[{1000000,9999999},1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];
In[17]:= list2 = RandomSample[list1[[All,1]],500000];
Get the result:
In[18]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing
Out[18]=
{5.203,{{9506759,data1},{3854419,data2},{1093656,data3},<<526944>>,{6991722,data999997},{9278309,data1000000}}}
You get more elements here because obviously some phone numbers were
randomly generated more than
once. Be aware though that such large lists consume lots of memory - about
320 Mb here.
The phone numbers could be strings or whatever, the memberPositions function
does not assume anything
about them. With this method, you have an added advantage that the resulting
list is in the original order -
no need to resort. It can also handle repeated phone numbers.
Now, if all of the following conditions are satisfied:
1. Your phone numbers can be represented as machine-precision integers
2. Your list list2 contains only phone numbers from list1
3. The list list1 does not contain repeated phone numbers (list2 may)
Then it is possible to use the fact that Sort and Ordering are very fast on
packed arrays. Additional idea here
is to combine vectorized operations on packed arrays with Compile (this
combination often results in great
performance in Mathematica), to get a version of a binary search which finds
positions of many elements in a sorted list at once.The initial list of
phone numbers is ordered first, then positions of elements in list2 are
found by a massive binary search function, and finally the positions in the
original unsorted list are recovered. The resulting more restrictive version
of memberPositions will be about 3 times faster than the above general one
for the
lists of the size you requested:
Clear[memberPositionsInteger];
Module[{bsearchMassiveAux, bsearchMassive},
bsearchMassiveAux =
Compile[{{list, _Integer, 1}, {elems, _Integer,
1}, {ones, _Integer, 1}},
Module[{len = Length[ones], n1 = ones, n0 = ones, ctr = 0,
m = ones,
diff = ones, un1 = ones, un2 = ones, flags = ones,
zeros = 0*ones},
n1 = Length[list]*n0;
While[flags != zeros,
m = m + (Floor[(n0 + n1)/2] - m)*Unitize[flags];
flags = list[[m]] - elems;
un1 = UnitStep[flags - 1];
un2 = UnitStep[-flags - 1];
n1 = n1*un2 + (m - 1)*un1;
n0 = n0*un1 + (m + 1)*un2;];
m]];
bsearchMassive[main_, elems_] :=
bsearchMassiveAux[main, elems, ConstantArray[1, {Length[elems]}]];
memberPositionsInteger[x_, y_] :=
With[{xpacked = Developer`ToPackedArray[x],
ypacked = Developer`ToPackedArray[y]},
With[{ord = Ordering[xpacked]},
Sort@ord[[bsearchMassive[xpacked[[ord]], ypacked]]]]]
];
Here is the timing comparison on a fresh kernel:
In[5]:= list1 =
Transpose[{RandomSample[Range[1000000,9999999],1000000],Table[ToExpression["data"<>ToString[i]],{i,1,1000000}]}];
In[6]:= list2 = RandomSample[list1[[All,1]],500000];
In[7]:= list1[[memberPositionsInteger[list1[[All,1]],list2]]]//Short//Timing
Out[7]=
{2.015,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}
In[8]:= list1[[memberPositions[list1[[All,1]],list2]]]//Short//Timing
Out[8]=
{5.485,{{3301095,data2},{6853443,data5},{3638770,data8},<<499995>>,{8382696,data999998},{2135145,data999999}}}
So, here we are down from your 5 hours to about 2 seconds, which looks
pretty good to me. With your Select-MemberQ method, one can not expect fast
results since its complexity is ~Length[list1]*Length[list2]. The
complexity of the memberPositionsInteger is roughly
C1*Log[Length[list1]]*Length[list2] +
C2*Log[Length[list1]]*Length[list1]+C3*Log[Length[list2]]*Length[list2], but
C2 and C3 are pretty low since they are controlled by the built-in code, and
C1 is not too high either since both Compile and packed arrays have been
fully utilized in bsearchMassiveAux. In the case of memberPositions, the
bottleneck becomes the Dispatch-based rule application, since it is
hash-table based and, although liner in Length[x1], has a relatively large
time constant. OTOH, it is much more general, and the code is simpler.
Hope this helps.
Regards,
Leonid
On Tue, Sep 14, 2010 at 1:14 PM, Nacho <ncc1701zzz at gmail.com> wrote:
> Hi!
>
> I would like your advice about this problem:
>
> I have a list of elements, consisting in phone numbers with some data:
>
> list1= { { phone1, data1}, {phone2, data2}, {phone3, data3} .... }
>
> And a list of phones, a subset of the phone numbers in list1
>
> list2= {phone2, phone3, phone7... }
>
> I'd like to extract the elements from list1 whose phone numbers are
> present in list 2, that is:
>
> result= { { phone2, data2}, {phone3, data3, {phone7, data7} .... }
>
> I've used this with small lists and it works fine:
>
> result = Select[list1, MemberQ[list2, #[[1]]] &];
>
> The problem is that now I would like to use it with big lists. list1
> is over 1.000.000 elements long and list2 is about 500.000 elements
> long. Ordering is not a problem, I could resort the lists.
>
> Any hint to extract this list faster? It seems to take a lot of time
> (estimation is about 5 hours and I have to do it repeatedly)
>
>
> Thanks!
>
>
Prev by Date:
** Re: Extracting some elements, members of another list**
Next by Date:
**Re: PolarPlot3D?**
Previous by thread:
** Re: Extracting some elements, members of another list**
Next by thread:
**Re: Extracting some elements, members of another list**
| |