MathGroup Archive: February 2010 [00204]

[Date Index] [Thread Index] [Author Index]
Re: position of sequence of numbers in list
To: mathgroup at smc.vnet.net
Subject: [mg107217] Re: position of sequence of numbers in list
From: Norbert Pozar <bertapozar at gmail.com>
Date: Fri, 5 Feb 2010 07:12:57 -0500 (EST)
References: <201001301212.HAA25132@smc.vnet.net> <hk3nmu$ahf$1@smc.vnet.net>
Your conjecture can be easily verified:

In[1]:= SetSystemOptions[ "PackedArrayOptions"->{"UnpackMessage"->True}];
In[2]:= SparseArray[Range[0,2]]
Out[2]= SparseArray[<2>,{3}]
In[3]:= ArrayRules@%
During evaluation of In[3]:= Developer`FromPackedArray::unpack1:
Unpacking array with dimensions {2}. >>
Out[3]= {{2}->1,{3}->2,{_}->0}
In[4]:= ArrayRules@Range[0,2]
During evaluation of In[4]:= Developer`FromPackedArray::unpack1:
Unpacking array with dimensions {2}. >>
Out[4]= {{2}->1,{3}->2,{_}->0}

As you can see, the unpacked array has dimension {2}, i.e. it is the
list of positions stored in the SparseArray object. ArrayRules most
likely first uses SparseArray and then extracts its data, converting
it to the list of rules.

Another easy application that comes to mind is DeleteCases, using the
other piece of useful data inside SparseArray:

sparseArrayElements[HoldPattern[SparseArray[u___]]] := {u}[[4, 3]]
fastDeleteCases[vector_List, n_] :=
 sparseArrayElements[SparseArray[vector, Automatic, n]]

In[1]:= test=RandomInteger[10,1000000];
First@Timing@(r1=DeleteCases[test,5])
During evaluation of In[1]:= Developer`FromPackedArray::punpack1:
Unpacking array with dimensions {1000000} to level 1. >>
Out[2]= 0.234
In[3]:= First@Timing@(r2=fastDeleteCases[test,5])
Out[3]= 0.062
In[4]:= r1==r2
During evaluation of In[4]:= Developer`FromPackedArray::unpack:
Unpacking array in call to Equal. >>
Out[4]= True
In[5]:= Developer`PackedArrayQ[r1]
Developer`PackedArrayQ[r2]
Out[5]= False
Out[6]= True

The difference is not as impressive, but still interesting.

Anyway, I don't know how well the structure of SparseArray is known, I
made an attempt to describe it here
http://www.math.ucla.edu/~npozar/doc/sparseArrays.pdf

Best,
Norbert


On Thu, Feb 4, 2010 at 1:06 AM, Leonid Shifrin <lshifr at gmail.com> wrote:
> Norbert,
>
> what can I say - this is truly impressive. This is order of magnitude fas=
ter
> than the built-in position, for this problem. This  implies that the ma=
in
> slow-down for the previous SparseArray-based solution was due to ArrayRul=
es
> being slow for dense arrays. In retrospect, this seems natural since
> somewhere along the way something similar to unpacking must happen, since
> the output is a list of rules. Probably internally SparseArray uses a pac=
ked
> form for positions, and the way you extract them (directly with Part)
> probably avoids the unpacking stage - this is the only plausible explanat=
ion
> which comes to my mind.
>
> I checked with Developer`PackedArrayQ on the resulting position list for
> your solution, and it gives True, which supports my conjecture. Your me=
thod
> seems to be applicable to many other problems where SparseArray-based tri=
cks
> are employed. This is something truly useful, thanks a lot for sharing!
>
> Best,
> Leonid
>
> P.S.
>
> By the way, this is irrelevant for performance, but the position-extracti=
on
> function can be rewritten without HoldPattern:
>
> extractPositionFromSparseArrayAlt[(h : SparseArray)[u___]] := {u}[[4, 2=
, 2]]
>
> Of course, in any case the key is your observation that Part can not be
> directly used (since it has been internally overloaded for SparseArray
> objects).
>
>
>
>
> On Thu, Feb 4, 2010 at 4:59 AM, Norbert Pozar <bertapozar at gmail.com> wrot=
e:
>>
>> I'm sorry, I just realized that one doesn't need 1-Unitize..., this will
>> do:
>>
>> positionExtr[x_List, n_] :=
>>  Flatten@extractPositionFromSparseArray[
>>   SparseArray[Unitize[x - n], Automatic, 1]]
>>
>> Norbert
>>
>> On Wed, Feb 3, 2010 at 5:52 PM, Norbert Pozar <bertapozar at gmail.com>
>> wrote:
>> > Hi Leonid,
>> >
>> > I just found another way that outperforms both of our solutions on a
>> > wide range of integer arrays I tested. It turns out that one can
>> > extract the position directly from SparseArray, without using
>> > ArrayRules. It took me some time to figure it out:
>> >
>> > extractPositionFromSparseArray[
>> >  HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]
>> >
>> > positionExtr[x_List, n_] :=
>> >  Flatten@extractPositionFromSparseArray[
>> >   SparseArray[1 - Unitize[x - n]]]
>> >
>> > Again, no unpacking.
>> >
>> > Best,
>> > Norbert
>> >
>> > On Wed, Feb 3, 2010 at 12:21 PM, Leonid Shifrin <lshifr at gmail.com>
>> > wrote:
>> >> Hi Norbert,
>> >>
>> >> Thanks again,  nice bits of information! I admit to not having done
>> >> proper
>> >> tests and therefore
>> >> missed the slow-down of SparseArray that you mention. I think  that
>> >> your
>> >> proposed hybrid solution
>> >> is indeed the best option from what we know so far. As for Clip vs
>> >> Unitize -
>> >> I have no doubt that Unitize must be better for what it does, being a
>> >> more
>> >> specialized command in a sense.  My old solution (which I attempted=
 to
>> >> reconstruct in my previous post) used additional UnitStep or somethin=
g
>> >> to
>> >> make Clip work for reals, with some extra overhead of course.
>> >>
>> >> Regards,
>> >> Leonid
>> >>
>> >>
>> >> On Wed, Feb 3, 2010 at 11:06 PM, Norbert Pozar <bertapozar at gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Leonid,
>> >>>
>> >>> that's a nice observation. I was exploring ArrayRules too, but I fou=
nd
>> >>> out that it is too slow when the array is quite dense. I was testing
>> >>> it always on RandomInteger[1,...]. That has density 1/2. On the othe=
r
>> >>> hand, DeleteDuplicates is quite independent of the density. When the
>> >>> density drops below ~1/50, ArrayRules start performing better than
>> >>> DeleteDuplicates. So I propose the following method, since Total is
>> >>> really fast:
>> >>>
>> >>> positionComb[x_List, n_Integer] :=
>> >>>  If[50 Total[#] < Length[x], Flatten@ArrayRules[#][[;; -2, 1]],
>> >>>    Rest@DeleteDuplicates@Prepend[# Range[Length[x]], 0]] &[
>> >>>  1 - Unitize[x - n]]
>> >>>
>> >>> Some timings (by the way, the timing varies a lot, so accuracy no
>> >>> better than +-50%):
>> >>>
>> >>> In[1]:= tst=RandomInteger[1,40000];
>> >>> Timing[Do[positionNP[tst,1],{50}]][[1]]/50
>> >>> Timing[Do[myPositionNew[tst,1],{50}]][[1]]/50
>> >>> Timing[Do[positionComb[tst,1],{50}]][[1]]/50
>> >>> Out[2]= 0.005
>> >>> Out[3]= 0.04064
>> >>> Out[4]= 0.00468
>> >>>
>> >>> In[5]:= tst=RandomInteger[500,40000];
>> >>> Timing[Do[positionNP[tst,1],{50}]][[1]]/50
>> >>> Timing[Do[myPositionNew[tst,1],{50}]][[1]]/50
>> >>> Timing[Do[positionComb[tst,1],{50}]][[1]]/50
>> >>> Out[6]= 0.00218
>> >>> Out[7]= 0.00218
>> >>> Out[8]= 0.00188
>> >>>
>> >>>
>> >>> Two points:
>> >>> 1) you don't need ArrayRules@SparseArray@, ArrayRules@ is enough, ev=
en
>> >>> though there is no performance benefit =)
>> >>> 2) Unitize is better than Clip since it also works with Reals,
>> >>> Unitize[x]==0 iff  x==0
>> >>>
>> >>> > Anyways, I often find it  amazing how far one can go in speeding=
 up
>> >>> > things
>> >>> > in
>> >>> > Mathematica - sometimes it can be really fast.
>> >>> Well, this is true, but I think it'd be much better if Position was
>> >>> implemented to work fast with packed arrays of integers, or Pick or
>> >>> something ;)
>> >>>
>> >>>
>> >>> Best,
>> >>> Norbert
>> >>>
>> >>> On Wed, Feb 3, 2010 at 10:39 AM, Leonid Shifrin <lshifr at gmail.com>
>> >>> wrote:
>> >>> > Hi Norbert,
>> >>> >
>> >>> > Thanks a lot - this is indeed pretty fast. And the way you use thi=
s
>> >>> > in
>> >>> > Fold
>> >>> > is quite amazing,
>> >>> > as well as the observation that there is no unpacking - very cool.
>> >>> > As
>> >>> > far
>> >>> > as speeding up of
>> >>> > the myPosition function is concerned,  I toyed with precisely th=
e
>> >>> > same
>> >>> > idea
>> >>> > before
>> >>> > in version 5. I used Clip instead of Unitze (essentially
>> >>> > implementing
>> >>> > Unitize), but have completely forgotten about it until I saw your
>> >>> > solution.
>> >>> > I now used Unitize to improve it a little (about 5-10 %). My
>> >>> > benchmarks
>> >>> > show
>> >>> > that both my old and new versions are about twice faster than your=
s:
>> >>> >
>> >>> > In[1]:= Clear[myPositionOld, positionNP, myPositionNew];
>> >>> > myPositionOld[x_List, n_Integer] :=
>> >>> >   #[[All, 1]] &@
>> >>> >    Most@ArrayRules@SparseArray[1 - Clip[Abs[x - n], {0, 1}]];
>> >>> >
>> >>> > positionNP[x_List, n_Integer] :=
>> >>> >   Rest@DeleteDuplicates@Prepend[#, 0] &[Times[Range[Length[x]], =
(1 -
>> >>> > Unitize[x - n])]];
>> >>> >
>> >>> > myPositionNew[x_List, n_Integer] := #[[All, 1]] &@
>> >>> >    Most@ArrayRules@SparseArray[1 - Unitize[x - n]];
>> >>> >
>> >>> >
>> >>> > In[4]:= Timing[Do[myPositionOld[tst, 10], {50}]][[1]]/50
>> >>> >
>> >>> > Out[4]= 0.10562
>> >>> >
>> >>> > In[5]:= Timing[Do[positionNP[tst, 10], {50}]][[1]]/50
>> >>> >
>> >>> > Out[5]= 0.1928
>> >>> >
>> >>> > In[6]:= Timing[Do[myPositionNew[tst, 10], {50}]][[1]]/50
>> >>> >
>> >>> > Out[6]= 0.09564
>> >>> >
>> >>> > In[7]:=
>> >>> > Flatten@myPositionOld[tst, 10] == positionNP[tst, 10] ==
>> >>> >  Flatten[myPositionNew[tst, 10]]
>> >>> >
>> >>> > Out[7]= True
>> >>> >
>> >>> > Anyways, I often find it  amazing how far one can go in speeding=
 up
>> >>> > things
>> >>> > in
>> >>> > Mathematica - sometimes it can be really fast. Thanks for the new
>> >>> > info -
>> >>> > I
>> >>> > had no idea
>> >>> > that DeleteDuplicates is so fast on packed arrays, and I neither w=
as
>> >>> > I
>> >>> > aware
>> >>> > of Unitize.
>> >>> >
>> >>> > Regards,
>> >>> > Leonid
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Wed, Feb 3, 2010 at 12:43 PM, Norbert P. <bertapozar at gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Leonid,
>> >>> >>
>> >>> >> I guess JB doesn't care about speed improvement anymore, but this
>> >>> >> is
>> >>> >> an idea that I've been using for a week (since getting Mathematic=
a
>> >>> >> 7)
>> >>> >> that makes finding position in a packed array much faster. This
>> >>> >> works
>> >>> >> only in the case when one wants to find positions of all
>> >>> >> subsequences
>> >>> >> (see my code in In[6] and notice that my old computer is much
>> >>> >> slower
>> >>> >> than yours):
>> >>> >>
>> >>> >> In[1]:= list=RandomInteger[{1,15},3000000];
>> >>> >> seq={3,4,5,6};
>> >>> >> In[3]:= r1=Flatten@Position[Partition[list,4,1],{3,4,5,6}];//=
Timing
>> >>> >> Out[3]= {4.485,Null}
>> >>> >> In[4]:= r2=ReplaceList[list,{u___,3,4,5,6,___}:>Length[{u}]+1=
];//
>> >>> >> Timing
>> >>> >> Out[4]= {5.453,Null}
>> >>> >> In[5]:=
>> >>> >> r3=myPosition[myPartition[list,Length[seq]],seq,-1];//Timing
>> >>> >> Out[5]= {2.719,Null}
>> >>> >>
>> >>> >> In[6]:= fdz[v_]:=Rest@DeleteDuplicates@Prepend[v,0]
>> >>> >> r4=Fold[fdz[#1
>> >>> >> (1-Unitize[list[[#1]]-#2])]+1&,fdz[Range[Length[list]-
>> >>> >> Length[seq]+1](1-Unitize[list[[;;-Length[seq]]]-seq[[1]]])]
>> >>> >> +1,Rest@seq]-Length[seq];//Timing
>> >>> >> Out[7]= {0.422,Null}
>> >>> >>
>> >>> >> In[8]:= r1==r2==r3==r4
>> >>> >> Out[8]= True
>> >>> >>
>> >>> >> myPosition and myPartition are the functions from your post.
>> >>> >> I'm essentially using DeleteDuplicates together with Unitize to
>> >>> >> find
>> >>> >> positions of all occurrences of a specific number in an array. No
>> >>> >> unpacking occurs so it's quite fast. You can use this to possibly
>> >>> >> improve myPosition.
>> >>> >>
>> >>> >> Best,
>> >>> >> Norbert
>> >>> >>
>> >>> >> On Jan 31, 2:57 am, Leonid Shifrin <lsh... at gmail.com> wrote:
>> >>> >> > Hi again,
>> >>> >> >
>> >>> >> > In my first post one of the solutions (the compiled function)
>> >>> >> > contained
>> >>> >> > a
>> >>> >> > bug:
>> >>> >> >
>> >>> >> > In[1]:= posf[{1, 2, 3, 4, 4, 5, 6, 7}, {4, 5, 6}]
>> >>> >> >
>> >>> >> > Out[1]= -1
>> >>> >> >
>> >>> >> > which was because it ignores all but the first candidate
>> >>> >> > sequences in
>> >>> >> > the
>> >>> >> > case when they overlap. Here is a modified one which is
>> >>> >> > (hopefully)
>> >>> >> > correct,
>> >>> >> > if not as fast:
>> >>> >> >
>> >>> >> > posf2=
>> >>> >> > Compile[{{lst,_Integer,1},{target,_Integer,1}},
>> >>> >> >     Module[{i=1,len =Length[target],lstlen = Length[l=
st]},
>> >>> >> >       While[i<=lstlen-len+1&&Take[lst,{i,len+i-1}]!=t=
arget,i++];
>> >>> >> >        If[i>lstlen-len+1,-1,i]]]
>> >>> >> >
>> >>> >> > In[3]:= posf2[{1, 2, 3, 4, 4, 5, 6, 7}, {4, 5, 6}]
>> >>> >> >
>> >>> >> > Out[3]= 5
>> >>> >> >
>> >>> >> > This one will still be much superior to Partition-based
>> >>> >> > implementation
>> >>> >> > for
>> >>> >> > cases when  you can expect the sequence of interest  to app=
ear
>> >>> >> > rather
>> >>> >> > early
>> >>> >> > in the list. Anyway, sorry for the confusion with the buggy
>> >>> >> > version.
>> >>> >> >
>> >>> >> > By the way, should you wish to stick to Partition-based
>> >>> >> > implementation,
>> >>> >> > I
>> >>> >> > think it is fair to mention that for small sequence sizes and
>> >>> >> > partitioning
>> >>> >> > with a shift 1, *and*  when you have a list already in the pa=
cked
>> >>> >> > array
>> >>> >> > representation (which is possible when  your numbers are say =
all
>> >>> >> > integers or
>> >>> >> > all reals, but not a mix), one can implement a more efficient
>> >>> >> > version
>> >>> >> > than
>> >>> >> > the built-in Partition:
>> >>> >> >
>> >>> >> > Clear[myPartition];
>> >>> >> > myPartition[x_List, size_Integer] :=
>> >>> >> >   With[{len = Length[x]},
>> >>> >> >    Transpose@Table[x[[i ;; len - size + i]], {i, 1, size}]]=
;
>> >>> >> >
>> >>> >> > In[4]:= largeTestList = RandomInteger[{1, 15}, 3000000];
>> >>> >> >
>> >>> >> > In[5]:= (pt = Partition[largeTestList, 2, 1]); // Timing
>> >>> >> >
>> >>> >> > Out[5]= {0.521, Null}
>> >>> >> >
>> >>> >> > In[6]:= (mpt = myPartition[largeTestList , 2]); // Timing
>> >>> >> >
>> >>> >> > Out[6]= {0.17, Null}
>> >>> >> >
>> >>> >> > In[7]:= pt == mpt
>> >>> >> >
>> >>> >> > Out[7]= True
>> >>> >> >
>> >>> >> > The built-in Partition will start winning when the partitioning
>> >>> >> > size
>> >>> >> > will be
>> >>> >> > around 30-50, so for long sequences using a built-in is better.
>> >>> >> > If
>> >>> >> > your
>> >>> >> > list
>> >>> >> > is not packed, built-in Partition  will be a few times faster
>> >>> >> > even
>> >>> >> > for
>> >>> >> > small
>> >>> >> > partitioning sizes, so this will then be a de-optimization. You
>> >>> >> > can
>> >>> >> > attempt
>> >>> >> > to convert your list to packed array with Developer`ToPackedArr=
ay
>> >>> >> > (note
>> >>> >> > that
>> >>> >> > it does not issue any messages in case it is unable to do this)=
,
>> >>> >> > and
>> >>> >> > check
>> >>> >> > that your list is packed with Developer`PackedArrayQ.
>> >>> >> >
>> >>> >> > Likewise, you can implement your own Position-like function aim=
ed
>> >>> >> > at
>> >>> >> > exactly
>> >>> >> > this problem, which, under similar requirements of your list
>> >>> >> > being a
>> >>> >> > packed
>> >>> >> > array of numbers, will be better than the built-in Position in
>> >>> >> > most
>> >>> >> > cases:
>> >>> >> >
>> >>> >> > In[8]:=
>> >>> >> > Position[Partition[largeTest , 4, 1], {3, 4, 5, 6}, 1,
>> >>> >> >   1] // Timing
>> >>> >> >
>> >>> >> > Out[8]= {0.27, {{41940}}}
>> >>> >> >
>> >>> >> > In[9]:= myPosition[myPartition[largeTest , 4], {3, 4, 5, 6},
>> >>> >> >   1] // Timing
>> >>> >> >
>> >>> >> > Out[9]= {0.09, {41940}}
>> >>> >> >
>> >>> >> > In[10]:= Position[Partition[largeTest , 4, 1], {3, 4, 5, 6}, =
1,
>> >>> >> >   2] // Timing
>> >>> >> >
>> >>> >> > Out[10]= {0.301, {{41940}, {228293}}}
>> >>> >> >
>> >>> >> > In[11]:= myPosition[myPartition[largeTest , 4], {3, 4, 5, 6},
>> >>> >> >   2] // Timing
>> >>> >> >
>> >>> >> > Out[11]= {0.311, {41940, 228293}}
>> >>> >> >
>> >>> >> > In[12]:= Position[Partition[largeTest , 4, 1], {3, 4, 5, 6}] =
//
>> >>> >> > Timing
>> >>> >> >
>> >>> >> > Out[12]= {0.55, {{41940}, {228293}, {269300}}}
>> >>> >> >
>> >>> >> > In[13]:= myPosition[
>> >>> >> >   myPartition[largeTest , 4], {3, 4, 5, 6}, -1] // Timing
>> >>> >> >
>> >>> >> > Out[13]= {0.411, {41940, 228293, 269300}}
>> >>> >> >
>> >>> >> > where the last argument -1 is a convention to return all result=
s.
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > Leonid
>> >>> >> >
>> >>> >> > On Sat, Jan 30, 2010 at 4:12 AM, JB <jke... at gmail.com> wrote:
>> >>> >> > > Hi,
>> >>> >> >
>> >>> >> > > What is the most efficient way to find the position of the
>> >>> >> > > beginning
>> >>> >> > > of a sequence of numbers from a list?
>> >>> >> >
>> >>> >> > > I found a couple of ways:
>> >>> >> >
>> >>> >> > > find 3,4 in list={1,2,3,4,5};
>> >>> >> >
>> >>> >> > >  1.   pos=Intersection[Position[list,3],(Position[list,=
4])+1]
>> >>> >> >
>> >>> >> > >  2.   pos=Position[Partition[list,2,1],{3,4}]
>> >>> >> >
>> >>> >> > > Are there other ways to do this?
>> >>> >> > > What is the best way when dealing with large lists?
>> >>> >> >
>> >>> >> > > Thanks,
>> >>> >> > > JB
>> >>> >> >
>> >>> >> >
>> >>> >>
>> >>> >
>> >>> >
>> >>
>> >>
>> >
>
>
Prev by Date: a harmless notebook crashes Windows 7: who else had it?
Next by Date: Re: Re: Complex conjugate differentiation
Previous by thread: Re: position of sequence of numbers in list
Next by thread: Re: solving equations with BitXor