Re: Pure Function for String Selection

*To*: mathgroup at smc.vnet.net*Subject*: [mg61085] Re: Pure Function for String Selection*From*: "dkr" <dkrjeg at adelphia.net>*Date*: Sun, 9 Oct 2005 01:36:38 -0400 (EDT)*References*: <dht479$hl1$1@smc.vnet.net><200510070738.DAA03287@smc.vnet.net> <di7ra7$kov$1@smc.vnet.net>*Sender*: owner-wri-mathgroup at wolfram.com

Edson, Congratulations. Your approach is very elegant. I like it very much. The main attribute of your approach is that it will stop examining a string as soon as it has found the first "bad" run, while filter1, filter2, and filter3 all share the undesirable trait of identifying ALL bad runs in a string. I think this attribute is a major source of the relative efficiency of your approach. Consider the following alternative approaches: selectString5Q[str_String]:= StringFreeQ[str,{"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1", "X"~~"X"~~"X"~~"X"~~"X"~~"X", "2"~~"2"~~"2"~~"2"~~"2"~~"2"}]; This approach is a minor variant of your approach, using the fact that StringFreeQ can accomodate a list of patterns for its second argument. selectString6Q[str_String]:= StringCases[ str,{"X"~~"X"~~"X"~~"X"~~"X"~~"X","2"~~"2"~~"2"~~"2"~~"2"~~"2", "1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"},1]==={}; This approach attempts to remedy the undesirable trait of the earlier approaches by stopping after it finds the first bad run. The two new approaches above lead to : filter5[origList:{__String}]:=Select[origList,selectString5Q]; filter6[origList:{__String}]:=Select[origList,selectString6Q]; (Note we can get away with, eg., selectString5Q rather than selectString5Q[#]& as the second arg of Select, though the latter does no harm.) Finally we use an approach analogous to the one we used for your earlier string reduction problem. We take your list of strings and form one large string with appropriate delimiters, and then turn the string pattern matching techniques loose on this large string. filter7[origList:{__String}]:= StringSplit[ StringReplace[ToString[{#}]&/@origList//StringJoin, StringExpression[ "{",(WordCharacter)...,{"X"~~"X"~~"X"~~"X"~~"X"~~"X", "2"~~"2"~~"2"~~"2"~~"2"~~"2", "1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"},(WordCharacter)...,"}"]:> ""],Characters["{}"]..]; ToString[{#}]&/@origList//StringJoin turns your original list of strings into a single string, enclosing each individual string in brackets. This is identical to what we did in the previous string reduction thread. Then we run StringReplace on this long string, essentially weeding out any of the original strings which have a bad run. Finally, we use StringSplit to eliminate the delimiters and give us back a list of the remaining strings. Comparing these three approaches along with filter4 from your message we find: In[8]:= makeList[strLen_,listLen_]:= Table[StringJoin[{"1","2","X"}\[LeftDoubleBracket] Table[Random[ Integer,{1,3}],{strLen}]\[RightDoubleBracket]],{listLen}]; In[9]:= SeedRandom[1234]; egList1=makeList[14,1000]; egList2=makeList[30,1000]; egList3=makeList[30,20000]; egList4=makeList[100,20000]; In[14]:= {Timing[Length@filter4[egList1]],Timing[Length@filter5[egList1]], Timing[Length@filter6[egList1]],Timing[Length@filter7[egList1]]} Out[14]= {{0.13 Second,981},{0.09 Second,981},{0.09 Second,981},{0.04 Second,981}} In[15]:= {Timing[Length@filter7[egList2]],Timing[Length@filter4[egList2]], Timing[Length@filter5[egList2]],Timing[Length@filter6[egList2]]} Out[15]= {{0.06 Second,952},{0.11 Second,952},{0.09 Second,952},{0.1 Second,952}} In[16]:= {Timing[Length@filter6[egList3]],Timing[Length@filter7[egList3]], Timing[Length@filter4[egList3]],Timing[Length@filter5[egList3]]} Out[16]= {{1.27 Second,18989},{0.63 Second,18989},{1.06 Second,18989},{0.94 Second, 18989}} In[17]:= {Timing[Length@filter5[egList4]],Timing[Length@filter6[egList4]], Timing[Length@filter7[egList4]],Timing[Length@filter4[egList4]]} Out[17]= {{1.61 Second,16625},{1.26 Second,16625},{1.48 Second,16625},{1.31 Second, 16625}} (Note that I have rotated the order of appearance of the functions in the lists. The Timing function in Mathematica is sensitive to the state of the Mathematica session when it is invoked. Functions later in the list might benefit from internal tables and the like which Mathematica may have created in computing previous elements of the list. Presumably being last in the list is the most advantageous. By rotating the placement of the functions in the list, I hope to minimize such bias. Of course the best way would be to restart Mathematica for each timing, with SeedRandom being used to insure the values of egList1 etc. remain the same from session to session.) The results seems to favor filter7. It does slip a bit on egList4, but I think most of the time is being taken up (see below) by inserting delimiters at the beginning and then deleting them at the end. I haven't really paid much attention to these two aspects, and presumably they could be speeded up by taking advantage of special structure in the problem, such as that all the strings in the original list have the same length. I give that some additional thought. (Here is the time to change the list into a string with appropriate delimiters) In[18]:= Timing[ToString[{#}]&/@egList4//StringJoin;] Out[18]= {0.58 Second,Null} In[21]:= xxx=StringReplace[ToString[{#}]&/@egList4//StringJoin, StringExpression[ "{",(WordCharacter)...,{"X"~~"X"~~"X"~~"X"~~"X"~~"X", "2"~~"2"~~"2"~~"2"~~"2"~~"2", "1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"},(WordCharacter)...,"}"]:> ""]; (Here is the time to reverse the process and get back to a list of remaining strings) In[22]:= Timing[StringSplit[xxx,Characters["{}"]..];] Out[22]= {0.7 Second,Null} dkr

**References**:**Re: Pure Function for String Selection***From:*"dkr" <dkrjeg@adelphia.net>