Re:Re: Pure Function for String Selection
- To: mathgroup at smc.vnet.net
- Subject: [mg61071] Re:[mg61051] Re: Pure Function for String Selection
- From: "Edson Ferreira" <edsferr at uol.com.br>
- Date: Sun, 9 Oct 2005 01:35:37 -0400 (EDT)
- Sender: owner-wri-mathgroup at wolfram.com
Dear Members,
The new winner is Maxim Rytin with his filter5 function:
In[1]:=
selectString1Q[str_String] :=
Module[{ch=Characters[str]},ch={First[#],Length[#]}&/@Split[ch];
Max[Last /@ Select[ch, MatchQ[#, {"X", _}] &]] < 6 &&
Max[Last /@ Select[ch, MatchQ[#, {"2", _}] &]] < 6 &&
Max[Last /@ Select[ch, MatchQ[#, {"1", _}] &]] < 8
] ;
In[2]:=
selectString2Q[str_String] :=
Module[{ch},
ch =StringCases[str,y:(x_)..:>{x,StringLength[y]}];
Max[Last /@ Select[ch, MatchQ[#, {"X", _}] &]] < 6 &&
Max[Last /@ Select[ch, MatchQ[#, {"2", _}] &]] < 6 &&
Max[Last /@ Select[ch, MatchQ[#, {"1", _}] &]] < 8
] ;
In[3]:=
maxseq[s_String,z_String]:=Max[StringLength/@StringCases[s,z..]];
selectString3Q[str_String]:=
maxseq[str,"1"]<8&&maxseq[str,"X"]<6&&maxseq[str,"2"]<6;
In[5]:=
selectString4Q[str_String]:=
StringFreeQ[str,"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"] &&
StringFreeQ[str,"X"~~"X"~~"X"~~"X"~~"X"~~"X"] &&
StringFreeQ[str,"2"~~"2"~~"2"~~"2"~~"2"~~"2"];
In[6]:=
selectString5Q[str_String]:=
StringFreeQ[str,RegularExpression["1{8,}|X{6,}|2{6,}"]];
In[7]:=
filter1[origList:{__String}]:=Select[origList,selectString1Q[#]&];
filter2[origList:{__String}]:=Select[origList,selectString2Q[#]&];
filter3[origList:{__String}]:=Select[origList,selectString3Q[#]&];
filter4[origList:{__String}]:=Select[origList,selectString4Q[#]&];
filter5[origList:{__String}]:=Select[origList,selectString5Q[#]&];
In[12]:=
makeList[strLen_,listLen_]:=
Table[StringJoin[{"1","2","X"}[[
Table[Random[Integer,{1,3}],{strLen}]]]],{listLen}];
In[13]:=
egList1=makeList[14,1000];
egList2=makeList[30,1000];
egList3=makeList[30,20000];
egList4=makeList[100,20000];
In[17]:=
{Timing[Length@filter3[egList1]],Timing[Length@filter2[egList1]],
Timing[Length@filter1[egList1]],Timing[Length@filter4[egList1]],
Timing[Length@filter5[egList1]]}
Out[17]=
{{0.4 Second,976},{1.232 Second,976},{1.262 Second,976},{0.26 Second,
976},{0.07 Second,976}}
In[18]:=
{Timing[Length@filter1[egList2]],Timing[Length@filter3[egList2]],
Timing[Length@filter2[egList2]],Timing[Length@filter4[egList2]],
Timing[Length@filter5[egList2]]}
Out[18]=
{{2.243 Second,948},{0.411 Second,948},{2.143 Second,948},{0.24 Second,
948},{0.07 Second,948}}
In[19]:=
{Timing[Length@filter2[egList3]],Timing[Length@filter1[egList3]],
Timing[Length@filter3[egList3]],Timing[Length@filter5[egList3]],
Timing[Length@filter4[egList3]]}
Out[19]=
{{43.302 Second,19043},{42.231 Second,19043},{8.712 Second,
19043},{1.392 Second,19043},{4.807 Second,19043}}
In[20]:=
{Timing[Length@filter5[egList4]],Timing[Length@filter2[egList4]],
Timing[Length@filter3[egList4]],
Timing[Length@filter1[egList4]],Timing[Length@filter4[egList4]]}
Out[20]=
{{2.854 Second,16685},{121.445 Second,16685},{13.719 Second,
16685},{119.893 Second,16685},{5.868 Second,16685}}
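
The equal counts above suggest that the five filters agree, but a count alone does not prove they pick the same strings. A minimal sketch of a stricter check, assuming filter1 to filter5 and egList1 are defined as in the session above (the names filters and results are only illustrative):

```mathematica
(* Assumes filter1..filter5 and egList1 from the session above are already defined. *)
filters = {filter1, filter2, filter3, filter4, filter5};

(* Each entry is {time, selected strings}, in the same order as filters. *)
results = Timing[#[egList1]] & /@ filters;

(* True only if all five filters selected exactly the same strings. *)
SameQ @@ (Last /@ results)

(* Timings alone, for side-by-side comparison. *)
First /@ results
```

Timing every filter on the same list in a single pass also keeps the comparison on identical input.
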
An even better solution!!!
Thanks and congratulations!
Edson Ferreira
Mechanical Engineer - Brazil
> On Tue, 4 Oct 2005 05:33:29 +0000 (UTC), Edson Ferreira
> wrote:
>
> > Dear members,
> >
> > I want to define a pure function to filter a set of strings.
> >
> > The strings that compose the set all have the same length and the only
> > characters in these strings are "1", "X" and "2".
> >
> > The function that I want is like the one below:
> >
> > In[1]:=
> > Unprotect[D];
> > In[2]:=
> > U={"2","X"};
> > In[3]:=
> > M={"1","2"};
> > In[4]:=
> > D={"1","X"};
> > In[5]:=
> > T={"1","2","X"};
> > In[6]:=
> > L=Flatten[Outer[StringJoin,T,T,T,D]];
> > In[7]:=
> > L = Select[L, Count[Characters[#], "1"] > 1 &];
> >
> > In this case, it counts the number of characters "1" in each string and
> > selects the ones that have more than one "1".
> >
> > I want a pure function, to be applied like the one in the example above,
> > but for a different task.
> >
> > For each string, I want it to count the maximum number of repeated
> > characters for each character.
> >
> > In other words, it must count the maximum number of repeated "1", "X"
> > and "2" for each string.
> >
> > The string must be "selected" if:
> >
> > The longest run of repeated "1" is shorter than 8 characters
> > AND
> > The longest run of repeated "X" is shorter than 6 characters
> > AND
> > The longest run of repeated "2" is shorter than 6 characters
> >
> > For example:
> > "11112X122X1XXX" should be "selected"
> > (there are four "1" in sequence, 3 "X" in sequence and 2 "2" in sequence)
> >
> > "122XXXXXX222XX" should NOT be "selected"
> > (there are six "X" in sequence)
> >
> > "11111111222112" should NOT be "selected"
> > (there are 8 "1" in sequence)
> >
> > Thanks a lot !!!!!
> >
> > Edson Ferreira
> >
> >
>
> This is very straightforward to do with RegularExpression:
>
> In[1]:= Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
> StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]]&]
>
> Out[1]= {"11112X122X1XXX"}
>
> There is one catch though: in Mathematica the {m,} quantifier is not
> documented (it means m or more occurrences in a row). It's a very basic
> construct, but the Mathematica documentation for RegularExpression
> contains many other omissions where it's not clear whether it's safe to
> use certain features. In particular, the documentation doesn't mention
> named patterns; atomic grouping (?>); conditions; recursive patterns, even
> though they all seem to be available.
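
If relying on the undocumented {m,} quantifier is a concern, the same test can be written with literal runs only, since a string contains 8 or more "1" in a row exactly when it contains "11111111". A minimal sketch of that idea; the names run and selectString6Q are hypothetical and do not come from the thread:

```mathematica
(* Hypothetical helper: a literal run of n copies of the character c. *)
run[c_String, n_Integer] := StringJoin[Table[c, {n}]];

(* Reject any string containing 8 "1", 6 "X" or 6 "2" in a row. *)
selectString6Q[str_String] :=
  StringFreeQ[str, run["1", 8] | run["X", 6] | run["2", 6]];

Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"}, selectString6Q]
(* expected: {"11112X122X1XXX"} *)
```

This is essentially selectString4Q with the three tests merged into one pattern, so it is not necessarily as fast as the regex version.
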
>
> Besides, Mathematica string patterns and regex patterns don't go together
> well:
>
> In[2]:= StringMatchQ["aa", RegularExpression["(.)\\1"]]
>
> Out[2]= True
>
> In[3]:= StringMatchQ["aa", x : RegularExpression["(.)\\1"]]
>
> Out[3]= False
>
> Here x is represented as a numbered subpattern too, so \\1 now refers to
> the whole expression. This is mentioned in the Advanced Documentation, but
> it's not obvious how to resolve this without named subpatterns (?P):
> we cannot use x:RegularExpression["(.)\\2"] as it generates an error
> (RegularExpression::error15).
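
One possible way around the renumbering, offered as a sketch rather than a documented recipe, is to keep the name out of the Mathematica pattern and let the regex itself capture the whole match with an extra outer group. The backreference then points at group 2 by construction. The symbol doubled is only an illustrative name:

```mathematica
(* Outer group 1 captures the whole match; inner group 2 is the repeated character. *)
doubled = RegularExpression["((.)\\2)"];

StringMatchQ["aa", doubled]
(* expected: True *)

(* "$1" now stands for the whole doubled pair. *)
StringCases["aabcc", doubled -> "$1"]
(* expected: {"aa", "cc"} *)
```
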
>
> Another complication is that we can't use $n to refer to numbered
> subpatterns on the rhs of the rule if the pattern includes Condition or
> PatternTest:
>
> In[4]:= StringCases["a1b2", RegularExpression["(.)\\d"]?
> (OddQ @@ ToCharacterCode@ #&) -> "$1"]
>
> Out[4]= {"$1"}
>
> It looks more like a bug than a deliberate design, and in any case it
> isn't explained in the documentation. So it seems safe to use
> RegularExpression only by itself, not in combination with pattern
> names/conditions/tests.
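
Following that advice, the test can be moved out of the pattern and applied after the match. A minimal sketch, assuming the intent of In[4] was to keep the matches whose first character has an odd character code (the name firstChars is illustrative):

```mathematica
(* Step 1: plain regex rule with no Condition/PatternTest, so "$1" is substituted. *)
firstChars = StringCases["a1b2", RegularExpression["(.)\\d"] -> "$1"]
(* expected: {"a", "b"} *)

(* Step 2: apply the test afterwards (assumed intent: odd character code). *)
Select[firstChars, OddQ[First[ToCharacterCode[#]]] &]
(* expected: {"a"} *)
```
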
>
> On the other hand, if one needs to work with strings of digit characters,
> it may be better to use RegularExpression because of some bugs in the
> automatic conversion of string patterns to regexes:
>
> In[5]:= StringMatchQ["112", x_ ~~ x_ ~~ "2"]
>
> Out[5]= False
>
> We can see what went wrong by examining the internal form of the pattern:
>
> In[6]:= StringPattern`PatternConvert[x_ ~~ x_ ~~ "2"]
>
> Out[6]= {"(?ms)(.)\\12", {{Hold[x], 1}}, {}, Hold[None]}
>
> The sequence \\12 is the backreference number 12, not backreference 1
> followed by "2". The pattern should have been "(.)(?:\\1)2".
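
Until the conversion is fixed, the corrected regex can be written by hand. A short sketch using the pattern quoted above, in which the non-capturing group (?:...) keeps the backreference \\1 separate from the literal "2":

```mathematica
(* Hand-written regex with the backreference isolated in a non-capturing group. *)
StringMatchQ["112", RegularExpression["(.)(?:\\1)2"]]
(* expected: True *)

(* Same idea on a list: keep strings made of a doubled character followed by "2". *)
Select[{"112", "122", "XX2"}, StringMatchQ[#, RegularExpression["(.)(?:\\1)2"]] &]
(* expected: {"112", "XX2"} *)
```
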
>
> Maxim Rytin
> m.r at inbox.ru
>
>