Re: Pure Function for String Selection
- To: mathgroup at smc.vnet.net
- Subject: [mg61051] Re: Pure Function for String Selection
- From: Maxim <ab_def at prontomail.com>
- Date: Sat, 8 Oct 2005 02:49:51 -0400 (EDT)
- References: <dht479$hl1$1@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
On Tue, 4 Oct 2005 05:33:29 +0000 (UTC), Edson Ferreira
<edsferr at uol.com.br> wrote:
> Dear members,
>
> I want to define a pure function to filter a set of strings.
>
> The strings that compose the set have all the same length and the only
> characters in these strings are "1", "X" and "2".
>
> The function that I want is like the one below:
>
> In[1]:=
> Unprotect[D];
> In[2]:=
> U={"2","X"};
> In[3]:=
> M={"1","2"};
> In[4]:=
> D={"1","X"};
> In[5]:=
> T={"1","2","X"};
> In[6]:=
> L=Flatten[Outer[StringJoin,T,T,T,D]];
> In[7]:=
> L = Select[L, Count[Characters[#], "1"] > 1 &];
>
> In this case, it counts the number of characters "1" in each string and
> selects the ones that have more than one "1".
>
> I want a pure function, to be applied like the one in the example above,
> but for a different task.
>
> For each string, I want it to count the maximum number of repeated
> characters for each character.
>
> In other words, it must count the maximum number of repeated "1", "X"
> and "2" for each string.
>
> The string must be "selected" if:
>
> The longest run of repeated "1" is shorter than 8 characters
> AND
> The longest run of repeated "X" is shorter than 6 characters
> AND
> The longest run of repeated "2" is shorter than 6 characters
>
> For example:
> "11112X122X1XXX" should be "selected"
> (there are four "1" in sequence, 3 "X" in sequence and 2 "2" in sequence)
>
> "122XXXXXX222XX" should NOT be "selected"
> (there are six "X" in sequence)
>
> "11111111222112" should NOT be "selected"
> (there are 8 "1" in sequence)
>
> Thanks a lot !!!!!
>
> Edson Ferreira
>
>
This is very straightforward to do with RegularExpression:
In[1]:= Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]]&]
Out[1]= {"11112X122X1XXX"}
There is one catch though: the {m,} quantifier (m or more occurrences in a
row) is not documented in Mathematica. It's a very basic construct, but the
Mathematica documentation for RegularExpression has many other omissions
as well, which leaves it unclear whether certain features are safe to use.
In particular, the documentation doesn't mention named patterns, atomic
grouping (?>), conditions, or recursive patterns, even though they all
seem to be available.
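For what it's worth, the same filter can be sketched without {m,} or any
regex at all. This is only an illustration: maxRun and ok are hypothetical
names that don't appear in the original question. Split breaks each string
into runs of identical characters, and the last line relies on the fact
that a run of 8 or more "1" always contains exactly 8 of them in a row.

```mathematica
(* maxRun: hypothetical helper, length of the longest run of character c in s;
   the leading 0 covers strings that don't contain c at all *)
maxRun[s_String, c_String] :=
  Max[0, Length /@ Cases[Split[Characters[s]], {c, ___}]]

(* ok: hypothetical name for the pure function the question asks for *)
ok = maxRun[#, "1"] < 8 && maxRun[#, "X"] < 6 && maxRun[#, "2"] < 6 &;

Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"}, ok]
(* {"11112X122X1XXX"} *)

(* Same filter using only literal strings instead of a quantifier *)
Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
  StringFreeQ[#, "11111111" | "XXXXXX" | "222222"] &]
(* {"11112X122X1XXX"} *)
```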
Besides, Mathematica string patterns and regex patterns don't go together
well:
In[2]:= StringMatchQ["aa", RegularExpression["(.)\\1"]]
Out[2]= True
In[3]:= StringMatchQ["aa", x : RegularExpression["(.)\\1"]]
Out[3]= False
Here x is represented as a numbered subpattern too, so \\1 now refers to
the whole expression. This is mentioned in the Advanced Documentation, but
it's not obvious how to resolve this without named subpatterns (?P<name>):
we cannot use x:RegularExpression["(.)\\2"] as it generates an error
(RegularExpression::error15).
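One way to sidestep the renumbering, sketched here as a suggestion rather
than a documented recipe, is not to name the regex at all. StringCases
already returns the whole matched substrings, so any further processing can
be done with ordinary functions afterwards. The input "aabcc" and the use of
ToUpperCase are made up for the illustration.

```mathematica
(* The regex is used by itself, so \\1 still refers to the group (.) *)
doubles = StringCases["aabcc", RegularExpression["(.)\\1"]]
(* {"aa", "cc"} *)

(* Post-process the matches outside the pattern instead of naming it *)
ToUpperCase /@ doubles
(* {"AA", "CC"} *)
```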
Another complication is that we can't use $n to refer to numbered
subpatterns on the rhs of the rule if the pattern includes Condition or
PatternTest:
In[4]:= StringCases["a1b2", RegularExpression["(.)\\d"]?
(OddQ @@ ToCharacterCode@ #&) -> "$1"]
Out[4]= {"$1"}
It looks more like a bug than a deliberate design, and in any case it
isn't explained in the documentation. So it seems safe to use
RegularExpression only by itself, not in combination with pattern
names/conditions/tests.
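A possible workaround, given only as a sketch, is to split the work into two
steps: let the bare RegularExpression extract "$1", then apply the test with
Select. Note the assumption that the test here is applied to the captured
character only, which is not the same thing as the PatternTest in In[4].

```mathematica
(* Step 1: the regex is used by itself, so "$1" is substituted as expected *)
captured = StringCases["a1b2", RegularExpression["(.)\\d"] -> "$1"]
(* {"a", "b"} *)

(* Step 2: keep the captures whose character code is odd ("a" is 97, "b" is 98) *)
Select[captured, OddQ[First[ToCharacterCode[#]]] &]
(* {"a"} *)
```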
On the other hand, if one needs to work with strings of digit characters,
it may be better to use RegularExpression because of some bugs in the
automatic conversion of string patterns to regexes:
In[5]:= StringMatchQ["112", x_ ~~ x_ ~~ "2"]
Out[5]= False
We can see what went wrong by examining the internal form of the pattern:
In[6]:= StringPattern`PatternConvert[x_ ~~ x_ ~~ "2"]
Out[6]= {"(?ms)(.)\\12", {{Hold[x], 1}}, {}, Hold[None]}
The sequence \\12 is parsed as backreference number 12, not as backreference
1 followed by the literal "2". The pattern should have been "(.)(?:\\1)2".
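Until the conversion is fixed, the corrected pattern can be written by hand.
A minimal sketch, assuming the non-capturing group (?:...) behaves here as it
does in other PCRE-style engines:

```mathematica
(* (?:\\1) closes the backreference before the literal "2",
   so the regex can no longer be read as \\12 *)
StringMatchQ["112", RegularExpression["(.)(?:\\1)2"]]
(* True *)

(* A string whose first two characters differ should not match *)
StringMatchQ["122", RegularExpression["(.)(?:\\1)2"]]
(* False *)
```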
Maxim Rytin
m.r at inbox.ru
- Follow-Ups:
- Re: Re: Pure Function for String Selection
- From: "Oyvind Tafjord" <tafjord@wolfram.com>
- Re: Re: Pure Function for String Selection