Re: Re: Pure Function for String Selection

• To: mathgroup at smc.vnet.net
• Subject: [mg61497] Re: [mg61051] Re: Pure Function for String Selection
• From: "Oyvind Tafjord" <tafjord at wolfram.com>
• Date: Wed, 19 Oct 2005 23:08:35 -0400 (EDT)
• References: <dht479$hl1$1@smc.vnet.net> <200510080649.CAA21019@smc.vnet.net>
• Sender: owner-wri-mathgroup at wolfram.com

```
----- Original Message -----
From: "Maxim" <ab_def at prontomail.com>
To: mathgroup at smc.vnet.net
Subject: [mg61497] [mg61051] Re: Pure Function for String Selection

> On Tue, 4 Oct 2005 05:33:29 +0000 (UTC), Edson Ferreira
> <edsferr at uol.com.br> wrote:
>
> > Dear members,
> >
> > I want to define a pure function to filter a set of strings.
> >
> > The strings that compose the set have all the same length and the only
> > characters in these strings are "1", "X" and "2".
> >
> > The function that I want is like the one below:
> >
> > In[1]:=
> > Unprotect[D];
> > In[2]:=
> > U={"2","X"};
> > In[3]:=
> > M={"1","2"};
> > In[4]:=
> > D={"1","X"};
> > In[5]:=
> > T={"1","2","X"};
> > In[6]:=
> > L=Flatten[Outer[StringJoin,T,T,T,D]];
> > In[7]:=
> > L = Select[L, Count[Characters[#], "1"] > 1 &];
> >
> > In this case, it counts the number of characters "1" in each string and
> > selects the ones that have more than one "1".
> >
> > I want a pure function, to be applied like the one in the example above,
> > but for a different task.
> >
> > For each string, I want it to count the maximum number of repeated
> > characters for each character.
> >
> > In other words, it must count the maximum number of repeated "1", "X"
> > and "2" for each string.
> >
> > The string must be "selected" if:
> >
> > The longest run of repeated "1" is shorter than 8 characters
> > AND
> > The longest run of repeated "X" is shorter than 6 characters
> > AND
> > The longest run of repeated "2" is shorter than 6 characters
> >
> > For example:
> > "11112X122X1XXX" should be "selected"
> > (there are four "1" in sequence, 3 "X" in sequence and 2 "2" in sequence)
> >
> > "122XXXXXX222XX"  should NOT be "selected"
> > (there are six "X" in sequence)
> >
> > "11111111222112" should NOT be "selected"
> > (there are 8 "1" in sequence)
> >
> > Thanks a lot !!!!!
> >
> > Edson Ferreira
> >
> >
>
> This is very straightforward to do with RegularExpression:
>
> In[1]:= Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
>    StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]]&]
>
> Out[1]= {"11112X122X1XXX"}

Note that using {8} instead of {8,} (and likewise {6} instead of {6,}) will
also do the trick here, since any longer run contains a run of exactly that
length. So will StringFreeQ[#,"11111111"|"XXXXXX"|"222222"]&.

>
> There is one catch though: in Mathematica the {m,} quantifier is not
> documented (it means m or more occurrences in a row). It's a very basic
> construct, but the Mathematica documentation for RegularExpression
> contains many other omissions where it's not clear whether it's safe to
> use certain features. In particular, the documentation doesn't mention
> named patterns; atomic grouping (?>); conditions; recursive patterns, even
> though they all seem to be available.

At least at the moment, the regular expression functionality is using the
PCRE library (www.pcre.org), so all the functionality in that library is
directly available, and should be for the foreseeable future.

>
> Besides, Mathematica string patterns and regex patterns don't go together
> well:
>
> In[2]:= StringMatchQ["aa", RegularExpression["(.)\\1"]]
>
> Out[2]= True
>
> In[3]:= StringMatchQ["aa", x : RegularExpression["(.)\\1"]]
>
> Out[3]= False
>
> Here x is represented as a numbered subpattern too, so \\1 now refers to
> the whole expression. This is mentioned in the Advanced Documentation, but
> it's not obvious how to resolve this without named subpatterns (?P<name>):
> we cannot use x:RegularExpression["(.)\\2"] as it generates an error
> (RegularExpression::error15).

Yes, this is a known limitation of the interplay between Mathematica pattern
variables and the regular expression patterns.

>
> Another complication is that we can't use $n to refer to numbered
> subpatterns on the rhs of the rule if the pattern includes Condition or
> PatternTest:
>
> In[4]:= StringCases["a1b2", RegularExpression["(.)\\d"]?
>    (OddQ @@ ToCharacterCode@ #&) -> "$1"]
>
> Out[4]= {"$1"}
>
> It looks more like a bug than a deliberate design, and in any case it
> isn't explained in the documentation. So it seems safe to use
> RegularExpression only by itself, not in combination with pattern
> names/conditions/tests.

The "$n" type substitutions only happen when the pattern is a strict
regular expression (head RegularExpression). Any other pattern is considered
a Mathematica string pattern, for which such substitutions do not happen.

>
> On the other hand, if one needs to work with strings of digit characters,
> it may be better to use RegularExpression because of some bugs in the
> automatic conversion of string patterns to regexes:
>
> In[5]:= StringMatchQ["112", x_ ~~ x_ ~~ "2"]
>
> Out[5]= False
>
> We can see what went wrong by examining the internal form of the pattern:
>
> In[6]:= StringPattern`PatternConvert[x_ ~~ x_ ~~ "2"]
>
> Out[6]= {"(?ms)(.)\\12", {{Hold[x], 1}}, {}, Hold[None]}
>
> The sequence \\12 is the backreference number 12, not backreference 1
> followed by "2". The pattern should have been "(.)(?:\\1)2".

Yes, that's clearly a bug which will get fixed for the next release.

Oyvind Tafjord
Wolfram Research

>
> Maxim Rytin
> m.r at inbox.ru

```
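
The points raised in the thread can be collected into one short sketch. It is an editorial illustration, not code from either poster. `longestRun` and `acceptQ` are hypothetical helper names, and the run-length version built on `Split` is an alternative to the regex, shown only to make the selection rule explicit. The last two expressions assume the behavior described in the replies: isolating `\\1` in a non-capturing group keeps it from being read together with the following digit, and `"$1"` is substituted when the left-hand side of the rule is a bare `RegularExpression`.

```mathematica
(* Hypothetical helper: length of the longest run of character c in s, 0 if c is absent *)
longestRun[s_String, c_String] :=
  Max[0, Length /@ Cases[Split[Characters[s]], {c, ___}]]

(* Hypothetical predicate: the three limits stated in the original question *)
acceptQ[s_String] :=
  longestRun[s, "1"] < 8 && longestRun[s, "X"] < 6 && longestRun[s, "2"] < 6

tests = {"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"};

(* Run-length version; usable in Select like a pure function *)
Select[tests, acceptQ]
(* {"11112X122X1XXX"} *)

(* Regex version from the thread, for comparison *)
Select[tests, StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]] &]
(* {"11112X122X1XXX"} *)

(* Backreference followed by a literal digit: the non-capturing group
   keeps \1 separate from the "2" *)
StringMatchQ["112", RegularExpression["(.)(?:\\1)2"]]
(* True *)

(* "$1" substitution with a bare RegularExpression on the left-hand side *)
StringCases["a1b2", RegularExpression["(.)\\d"] -> "$1"]
(* {"a", "b"} *)
```

The `Split` version is longer than the regex, but it relies only on documented list functions, which may matter if the undocumented regex features discussed above are a concern.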
