MathGroup Archive 2005

Re: Pure Function for String Selection

  • To: mathgroup at smc.vnet.net
  • Subject: [mg61051] Re: Pure Function for String Selection
  • From: Maxim <ab_def at prontomail.com>
  • Date: Sat, 8 Oct 2005 02:49:51 -0400 (EDT)
  • References: <dht479$hl1$1@smc.vnet.net>
  • Sender: owner-wri-mathgroup at wolfram.com

On Tue, 4 Oct 2005 05:33:29 +0000 (UTC), Edson Ferreira  
<edsferr at uol.com.br> wrote:

> Dear members,
>
> I want to define a pure function to filter a set of strings.
>
> The strings that compose the set have all the same length and the only  
> characters in these strings are "1", "X" and "2".
>
> The function that I want is like the one below:
>
> In[1]:=
> Unprotect[D];
> In[2]:=
> U={"2","X"};
> In[3]:=
> M={"1","2"};
> In[4]:=
> D={"1","X"};
> In[5]:=
> T={"1","2","X"};
> In[6]:=
> L=Flatten[Outer[StringJoin,T,T,T,D]];
> In[7]:=
> L = Select[L, Count[Characters[#], "1"] > 1 &];
>
> In this case, it counts the number of "1" characters in each string and  
> selects the ones that have more than one "1".
>
> I want a pure function, to be applied like the one in the example above,  
> but for a different task.
>
> For each string, I want it to find, for each character, the length of  
> the longest run of that character.
>
> In other words, it must find the longest run of repeated "1", "X"  
> and "2" in each string.
>
> The string must be "selected" if:
>
> The longest run of repeated "1" is shorter than 8 characters
> AND
> The longest run of repeated "X" is shorter than 6 characters
> AND
> The longest run of repeated "2" is shorter than 6 characters
>
> For example:
> "11112X122X1XXX" should be "selected"
> (there are four "1" in sequence, 3 "X" in sequence and 2 "2" in sequence)
>
> "122XXXXXX222XX"  should NOT be "selected"
> (there are six "X" in sequence)
>
> "11111111222112" should NOT be "selected"
> (there are 8 "1" in sequence)
>
> Thanks a lot !!!!!
>
> Edson Ferreira
>
>

This is very straightforward to do with RegularExpression:

In[1]:= Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
   StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]]&]

Out[1]= {"11112X122X1XXX"}

There is one catch though: the {m,} quantifier (m or more occurrences in a  
row) is not documented in Mathematica. It's a very basic construct, but the  
Mathematica documentation for RegularExpression has many other omissions,  
which leaves it unclear whether certain features are safe to use. In  
particular, the documentation doesn't mention named patterns, atomic  
grouping (?>), conditions, or recursive patterns, even though they all seem  
to be available.
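
If relying on the undocumented {m,} quantifier is a concern, note that a run of 8 or more "1" always contains a run of exactly 8, so a pattern made of plain literal repeats should give the same answer. The sketch below checks this in Python rather than Mathematica (Python's re module is a different regex engine, and the names MAX_RUN, ok_quantifier, ok_literal and ok_runs are made up for this illustration). It also cross-checks both patterns against a direct run-length count, assuming the strings contain only "1", "X" and "2".

```python
import re
from itertools import groupby

# Shortest forbidden run for each character (from the original question).
MAX_RUN = {"1": 8, "X": 6, "2": 6}

# Variant 1: open-ended quantifier, i.e. 1{8,}|X{6,}|2{6,}
QUANTIFIER = re.compile("|".join(f"{c}{{{n},}}" for c, n in MAX_RUN.items()))

# Variant 2: the same test using only literal repeats, no quantifier.
LITERAL = re.compile("|".join(c * n for c, n in MAX_RUN.items()))

def ok_quantifier(s):
    return QUANTIFIER.search(s) is None

def ok_literal(s):
    return LITERAL.search(s) is None

def ok_runs(s):
    # Direct check: no run of identical characters reaches its limit.
    return all(len(list(g)) < MAX_RUN[c] for c, g in groupby(s))

samples = ["11112X122X1XXX", "122XXXXXX222XX", "11111111222112"]
for s in samples:
    print(s, ok_quantifier(s), ok_literal(s), ok_runs(s))
```

Only the first sample passes all three checks, which agrees with Out[1] above.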

Besides, Mathematica string patterns and regex patterns don't go together  
well:

In[2]:= StringMatchQ["aa", RegularExpression["(.)\\1"]]

Out[2]= True

In[3]:= StringMatchQ["aa", x : RegularExpression["(.)\\1"]]

Out[3]= False

Here x is represented as a numbered subpattern too, so \\1 now refers to  
the whole expression. This is mentioned in the Advanced Documentation, but  
it's not obvious how to resolve this without named subpatterns (?P<name>):  
we cannot use x:RegularExpression["(.)\\2"] as it generates an error  
(RegularExpression::error15).
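
The renumbering effect can be reproduced outside Mathematica. The sketch below uses Python's re module (a different engine, so the error behaviour is not the same as Mathematica's) and a hypothetical helper wrap() that imitates what naming a pattern does: it encloses the regex in one more capturing group. A numbered backreference then points at the wrong group, while a named one written with (?P<name>...) is unaffected.

```python
import re

def wrap(pattern):
    # Hypothetical stand-in for "x : pattern": add an outer capturing group.
    return "(" + pattern + ")"

numbered = r"(.)\1"          # backreference by number
named = r"(?P<c>.)(?P=c)"    # backreference by name

# The standalone numbered pattern matches "aa".
print(re.fullmatch(numbered, "aa") is not None)

# After wrapping, \1 points at the outer group, which is still open.
try:
    re.compile(wrap(numbered))
    wrapped_numbered_ok = True
except re.error:
    wrapped_numbered_ok = False
print(wrapped_numbered_ok)

# The named reference keeps its meaning after wrapping.
print(re.fullmatch(wrap(named), "aa") is not None)
```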

Another complication is that we can't use $n to refer to numbered  
subpatterns on the rhs of the rule if the pattern includes Condition or  
PatternTest:

In[4]:= StringCases["a1b2", RegularExpression["(.)\\d"]?
   (OddQ @@ ToCharacterCode@ #&) -> "$1"]

Out[4]= {"$1"}

It looks more like a bug than a deliberate design, and in any case it  
isn't explained in the documentation. So it seems safe to use  
RegularExpression only by itself, not in combination with pattern  
names/conditions/tests.

On the other hand, if one needs to work with strings of digit characters,  
it may be better to use RegularExpression because of some bugs in the  
automatic conversion of string patterns to regexes:

In[5]:= StringMatchQ["112", x_ ~~ x_ ~~ "2"]

Out[5]= False

We can see what went wrong by examining the internal form of the pattern:

In[6]:= StringPattern`PatternConvert[x_ ~~ x_ ~~ "2"]

Out[6]= {"(?ms)(.)\\12", {{Hold[x], 1}}, {}, Hold[None]}

The sequence \\12 is parsed as backreference number 12, not as backreference  
1 followed by the literal "2". The pattern should have been "(.)(?:\\1)2".
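
The digit-merging problem is not specific to Mathematica's converter: any generator that pastes a backreference directly in front of a literal digit can run into it. As a rough illustration in Python (whose re module rejects the ambiguous form instead of silently misreading it, so the behaviour differs from Mathematica's), the sketch below shows the naive concatenation failing and the non-capturing-group form working. The helper names naive and guarded are invented here.

```python
import re

def naive(group_no, literal):
    # Pastes the backreference straight in front of the literal text.
    return "(.)" + "\\" + str(group_no) + literal

def guarded(group_no, literal):
    # Isolates the backreference in a non-capturing group first.
    return "(.)" + "(?:\\" + str(group_no) + ")" + literal

print(naive(1, "2"))    # (.)\12
print(guarded(1, "2"))  # (.)(?:\1)2

# The naive form is read as a reference to group 12, which does not exist.
try:
    re.compile(naive(1, "2"))
    naive_ok = True
except re.error:
    naive_ok = False
print(naive_ok)

# The guarded form matches "112" as intended.
print(re.fullmatch(guarded(1, "2"), "112") is not None)
```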

Maxim Rytin
m.r at inbox.ru

