MathGroup Archive 2005

Re: Re: Pure Function for String Selection

  • To: mathgroup at smc.vnet.net
  • Subject: [mg61566] Re: Re: Pure Function for String Selection
  • From: Maxim <ab_def at prontomail.com>
  • Date: Sat, 22 Oct 2005 03:24:06 -0400 (EDT)
  • Organization: MTU-Intel ISP
  • References: <dht479$hl1$1@smc.vnet.net> <200510080649.CAA21019@smc.vnet.net> <dj731f$d29$1@smc.vnet.net>
  • Sender: owner-wri-mathgroup at wolfram.com

On Thu, 20 Oct 2005 03:30:55 +0000 (UTC), Oyvind Tafjord  
<tafjord at wolfram.com> wrote:

>
> ----- Original Message -----
> From: "Maxim" <ab_def at prontomail.com>
> To: mathgroup at smc.vnet.net
> Subject: [mg61566]  Re: Pure Function for String Selection
>
>
>> On Tue, 4 Oct 2005 05:33:29 +0000 (UTC), Edson Ferreira
>> <edsferr at uol.com.br> wrote:
>>
>> > Dear members,
>> >
>> > I want to define a pure function to filter a set of strings.
>> >
>> > The strings that compose the set have all the same length and the only
>> > characters in these strings are "1", "X" and "2".
>> >
>> > The function that I want is like the one below:
>> >
>> > In[1]:=
>> > Unprotect[D];
>> > In[2]:=
>> > U={"2","X"};
>> > In[3]:=
>> > M={"1","2"};
>> > In[4]:=
>> > D={"1","X"};
>> > In[5]:=
>> > T={"1","2","X"};
>> > In[6]:=
>> > L=Flatten[Outer[StringJoin,T,T,T,D]];
>> > In[7]:=
>> > L = Select[L, Count[Characters[#], "1"] > 1 &];
>> >
>> > In this case, it counts the number of characters "1" in each string and
>> > selects the ones that have more than one "1".
>> >
>> > I want a pure function, to be applied like the one in the example above,
>> > but for a different task.
>> >
>> > For each string, I want it to count the maximum number of repeated
>> > characters for each character.
>> >
>> > In other words, it must count the maximum number of repeated "1", "X"
>> > and "2" for each string.
>> >
>> > The string must be "selected" if:
>> >
>> > The longest run of repeated "1" is shorter than 8 characters
>> > AND
>> > The longest run of repeated "X" is shorter than 6 characters
>> > AND
>> > The longest run of repeated "2" is shorter than 6 characters
>> >
>> > For example:
>> > "11112X122X1XXX" should be "selected"
>> > (there are four "1" in sequence, 3 "X" in sequence and 2 "2" in
>> > sequence)
>> >
>> > "122XXXXXX222XX"  should NOT be "selected"
>> > (there are six "X" in sequence)
>> >
>> > "11111111222112" should NOT be "selected"
>> > (there are 8 "1" in sequence)
>> >
>> > Thanks a lot !!!!!
>> >
>> > Edson Ferreira
>> >
>> >
>>
>> This is very straightforward to do with RegularExpression:
>>
>> In[1]:= Select[{"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"},
>>    StringFreeQ[#, RegularExpression["1{8,}|X{6,}|2{6,}"]]&]
>>
>> Out[1]= {"11112X122X1XXX"}
>
> Note that using {8} instead of {8,} will also do the trick here, as well as
> StringFreeQ[#,"11111111"|"XXXXXX"|"222222"]&.
>
>>
>> There is one catch though: in Mathematica the {m,} quantifier is not
>> documented (it means m or more occurrences in a row). It's a very basic
>> construct, but the Mathematica documentation for RegularExpression
>> contains many other omissions where it's not clear whether it's safe to
>> use certain features. In particular, the documentation doesn't mention
>> named patterns; atomic grouping (?>); conditions; recursive patterns,
>> even though they all seem to be available.
>
> At least at the moment, the regular expression functionality is using the
> PCRE library (www.pcre.org), so all the functionality in that library is
> directly available, and should be for the foreseeable future.
>
>>
>> Besides, Mathematica string patterns and regex patterns don't go together
>> well:
>>
>> In[2]:= StringMatchQ["aa", RegularExpression["(.)\\1"]]
>>
>> Out[2]= True
>>
>> In[3]:= StringMatchQ["aa", x : RegularExpression["(.)\\1"]]
>>
>> Out[3]= False
>>
>> Here x is represented as a numbered subpattern too, so \\1 now refers to
>> the whole expression. This is mentioned in the Advanced Documentation, but
>> it's not obvious how to resolve this without named subpatterns (?P<name>):
>> we cannot use x:RegularExpression["(.)\\2"] as it generates an error
>> (RegularExpression::error15).
>
> Yes, this is a known limitation of the interplay between Mathematica pattern
> variables and the regular expression patterns.
>
>>
>> Another complication is that we can't use $n to refer to numbered
>> subpatterns on the rhs of the rule if the pattern includes Condition or
>> PatternTest:
>>
>> In[4]:= StringCases["a1b2", RegularExpression["(.)\\d"]?
>>    (OddQ @@ ToCharacterCode@ #&) -> "$1"]
>>
>> Out[4]= {"$1"}
>>
>> It looks more like a bug than a deliberate design, and in any case it
>> isn't explained in the documentation. So it seems safe to use
>> RegularExpression only by itself, not in combination with pattern
>> names/conditions/tests.
>
> The "$n" type substitution only happens when the pattern is a strict
> regular expression (head RegularExpression). Any other pattern is considered
> a Mathematica string pattern for which such substitutions do not happen.
>
>>
>> On the other hand, if one needs to work with strings of digit characters,
>> it may be better to use RegularExpression because of some bugs in the
>> automatic conversion of string patterns to regexes:
>>
>> In[5]:= StringMatchQ["112", x_ ~~ x_ ~~ "2"]
>>
>> Out[5]= False
>>
>> We can see what went wrong by examining the internal form of the pattern:
>>
>> In[6]:= StringPattern`PatternConvert[x_ ~~ x_ ~~ "2"]
>>
>> Out[6]= {"(?ms)(.)\\12", {{Hold[x], 1}}, {}, Hold[None]}
>>
>> The sequence \\12 is the backreference number 12, not backreference 1
>> followed by "2". The pattern should have been "(.)(?:\\1)2".
>
> Yes, that's clearly a bug which will get fixed for the next release.
>
> Oyvind Tafjord
> Wolfram Research
>

When we use RegularExpression["(.)"]?test -> "$1" it just seems strange that
a value is assigned to \\1 and then simply discarded. One way or another,
this is still not documented. In fact, the quantifier {m} isn't documented
either, so we have to use something like {8,8} if we want to stick to the
documentation.
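
As a minimal sketch of that documented-only spelling, assuming {m,n} behaves
as the documentation describes, the original selection can be written with
explicit bounds. The name strings is only for illustration; the three test
strings are the ones from the original question.

```mathematica
(* The three test strings from the original question *)
strings = {"11112X122X1XXX", "122XXXXXX222XX", "11111111222112"};

(* Reject any string containing 8 "1"s, 6 "X"s or 6 "2"s in a row.
   {8,8} and {6,6} are written without a space after the comma. *)
Select[strings,
  StringFreeQ[#, RegularExpression["1{8,8}|X{6,6}|2{6,6}"]] &]
```

Since a run of m or more characters always contains a run of exactly m, this
should select the same single string, "11112X122X1XXX", as the {8,} version.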

As for freely using all the other PCRE features, here's one example:

In[1]:= StringMatchQ["x", RegularExpression[
   "(?x)(?P<rec> ( | x (?P>rec) y)?)"]]

Out[1]= True

I don't see any way that "x" can be matched here without a following "y".
So which of the undocumented constructs weren't we supposed to use here?
Most likely the recursion.
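
One way to probe this further is to run the same pattern over a few short
inputs and compare each result with what a reading of the pattern suggests.
This is only a sketch: the name rec and the choice of test strings are
illustrative, and no outputs are claimed here.

```mathematica
(* Same recursive pattern as in In[1] above *)
rec = RegularExpression["(?x)(?P<rec> ( | x (?P>rec) y)?)"];

(* Pair each test string with the result of matching it in full *)
{#, StringMatchQ[#, rec]} & /@ {"", "x", "xy", "y"}
```

Reading the pattern, "" and "xy" ought to match, while "x" and "y" ought not
to, so any other result points at the same problem.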

Maxim Rytin
m.r at inbox.ru

