MathGroup Archive 2006

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: Re: Pattern match question

  • To: mathgroup at smc.vnet.net
  • Subject: [mg68439] Re: [mg68418] Re: Pattern match question
  • From: "Oyvind Tafjord" <tafjord at wolfram.com>
  • Date: Sat, 5 Aug 2006 03:46:39 -0400 (EDT)
  • References: <easj50$fq0$1@smc.vnet.net> <200608040759.DAA01079@smc.vnet.net>
  • Sender: owner-wri-mathgroup at wolfram.com

----- Original Message ----- 
From: <ab_def at prontomail.com>
To: mathgroup at smc.vnet.net
Subject: [mg68439] [mg68418] Re: Pattern match question


>
> glassymeow at yahoo.com wrote:
>> hi
>> txt = "ZACCZBNRCSAACXBXX";
>> letters = "ABC";
>> i want to find the first occurrences of any of the
>> six combinations of the letters of the set "ABC" Globally, and
>> without overlap option. and the space between letters does not
>> important.
>> in the above txt string the result must be:
>> Out[]:=
>> ACCZB
>> CSAACXB
>> i wish a solution using mathematica regular expressions.
>> the Regex pattern  (A|B|C).*?(A|B|C).*?(A|B|C)  will give the out:
>> ACC ,  BNRCSA  ,  ACXB   because it considers the permutations
>> and not the combinations
>> the following is an old fashion program which will emulate the human
>> pencil and
>> paper method,  will solve the problem, but i am sure there are a better
>> solutions.
>>
>> txt = "ZACCZBNRCSAACXBXX";
>> letters = "ABC";
>> ptrnLtrs = "";
>> (* make a string of 26 zero's as the number of the alphbet*)
>> For[i = 1, i <= 26, ptrnLtrs = StringJoin[ptrnLtrs, "0"]; i++]
>> (* replace every letter of the pattern letters *)
>> (* with a corresponding 1 in the string of the zero's *)
>> For[i = 1, i <= StringLength[letters],
>>     num = ToCharacterCode[StringTake[letters, {i, i}]];
>>     num = num - 64;
>>     ptrnLtrs = StringReplacePart[ptrnLtrs, "1", Flatten[{num, num}]];
>>     i++];
>>
>> (* the procedural pattern match *)
>> ptrnLtrsBak = ptrnLtrs; y = 0;    (* backup for the ptrnLtrs *)
>> beginFlag = 0; result = ""; lst = {};
>> For[i = 1, i <= StringLength[txt],
>>     OneLetter = StringTake[txt, {i, i}];
>>     If[beginFlag == 0 && StringCases[letters, OneLetter] == {},
>>        Goto[jmp]];
>>      num = ToCharacterCode[StringTake[txt, {i, i}]] - 64;
>>      If[StringTake[ptrnLtrs, num] == "1",
>>         result = StringJoin[result, OneLetter];
>>         ptrnLtrs = StringReplacePart[ptrnLtrs, "0", Flatten[{num,
>> num}]];
>>         , result = StringJoin[result, OneLetter];];
>>       beginFlag = 1;
>>       If[ToExpression[ptrnLtrs] == 0, ptrnLtrs = ptrnLtrsBak;
>>           Print[result];
>>           result = "";  beginFlag = 0;];
>>        Label[jmp];
>>        i++]
>>
>> Out[]:=
>> ACCZB
>> CSAACXB
>>
>> regards
>> peter glassy
>
> Here's a solution that uses string expressions:
>
> In[1]:= Module[
>  {Lpatt = StringExpression @@@ (Insert[#, ___, {{2}, {3}}]&) /@
>      Permutations@ {"A", "B", "C"}},
>  StringCases["ZACCZBNRCSAACXBXX",
>    ShortestMatch[s__] /; StringMatchQ[s, Lpatt]]]
>
> Out[1]= {"ACCZB", "CSAACXB"}
>
> Note that in StringCases a list of patterns {patt1, patt2, ...} is
> equivalent to patt1 | patt2 | ... . We cannot directly use
> ShortestMatch[patt1 | patt2] because this merely makes all the
> quantifiers in the regex lazy but doesn't guarantee that we get the
> shortest possible match:
>
> In[2]:= StringPattern`PatternConvert[
>  ShortestMatch[("A" ~~ ___ ~~ "B") | ("A" ~~ ___ ~~ "C")]]
>
> Out[2]= {"(?ms)(?:A.*?B|A.*?C)", {}, {}, Hold[None]}
>
> In[3]:= StringCases["ACB",
>  ShortestMatch[("A" ~~ ___ ~~ "B") | ("A" ~~ ___ ~~ "C")]]
>
> Out[3]= {"ACB"}
>
> The shortest match would be "AC". So it's interesting to consider how
> we can obtain the same answer {"ACCZB", "CSAACXB"} using only
> RegularExpression without external conditions.
>
> Maxim Rytin
> m.r at inbox.ru

Yes, because of the underlying algorithm used, it's true that ShortestMatch 
does not affect Alternatives, only quantifiers like "__" and "..".

Here are a couple of regular expression approaches:

In[1]:= StringCases["ZACCZBNRCSAACXBXX",
RegularExpression[
"A.*?(?:B.*?C|C.*?B)|B.*?(?:A.*?C|C.*?A)|C.*?(?:A.*?B|B.*?A)"]]

Out[1]= {"ACCZB", "CSAACXB"}

In[2]:= StringCases["ZACCZBNRCSAACXBXX",
RegularExpression["(A|B|C).*?(?!\\1)(A|B|C).*?(?!\\1)(?!\\2)(?:A|B|C)"]]

Out[2]= {"ACCZB", "CSAACXB"}

The first is similar to Maxim's pattern, except it makes sure that each 
alternative can only be matched once, this also helps with efficiency. It 
can also be written as a StringExpression (see below).

The second is similar to Peter's original pattern, but it uses a "negative 
lookahead" to make sure that the same character is not repeated. There is no 
StringExpression equivalent of this.

The first pattern is probably slightly faster in general (if it's confusing 
to read, remember that .*? is basically ShortestMatch[___] and (?:...) is 
the way to group things without wasting time on storing the subpattern). 
Here is the StringExpression version of it:

In[3]:=
StringCases["ZACCZBNRCSAACXBXX",
ShortestMatch[
("A" ~~ ___ ~~ (("B" ~~ ___ ~~ "C") | ("C" ~~ ___ ~~ "B"))) |
("B" ~~ ___ ~~ (("A" ~~ ___ ~~ "C") | ("C" ~~ ___ ~~ "A"))) |
("C" ~~ ___ ~~ (("A" ~~ ___ ~~ "B") | ("B" ~~ ___ ~~ "A")))]]

Out[3]= {"ACCZB", "CSAACXB"}

Oyvind Tafjord
Wolfram Research 


  • Prev by Date: Re: Re: Re: Finding the Number of Pythagorean Triples below a bound
  • Next by Date: Re: Re: Re: "No more memory available" -- a recurring problem
  • Previous by thread: Re: Pattern match question
  • Next by thread: Re: Pattern match question