MathGroup Archive 2005

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: string pattern search

  • To: mathgroup at smc.vnet.net
  • Subject: [mg57373] Re: [mg57305] string pattern search
  • From: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>
  • Date: Wed, 25 May 2005 06:03:24 -0400 (EDT)
  • Sender: owner-wri-mathgroup at wolfram.com

>-----Original Message-----
>From: zak [mailto:chocolatez at gmail.com] 
To: mathgroup at smc.vnet.net
>Sent: Tuesday, May 24, 2005 11:13 AM
>Subject: [mg57373] [mg57305] string pattern search
>
>Hi
>how to search for the pattern "abc" in the string 
>"aiibbaecXXacarbiiicc"
>and the letters space between a,b,c is not important, so the 
>output will be:
>1-when overlapping not allowed:
>{aiibbaec},{acarbiiic}
>2-and when we consider overlapping:
>{aiibbaec},{aecXXacarbiiic},{acarbiiic},{arbiiic}
>
[...]


I can't help you with string pattern matching as my Mathematica version
is old (4.1) :-( but I might illustrate pattern matching in general 

First "abc" is not yet a "pattern" to give the matches indicated, you
would allow arbitrary characters in between.


The task of matching characters in a stream can be mapped to matching
elements in a list by 

In[1]:= xx = Characters["aiibbaecXXacarbiiicc"]
Out[1]=
{"a", "i", "i", "b", "b", "a", "e", "c", "X", "X", "a", "c", "a", "r",
"b",
"i", "i", "i", "c", "c"}


When you want "non overlapping matches", repeatedly try on your list
keeping only the still unmatched parts. You can travel down the list
easily with FixedPoint, and to collect the results use FixedPointList:

In[2]:=
FixedPointList[ Replace[Last[#],
{{___,"a",atr___,"b",btr___,"c",rest__}:>
{StringJoin["a",atr,"b",btr,"c"], {rest}}, _:> {stop}}]&, {xx}];

In[3]:= Take[%,{2,-3}][[All,1]]
Out[3]= {"aiibbaec", "acarbiiic"}



When you consider overlapping, you obviously retry after moving away
from the first matching "a"; such accordingly:

In[4]:=
FixedPointList[ Replace[Last[#],
{{___,"a",atr___,"b",btr___,"c",rest__}:>
{StringJoin["a",atr,"b",btr,"c"], {atr,"b",btr,"c",rest}}, _:>
{stop}}]&, {xx}];

In[5]:= Take[%,{2,-3}][[All,1]]
Out[5]=
{"aiibbaec", "aecXXacarbiiic", "acarbiiic", "arbiiic"}



These are, however, not all possible matches. The Mathematica machinery
for that is ReplaceList:

In[6]:=
ReplaceList[xx,{___,"a",atr___,"b",btr___,"c",___}:>
StringJoin["a",atr,"b",btr,"c"]];

In[7]:= First/@Split[%]
Out[7]=
{"aiibbaec", "aiibbaecXXac", "aiibbaecXXacarbiiic", "aecXXacarbiiic", 
"acarbiiic", "arbiiic", "aiibbaecXXacarbiiicc", "aecXXacarbiiicc", 
"acarbiiicc", "arbiiicc"}

Here I identified those matches which just differ on the position of the
"intermediate" matching "b" (with same matching positions for "a" and
"c").



If you want to see all posible matches, indicating the matching
positions, you might like to evaluate this:

ReplaceList[xx,
{h___,"a",atr___,"b",btr___,"c",t___}:> GridBox[{Join[{h},StyleForm[#,
FontWeight->"Bold", FontColor->Hue[0,1,.7]]&
/@{"a",atr,"b",btr,"c"},{t}],
{Sequence@@Table[" ",{Length[{h}]}],
"\[UpArrow]",Sequence@@Table[" ",{Length[{atr}]}],
"\[UpArrow]",Sequence@@Table[" ",{Length[{btr}]}],"\[UpArrow]",
Sequence@@Table[" ",{Length[{t}]}]}}]
]//ColumnForm//DisplayForm

The matching part here is highlighted, arrows indicate the positions of
the matching characters.

--
Hartmut Wolf


  • Prev by Date: Re: Re: Mapping Data
  • Next by Date: Re: Solve or Reduce on a monstrosity of an expresssion (and a prize!)
  • Previous by thread: Re: string pattern search
  • Next by thread: Re: string pattern search