Re: string pattern search
- To: mathgroup at smc.vnet.net
- Subject: [mg57373] Re: [mg57305] string pattern search
- From: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>
- Date: Wed, 25 May 2005 06:03:24 -0400 (EDT)
- Sender: owner-wri-mathgroup at wolfram.com
>-----Original Message----- >From: zak [mailto:chocolatez at gmail.com] To: mathgroup at smc.vnet.net >Sent: Tuesday, May 24, 2005 11:13 AM >Subject: [mg57373] [mg57305] string pattern search > >Hi >how to search for the pattern "abc" in the string >"aiibbaecXXacarbiiicc" >and the letters space between a,b,c is not important, so the >output will be: >1-when overlapping not allowed: >{aiibbaec},{acarbiiic} >2-and when we consider overlapping: >{aiibbaec},{aecXXacarbiiic},{acarbiiic},{arbiiic} > [...] I can't help you with string pattern matching as my Mathematica version is old (4.1) :-( but I might illustrate pattern matching in general First "abc" is not yet a "pattern" to give the matches indicated, you would allow arbitrary characters in between. The task of matching characters in a stream can be mapped to matching elements in a list by In[1]:= xx = Characters["aiibbaecXXacarbiiicc"] Out[1]= {"a", "i", "i", "b", "b", "a", "e", "c", "X", "X", "a", "c", "a", "r", "b", "i", "i", "i", "c", "c"} When you want "non overlapping matches", repeatedly try on your list keeping only the still unmatched parts. You can travel down the list easily with FixedPoint, and to collect the results use FixedPointList: In[2]:= FixedPointList[ Replace[Last[#], {{___,"a",atr___,"b",btr___,"c",rest__}:> {StringJoin["a",atr,"b",btr,"c"], {rest}}, _:> {stop}}]&, {xx}]; In[3]:= Take[%,{2,-3}][[All,1]] Out[3]= {"aiibbaec", "acarbiiic"} When you consider overlapping, you obviously retry after moving away from the first matching "a"; such accordingly: In[4]:= FixedPointList[ Replace[Last[#], {{___,"a",atr___,"b",btr___,"c",rest__}:> {StringJoin["a",atr,"b",btr,"c"], {atr,"b",btr,"c",rest}}, _:> {stop}}]&, {xx}]; In[5]:= Take[%,{2,-3}][[All,1]] Out[5]= {"aiibbaec", "aecXXacarbiiic", "acarbiiic", "arbiiic"} These are, however, not all possible matches. The Mathematica machinery for that is ReplaceList: In[6]:= ReplaceList[xx,{___,"a",atr___,"b",btr___,"c",___}:> StringJoin["a",atr,"b",btr,"c"]]; In[7]:= First/@Split[%] Out[7]= {"aiibbaec", "aiibbaecXXac", "aiibbaecXXacarbiiic", "aecXXacarbiiic", "acarbiiic", "arbiiic", "aiibbaecXXacarbiiicc", "aecXXacarbiiicc", "acarbiiicc", "arbiiicc"} Here I identified those matches which just differ on the position of the "intermediate" matching "b" (with same matching positions for "a" and "c"). If you want to see all posible matches, indicating the matching positions, you might like to evaluate this: ReplaceList[xx, {h___,"a",atr___,"b",btr___,"c",t___}:> GridBox[{Join[{h},StyleForm[#, FontWeight->"Bold", FontColor->Hue[0,1,.7]]& /@{"a",atr,"b",btr,"c"},{t}], {Sequence@@Table[" ",{Length[{h}]}], "\[UpArrow]",Sequence@@Table[" ",{Length[{atr}]}], "\[UpArrow]",Sequence@@Table[" ",{Length[{btr}]}],"\[UpArrow]", Sequence@@Table[" ",{Length[{t}]}]}}] ]//ColumnForm//DisplayForm The matching part here is highlighted, arrows indicate the positions of the matching characters. -- Hartmut Wolf