Re: Re:Re: Set of strings reducing problem (Again!)

*To*: mathgroup at smc.vnet.net*Subject*: [mg60666] Re: Re:[mg60599] Re: Set of strings reducing problem (Again!)*From*: <dkrjeg at adelphia.net>*Date*: Fri, 23 Sep 2005 04:20:18 -0400 (EDT)*Sender*: owner-wri-mathgroup at wolfram.com

---- Edson Ferreira <edsferr at uol.com.br> wrote: > dkr, > > I have tested your approach to the set of strings reduction problem. > All I can say is that I astonished. Your code is up to 27 times faster than the one I've got. > Awesome! > > If I put the option Shortmatch->True then your code is up to 14 times faster. > > For example, I got a list with 864 strings. I applied some filter on it and got a new one with 237 strings. Then I applied both my code and your code to this filtered set os strings. > > Although results are different with I put the shortmatch option,I mean the length of the reduced sets, both are correct. My code took 112 seconds while yours took 8 seconds to perform the reduction. Without the option, your code took 4 seconds! > > By the way, What do you mean by shortest pattern match? I didn't get it! It took more steps in your examples to perform the same reducing operation than with shortmatch->False. Could your clarify this concept? Edson, From the Mathematica book, Section 2.8.4: "When you give an object such as x__ or e.. in a string pattern, Mathematica normally assumes that you want this to match the longest possible sequence of characters. Sometimes, however, you may instead want to match the shortest possible sequence of characters. You can specify this using ShortestMatch[p]." In[1]:= StringCases["aaabb","aaa"~~x___] Out[1]= {aaabb} In[2]:= StringCases["aaabb",ShortestMatch["aaa"~~x___]] Out[2]= {aaa} In the first case above, the longest string that matches the pattern is found. In the second case, the shortest string that matches the pattern is found. However, one has to read the above quote carefully, as the following example indicates: In[13]:= StringCases["{111}{11X}{112}{1X1}", ShortestMatch[ StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___, "}"]]]//InputForm Out[13]//InputForm= {"{111}{11X}{112}{1X1}"} In[14]:= StringCases["{111}{11X}{112}{1X1}", StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___, "}"]]//InputForm Out[14]//InputForm= {"{111}{11X}{112}"} In this example, using ShortestMatch picks out the LONGER string. By examining how the individual parts of the pattern are matched, we can get a better idea of what is happening: In[15]:= StringCases["{111}{11X}{112}{1X1}", ShortestMatch[ StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___, "}"]]:>{{a},{p},{c},{z},{q}}]//InputForm Out[15]//InputForm= {{{"1"}, {"1"}, {"1"}, {"{11X}{112}"}, {"X"}}} In[16]:= StringCases["{111}{11X}{112}{1X1}", StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___, "}"]:>{{a},{p},{c},{z},{q}}]//InputForm Out[16]//InputForm= {{{"11"}, {"1"}, {""}, {"{11X}"}, {"2"}}} It appears that Shortest Match is trying to find the shortest string to play the role of the piece a, while the default behavior exhibited in Out[16] tries first to find the longest string to play the role of a. Once the choice of a is made, I presume that, if there were more than one choice that could be made for p (there isn't in this pattern), ShortestMatch would try to match that component with the shortest string and then so on down the line. In any event, the essential point is that in my reductionFn, using the option shortmatch (the choice of terminology might not be the best), you may in general generate a different sequence of pattern matches than by using the default behavior of the string pattern matcher. To see this, let's look at the sequence of matches that are generated in reductionFn during the first iteration of StringReplace, first using the default behavior of the string pattern matcher, and then using the shortmatch option set to True. In[1]:= origList={"111","11X","112","1X1","1XX","1X2","121","12X","122"}; str=ToString[{#}]&/@origList//StringJoin Out[2]= {111}{11X}{112}{1X1}{1XX}{1X2}{121}{12X}{122} In[3]:= %//InputForm Out[3]//InputForm= "{111}{11X}{112}{1X1}{1XX}{1X2}{121}{12X}{122}" In[4]:= rr={{"1","X"},{"1","2"},{"1","U"},{"2","X"},{"M","X"},{"2","D"}}; Attributes[ss]=Orderless; ss[i_String,j_String]/;MemberQ[rr,{i,j}]:=ss[i,j]={i,j}; In[7]:= StringCases[str, StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___,"}"]/; MatchQ[ss[p,q],_List]]//InputForm Out[7]//InputForm= {"{111}{11X}{112}", "{1X1}{1XX}{1X2}", "{121}{12X}{122}"} In[8]:= StringCases[str, ShortestMatch[ StringExpression["{",a:(WordCharacter)...,p:(WordCharacter), c:(WordCharacter)...,"}",z___,"{",a___,q:(WordCharacter),c___,"}"]/; MatchQ[ss[p,q],_List]]]//InputForm Out[8]//InputForm= {"{111}{11X}{112}{1X1}", "{1XX}{1X2}{121}{12X}"} Thus with shortmatch not set to True, 3 replacements will be made in the first iteration of StringReplace, while only 2 will be made with shortmatch set to True.