En:Re:String filtering problem
- To: mathgroup at smc.vnet.net
- Subject: [mg61929] En:Re:String filtering problem
- From: "Edson Ferreira" <edsferr at uol.com.br>
- Date: Sat, 5 Nov 2005 01:52:11 -0500 (EST)
- Sender: owner-wri-mathgroup at wolfram.com
dkr,

Thanks for your new solution. I'm going to test it this weekend. From what I have read, we have a new winner in terms of performance.

Thanks!

> Edson,
>
> Here is one last crack at your filtering problem. It is much simpler than my previous filters and very competitive in terms of speed.
>
> filter11[origList:{__String}]:=
>   StringCases[ToString[origList],
>     RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]];
>
> We simply form a master string from your list of strings using ToString, and then use a regular expression to weed out the original strings with bad runs.
>
> Explanation of the regular expression:
>
> If we wanted simply to pull the original strings out of the master string, we could do this using
>
>   StringCases[ToString[origList], RegularExpression["\\b[^,]+\\b"]];
>
> The regular expression characterizes strings that lie between word boundaries and consist of 1 or more characters that are not commas. (In this example the lefthand word boundaries follow either { or whitespace, while the righthand word boundaries precede either a comma or a righthand brace.) [^,]+ will match as large a string as possible, and hence your original strings will be generated.
>
> Then, to generate only those that don't have bad runs, we insert the "negative lookahead" condition (?![^,]*(XXXXXX|222222|11111111)). It essentially requires that the following text cannot begin with 0 or more characters that are not commas followed by a bad run. This suffices to rule out your bad strings. Since I am a novice as far as regular expressions go, it is likely that someone can suggest an alternative regular expression that will be even faster.
>
> Below I have repeated the tables from my previous message, adding a line for filter11 to each table.
>
> In[1]:=
> makeList[strLen_,listLen_]:=
>   Table[StringJoin[{"1","2","X"}\[LeftDoubleBracket]
>     Table[Random[Integer,{1,3}],{strLen}]\[RightDoubleBracket]],{listLen}];
>
> In[2]:=
> SeedRandom[1234];
> egList1=makeList[14,1000];
> egList2=makeList[30,1000];
> egList3=makeList[30,20000];
> egList4=makeList[100,20000];
>
> Filter   origList   Time(secs)*
> 6Alt2    egList4    0.935
> 9        egList4    1.33
> 10       egList4    1.035
> 11       egList4    0.91
>
> 6Alt2    egList3    0.63
> 9        egList3    0.59
> 10       egList3    0.585
> 11       egList3    0.475
>
> 6Alt2    egList2    0.06
> 9        egList2    0.03
> 10       egList2    0.025
> 11       egList2    0.02
>
> * Average of two runs. Mathematica was restarted before each run for each filter.
>
> In[2]:=
> SeedRandom[5678];
> egList1=makeList[14,1000];
> egList2=makeList[30,1000];
> egList3=makeList[30,20000];
> egList4=makeList[100,20000];
>
> Filter   origList   Time(secs)*
> 6Alt2    egList4    0.935
> 9        egList4    1.305
> 10       egList4    1.04
> 11       egList4    0.975
>
> 6Alt2    egList3    0.645
> 9        egList3    0.65
> 10       egList3    0.58
> 11       egList3    0.465
>
> 6Alt2    egList2    0.07
> 9        egList2    0.03
> 10       egList2    0.03
> 11       egList2    0.025
>
> * Average of two runs. Mathematica was restarted before each run for each filter.
>
> Thus, as with your earlier string reduction problem, using a master string and exploiting Mathematica's powerful string pattern capabilities may be a useful approach, especially when coupled with Maxim Rytin's excellent suggestion of using regular expressions. I don't believe there is an analogue in Mathematica's StringExpression for the type of lookahead condition that was used in filter11.
>
> dkr
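
For anyone who wants to watch the lookahead do its job, here is a minimal sketch that runs filter11 on a tiny hand-made list. The definition of filter11 is copied from dkr's message; the list name `sample` and its four strings are made up for illustration and are not from the thread. It assumes, as the message does, that none of the strings contain commas.

```mathematica
(* filter11 exactly as given in dkr's message *)
filter11[origList : {__String}] :=
  StringCases[ToString[origList],
   RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]];

(* Hypothetical test list (an assumption, not data from the thread):
   the 2nd entry contains a run of six X, the 3rd a run of six 2 *)
sample = {"12X12X", "1XXXXXX2", "2222221", "X1X2"};

(* The master string that the regular expression actually scans *)
ToString[sample]

(* Without the lookahead: every original string is pulled back out *)
StringCases[ToString[sample], RegularExpression["\\b[^,]+\\b"]]

(* With the lookahead: strings containing a bad run are skipped *)
filter11[sample]
```

Tracing the pattern by hand, the middle call should give back all four strings, and the last call should keep only "12X12X" and "X1X2". The lookahead is tried at the word boundary in front of each candidate and fails whenever a bad run appears before the next comma, so the whole candidate is skipped rather than partially matched.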