En:Re:String filtering problem
- To: mathgroup at smc.vnet.net
- Subject: [mg61929] En:Re:String filtering problem
- From: "Edson Ferreira" <edsferr at uol.com.br>
- Date: Sat, 5 Nov 2005 01:52:11 -0500 (EST)
- Sender: owner-wri-mathgroup at wolfram.com
dkr,
Thanks for your new solution.
I'm going to test it this weekend.
From what I have read, we have a new winner in terms of performance.
Thanks!!!!!!!!!!
> Edson,
>
> Here is one last crack at your filtering problem. It is much simpler than
> my previous filters and very competitive in terms of speed.
>
> filter11[origList:{__String}]:=
> StringCases[ToString[origList],
> RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]];
>
> We simply form a master string from your list of strings using ToString,
> and then use a regular expression to weed out the original strings with
> bad runs.
> Explanation of the regular expression:
> If we simply wanted to pull out the original strings from the master string,
> we could do this using
> StringCases[ToString[origList], RegularExpression["\\b[^,]+\\b"]];
> The regular expression characterizes strings that lie between word
> boundaries and consist of 1 or more characters that are not commas. (In this
> example the lefthand word boundaries take the form of either { or
> whitespace, while the righthand word boundaries take the form of either a
> comma or a righthand brace.) [^,]+ will match as large a string as possible,
> and hence your original strings will be generated.
>
> Then, to generate only those strings that don't have bad runs, we insert the
> "negative lookahead" condition (?![^,]*(XXXXXX|222222|11111111)). It
> essentially requires that the following text cannot begin with 0 or more
> characters that are not commas followed by a bad run. This suffices to rule
> out your bad strings. Since I am a novice as far as regular expressions go,
> it is likely that someone can suggest an alternative regular expression that
> will be even faster.
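
The two patterns described above can be compared on a small made-up list. This is only a sketch: the list demo and the name master are illustrative assumptions, not data or names from the thread. The first call uses the plain pattern and should return every original string; the second adds the negative lookahead and should drop the strings that contain a bad run.

```mathematica
(* demo is a hypothetical test list, not taken from the original data *)
demo = {"12X12X", "1XXXXXX2", "22222212", "X1X2X1"};

(* ToString produces the master string "{12X12X, 1XXXXXX2, 22222212, X1X2X1}" *)
master = ToString[demo];

(* Plain pattern: each comma-free run lying between word boundaries *)
StringCases[master, RegularExpression["\\b[^,]+\\b"]]
(* expected: {"12X12X", "1XXXXXX2", "22222212", "X1X2X1"} *)

(* With the negative lookahead: a match may not start where the text up to *)
(* the next comma contains XXXXXX, 222222 or 11111111 *)
StringCases[master,
  RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]]
(* expected: {"12X12X", "X1X2X1"} *)
```

Because [^,]* inside the lookahead cannot cross a comma, a bad run in a later string does not disqualify an earlier one.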
>
> Below I have repeated the tables from my previous message, adding a line for filter11 to each table.
>
> In[1]:=
> makeList[strLen_,listLen_]:=
> Table[StringJoin[{"1","2","X"}[[
> Table[Random[Integer,{1,3}],{strLen}]]]],{listLen}];
> In[2]:=
> SeedRandom[1234];
> egList1=makeList[14,1000];
> egList2=makeList[30,1000];
> egList3=makeList[30,20000];
> egList4=makeList[100,20000];
>
>
> Filter   origList   Time(secs)*
> 6Alt2    egList4    0.935
> 9        egList4    1.33
> 10       egList4    1.035
> 11       egList4    0.91
>
>
> 6Alt2    egList3    0.63
> 9        egList3    0.59
> 10       egList3    0.585
> 11       egList3    0.475
>
>
> 6Alt2    egList2    0.06
> 9        egList2    0.03
> 10       egList2    0.025
> 11       egList2    0.02
>
>
> * Average of two runs. Mathematica was restarted before each run for
> each filter.
>
>
> In[2]:=
> SeedRandom[5678];
> egList1=makeList[14,1000];
> egList2=makeList[30,1000];
> egList3=makeList[30,20000];
> egList4=makeList[100,20000];
>
>
> Filter   origList   Time(secs)*
> 6Alt2    egList4    0.935
> 9        egList4    1.305
> 10       egList4    1.04
> 11       egList4    0.975
>
>
> 6Alt2    egList3    0.645
> 9        egList3    0.65
> 10       egList3    0.58
> 11       egList3    0.465
>
>
> 6Alt2    egList2    0.07
> 9        egList2    0.03
> 10       egList2    0.03
> 11       egList2    0.025
>
>
> * Average of two runs. Mathematica was restarted before each run for
> each filter.
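
The post does not show how the times in the tables were taken. One plausible way to take a single measurement, assuming filter11 and egList4 are defined as above, is sketched here; the averaging over two runs and the restart between runs would be done by hand.

```mathematica
(* Timing returns a list whose first element is the CPU time used. *)
(* The trailing semicolon discards the (long) filtered list itself. *)
First[Timing[filter11[egList4];]]
```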
>
> Thus, as with your earlier string reduction problem, using a master string and exploiting Mathematica's powerful string pattern capabilities may be a useful approach, especially when coupled with Maxim Rytin's excellent suggestion of using regular expressions. I don't believe there is an analogue in Mathematica's StringExpression for the type of lookahead condition that was used in filter11.
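
For comparison, the same filtering can be written without a master string or a lookahead by testing each string on its own. The sketch below is an illustrative baseline only: the name filterSelect is hypothetical, and it is not claimed to be one of the filters timed in the tables. It can serve as a cross-check of filter11's output.

```mathematica
(* filterSelect is a hypothetical name; it keeps the strings that contain *)
(* none of the three bad runs *)
filterSelect[origList : {__String}] :=
  Select[origList, StringFreeQ[#, "XXXXXX" | "222222" | "11111111"] &];

(* Cross-check: should give True if both filters keep the same strings, *)
(* in the same order, for a test list such as egList1 *)
filterSelect[egList1] === filter11[egList1]
```

No timing claim is made for this version; it would need to be measured on the same lists before comparing it with the figures in the tables.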
>
> dkr
>
>