MathGroup Archive: November 2005 [00115]


En:Re:String filtering problem

To: mathgroup at smc.vnet.net
Subject: [mg61929] En:Re:String filtering problem
From: "Edson Ferreira" <edsferr at uol.com.br>
Date: Sat, 5 Nov 2005 01:52:11 -0500 (EST)
Sender: owner-wri-mathgroup at wolfram.com

dkr,

Thanks for your new solution. 
I'm going to test it this weekend. 

From what I have read, we have a new winner in terms of performance. 

Thanks!!!!!!!!!! 

> Edson, 
> 
> Here is one last crack at your filtering problem. It is much simpler than my previous filters and very competitive in terms of speed. 
> 
> filter11[origList:{__String}]:= 
> StringCases[ToString[origList], 
> RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]]; 
> 
> We simply form a master string from your list of strings using ToString, and then use a regular expression to weed out the original strings with bad runs. 
> Explanation of the regular expression: 
> If we wanted to simply pull out the original strings from the master string, we could do this using 
> StringCases[ToString[origList], RegularExpression["\\b[^,]+\\b"]]; 
> The regular expression characterizes strings that lie between word boundaries (in this example the left-hand word boundary follows either { or whitespace, while the right-hand word boundary precedes either a comma or the closing brace) and consist of 1 or more characters that are not commas. [^,]+ will match as large a string as possible, and hence your original strings will be generated. 
> Then, to generate only those that don't have bad runs, we insert the "negative lookahead" condition (?![^,]*(XXXXXX|222222|11111111)). It essentially requires that the following text cannot begin with 0 or more characters that are not commas followed by a bad run. This suffices to rule out your bad strings. Since I am a novice as far as regular expressions go, it is likely that someone can suggest an alternative regular expression that will be even faster. 
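A minimal sketch of the two steps described above, run on a small made-up list. The sample strings and the names sample, master, all and kept are assumptions for illustration only; they are not data from this thread.

```mathematica
(* Hypothetical sample: strings 2, 3 and 4 each contain one bad run,
   string 5 is a near miss (only five X's and seven 1's) *)
sample = {"12X12X12X12X12", "1XXXXXX2121212", "12222221X1X1X1",
   "X11111111212X2", "1XXXXX21111111"};

(* ToString turns the list into one master string: "{12X12X12X12X12, 1XXXXXX2121212, ...}" *)
master = ToString[sample];

(* Step 1: without the lookahead, every original string is recovered *)
all = StringCases[master, RegularExpression["\\b[^,]+\\b"]]

(* Step 2: with the negative lookahead, strings containing a bad run are skipped *)
kept = StringCases[master,
  RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]]
(* expected: {"12X12X12X12X12", "1XXXXX21111111"} *)
```

Because [^,]* cannot cross a comma, the lookahead only ever inspects the string that starts at the current word boundary, which is why a bad run in one string does not knock out its neighbours.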
> 
> Below I have repeated the tables from my previous message, adding a line for filter11 to each table. 
> 
> In[1]:= 
> makeList[strLen_,listLen_]:= 
> Table[StringJoin[{"1","2","X"}[[
> Table[Random[Integer,{1,3}],{strLen}]]]],{listLen}]; 
> In[2]:= 
> SeedRandom[1234]; 
> egList1=makeList[14,1000]; 
> egList2=makeList[30,1000]; 
> egList3=makeList[30,20000]; 
> egList4=makeList[100,20000]; 
> 
> 
> Filter   origList   Time(secs)* 
> 6Alt2    egList4    0.935 
> 9        egList4    1.33 
> 10       egList4    1.035 
> 11       egList4    0.91 
> 
> 
> 6Alt2    egList3    0.63 
> 9        egList3    0.59 
> 10       egList3    0.585 
> 11       egList3    0.475 
> 
> 
> 6Alt2    egList2    0.06 
> 9        egList2    0.03 
> 10       egList2    0.025 
> 11       egList2    0.02 
> 
> 
> * Average of two runs. Mathematica was restarted before each run for 
> each filter. 
> 
> 
> In[2]:= 
> SeedRandom[5678]; 
> egList1=makeList[14,1000]; 
> egList2=makeList[30,1000]; 
> egList3=makeList[30,20000]; 
> egList4=makeList[100,20000]; 
> 
> 
> Filter   origList   Time(secs)* 
> 6Alt2    egList4    0.935 
> 9        egList4    1.305 
> 10       egList4    1.04 
> 11       egList4    0.975 
> 
> 
> 6Alt2    egList3    0.645 
> 9        egList3    0.65 
> 10       egList3    0.58 
> 11       egList3    0.465 
> 
> 
> 6Alt2    egList2    0.07 
> 9        egList2    0.03 
> 10       egList2    0.03 
> 11       egList2    0.025 
> 
> 
> * Average of two runs. Mathematica was restarted before each run for 
> each filter. 
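For anyone wanting to repeat this kind of measurement, here is a rough sketch of timing filter11 on a freshly generated list. It is only an assumption about how such timings might be taken: the message does not say which timing function was used, the name testList is hypothetical, and RandomChoice requires a newer Mathematica version than the one used in the tables above.

```mathematica
(* filter11 as defined in the quoted message *)
filter11[origList : {__String}] :=
  StringCases[ToString[origList],
   RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]];

(* Hypothetical test data: 2000 random strings of length 30 over {"1", "2", "X"} *)
SeedRandom[1234];
testList = Table[StringJoin[RandomChoice[{"1", "2", "X"}, 30]], {2000}];

(* AbsoluteTiming returns {elapsed seconds, result} *)
{seconds, kept} = AbsoluteTiming[filter11[testList]];

seconds       (* wall-clock time; varies by machine and version *)
Length[kept]  (* number of strings that survived the filter *)
```

A single run like this is noisy, so averaging several runs (as in the tables) gives a fairer comparison between filters.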
> 
> Thus, as with your earlier string reduction problem, using a master string and exploiting Mathematica's powerful string pattern capabilities may be a useful approach, especially when coupled with Maxim Rytin's excellent suggestion of using regular expressions. I don't believe there is an analogue in Mathematica's StringExpression for the type of lookahead condition that was used in filter11. 
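As a cross-check on filter11's output, a plain Select with StringFreeQ can be used. This is not an analogue of the lookahead and not one of the filters timed above; it is a different, list-based technique, shown only as a sketch. The names badRuns, filterSelect and sample are hypothetical.

```mathematica
(* filter11 as defined in the quoted message *)
filter11[origList : {__String}] :=
  StringCases[ToString[origList],
   RegularExpression["\\b(?![^,]*(XXXXXX|222222|11111111))[^,]+\\b"]];

(* The three bad runs written as an ordinary string pattern *)
badRuns = "XXXXXX" | "222222" | "11111111";

(* Keep each string that contains none of the bad runs *)
filterSelect[origList : {__String}] :=
  Select[origList, StringFreeQ[#, badRuns] &];

(* Made-up sample, not data from this thread *)
sample = {"12X12X12X12X12", "1XXXXXX2121212", "12222221X1X1X1",
   "X11111111212X2", "1XXXXX21111111"};

filterSelect[sample] === filter11[sample]
(* expected: True *)
```

The two agree here because the sample strings consist only of the characters 1, 2 and X; filter11 relies on the strings containing no commas or spaces, whereas the Select version does not.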
> 
> dkr 
> 
>
