Re: Pure Function for String Selection
- To: mathgroup at smc.vnet.net
- Subject: [mg61594] Re: Pure Function for String Selection
- From: "dkr" <dkrjeg at adelphia.net>
- Date: Sun, 23 Oct 2005 05:45:57 -0400 (EDT)
- References: <dht479$hl1$1@smc.vnet.net><di7ra7$kov$1@smc.vnet.net> <diabsp$ijr$1@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
Edson,
Below we compare three approaches to your string filtering problem:
filter10 (Maxim Rytin's approach)
filter6Alt2 (A slightly amended version of my filter6 approach,
discussed previously)
filter9 (A new, single string approach)
_________________
selectString10Q[str_String]:=
StringFreeQ[str,RegularExpression["1{8}|X{6}|2{6}"]];
filter10[origList:{__String}]:=Select[origList,selectString10Q];
_________________
selectString6Alt2Q[str_String]:=
StringCases[str,{"XXXXXX","222222","11111111"},1]==={};
filter6Alt2[origList:{__String}]:=Select[origList,selectString6Alt2Q];
_________________
filter9[origList:{__String}]:=
StringCases[
StringReplace[ToString[origList],
RegularExpression["1{8}|X{6}|2{6}"]:>""],
RegularExpression[
StringJoin["\\w{",#,",",#,"}"]&[
ToString[StringLength[First[origList]]]]]];
Here we form a single string from the original list of strings with
ToString (unlike our previous filter7 case, we do not explicitly insert
list braces as delimiters), then delete every bad run from this string,
and finally use StringCases to pick out the remaining runs of word
characters whose length equals the common length of the original
strings. A bad original string is not picked out, because deleting its
bad run leaves it shorter than that common length. Note that this
relies on all strings in the list having the same length.
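To make the three steps concrete, here is a minimal sketch that runs
them one at a time on a tiny list. The names sample, joined and cleaned
are mine and purely illustrative; the regular expression for the bad
runs is the one used in filter9.

```mathematica
(* Illustrative three-string list; every string has length 8. *)
sample = {"12X12X12", "XXXXXX12", "21X21X21"};

(* Step 1: turn the whole list into one string. *)
joined = ToString[sample]
(* "{12X12X12, XXXXXX12, 21X21X21}" *)

(* Step 2: delete every bad run (eight 1s, six Xs or six 2s). *)
cleaned = StringReplace[joined,
   RegularExpression["1{8}|X{6}|2{6}"] :> ""]
(* "{12X12X12, 12, 21X21X21}" *)

(* Step 3: keep only the runs of word characters that still have the
   full length. The second string shrank to "12", so it drops out. *)
StringCases[cleaned,
  RegularExpression[
   "\\w{" <> ToString[StringLength[First[sample]]] <> "}"]]
(* {"12X12X12", "21X21X21"} *)
```

The quantifier \w{8} used here matches the same runs as the \w{8,8}
form that filter9 builds.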
_________________
In[1]:=
makeList[strLen_,listLen_]:=
Table[StringJoin[{"1","2","X"}[[
Table[Random[Integer,{1,3}],{strLen}]]]],{listLen}];
In[2]:=
SeedRandom[1234];
egList1=makeList[14,1000];
egList2=makeList[30,1000];
egList3=makeList[30,20000];
egList4=makeList[100,20000];
Filter   origList   Time (secs)*
6Alt2    egList4    0.935
9        egList4    1.33
10       egList4    1.035
6Alt2    egList3    0.63
9        egList3    0.59
10       egList3    0.585
6Alt2    egList2    0.06
9        egList2    0.03
10       egList2    0.025
* Average of two runs. Mathematica was restarted before each run for
each filter.
In[2]:=
SeedRandom[5678];
egList1=makeList[14,1000];
egList2=makeList[30,1000];
egList3=makeList[30,20000];
egList4=makeList[100,20000];
Filter   origList   Time (secs)*
6Alt2    egList4    0.935
9        egList4    1.305
10       egList4    1.04
6Alt2    egList3    0.645
9        egList3    0.65
10       egList3    0.58
6Alt2    egList2    0.07
9        egList2    0.03
10       egList2    0.03
* Average of two runs. Mathematica was restarted before each run for
each filter.
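For anyone who wants to repeat this kind of measurement, the following
is a rough sketch of one way to do it with Timing. It is not the exact
procedure behind the tables above: averageTiming, noBadRunQ,
sampleFilter and sampleList are hypothetical names, the stand-in filter
simply repeats the logic of filter10, and both evaluations happen in
one session rather than after a restart, so the second evaluation may
benefit from cached patterns.

```mathematica
(* Hypothetical helper: average CPU time of f[list] over n runs.
   Timing holds its argument, so f[list] is evaluated inside it. *)
averageTiming[f_, list_, n_Integer : 2] :=
  (Plus @@ Table[First[Timing[f[list];]], {n}])/n;

(* Stand-in filter with the same logic as filter10. *)
noBadRunQ[str_String] :=
  StringFreeQ[str, RegularExpression["1{8}|X{6}|2{6}"]];
sampleFilter[list : {__String}] := Select[list, noBadRunQ];

(* Test data built the same way as makeList[30, 1000]. *)
SeedRandom[1234];
sampleList =
  Table[StringJoin[{"1", "2", "X"}[[
      Table[Random[Integer, {1, 3}], {30}]]]], {1000}];

averageTiming[sampleFilter, sampleList]
```

Timing reports CPU time and its resolution is coarse, so timings for
small lists such as egList2 should be read with some caution.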
Based on these limited test results, the three filters perform about
the same, except that the single-string method (filter9) lags on
problems the size of egList4: about 1.3 seconds, versus roughly 0.94 to
1.04 seconds for the other two.
One thing I have noted in my testing is that it is faster to use
patterns like RegularExpression["1{8}|X{6}|2{6}"] or
"XXXXXX"|"222222"|"11111111" than it is to use a pattern that chains
single characters with ~~, such as
"X"~~"X"~~"X"~~"X"~~"X"~~"X"|"2"~~"2"~~"2"~~"2"~~"2"~~"2"|
"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1".
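A small sketch of how the three pattern forms could be checked against
each other follows. The names regexPattern, literalPattern,
chainedPattern, testList and selectWith are made up for illustration,
and the test uses StringFreeQ inside Select, which is only one of
several contexts in which the patterns might be compared.

```mathematica
(* The three ways of writing the same set of bad runs. *)
regexPattern = RegularExpression["1{8}|X{6}|2{6}"];
literalPattern = "XXXXXX" | "222222" | "11111111";
chainedPattern =
  "X" ~~ "X" ~~ "X" ~~ "X" ~~ "X" ~~ "X" |
   "2" ~~ "2" ~~ "2" ~~ "2" ~~ "2" ~~ "2" |
   "1" ~~ "1" ~~ "1" ~~ "1" ~~ "1" ~~ "1" ~~ "1" ~~ "1";

(* Random test strings over the characters 1, 2 and X. *)
SeedRandom[1234];
testList =
  Table[StringJoin[{"1", "2", "X"}[[
      Table[Random[Integer, {1, 3}], {30}]]]], {2000}];

(* Keep the strings that contain none of the bad runs. *)
selectWith[patt_] := Select[testList, StringFreeQ[#, patt] &];

patterns = {regexPattern, literalPattern, chainedPattern};

(* All three forms should select exactly the same strings. *)
SameQ @@ (selectWith /@ patterns)

(* CPU time for each form, in the order listed in patterns. *)
First[Timing[selectWith[#];]] & /@ patterns
```

The actual numbers will depend on the machine and the Mathematica
version, so I have not listed any here.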