MathGroup Archive: October 2005 [00739]

Re: Pure Function for String Selection

To: mathgroup at smc.vnet.net
Subject: [mg61594] Re: Pure Function for String Selection
From: "dkr" <dkrjeg at adelphia.net>
Date: Sun, 23 Oct 2005 05:45:57 -0400 (EDT)
References: <dht479$hl1$1@smc.vnet.net><di7ra7$kov$1@smc.vnet.net> <diabsp$ijr$1@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com

Edson,

Below we compare three approaches to your string filtering problem:

filter10  (Maxim Rytin's approach)
filter6Alt2  (A slightly amended version of my filter6 approach,
discussed previously)
filter9  (A new, single string approach)
_________________

(* True if str contains none of the bad runs: eight 1s, six Xs or six 2s *)
selectString10Q[str_String]:=
    StringFreeQ[str,RegularExpression["1{8}|X{6}|2{6}"]];
filter10[origList:{__String}]:=Select[origList,selectString10Q];

_________________

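(* Same test with literal patterns; the third argument, 1, stops
   StringCases after the first match *)
selectString6Alt2Q[str_String]:=
    StringCases[str,{"XXXXXX","222222","11111111"},1]==={};
filter6Alt2[origList:{__String}]:=Select[origList,selectString6Alt2Q];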

_________________

(* Assumes every string in origList has the same length and consists
   only of word characters *)
filter9[origList:{__String}]:=
    StringCases[
      StringReplace[ToString[origList],
        RegularExpression["1{8}|X{6}|2{6}"]:>""],
      RegularExpression[
        StringJoin["\\w{",#,",",#,"}"]&[
          ToString[StringLength[First[origList]]]]]];

Here we form a single string from the original list of strings (though
unlike our previous filter7 case, we do not explicitly insert list
braces as delimiters), then replace all bad runs in this string with
"", and then, via StringCases, pick out all remaining runs of word
characters whose length is equal to the common length of the original
strings.  The runs corresponding to the bad original strings will not
be picked out because their length has been reduced by the replacement
operation.
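
To make the three steps concrete, here is a small walk-through on a
made-up list of three strings of length 8.  The list and the names eg,
joined, cleaned, n and kept are mine, chosen only for illustration; the
steps themselves mirror filter9.

```mathematica
(* Illustrative input: only the first string is free of bad runs. *)
eg = {"1X21X21X", "1XXXXXX2", "22222212"};

(* Step 1: one string for the whole list. *)
joined = ToString[eg]
(* "{1X21X21X, 1XXXXXX2, 22222212}" *)

(* Step 2: delete every bad run; the bad strings become shorter. *)
cleaned = StringReplace[joined, RegularExpression["1{8}|X{6}|2{6}"] :> ""]
(* "{1X21X21X, 12, 12}" *)

(* Step 3: keep only runs of word characters that still have full length. *)
n = StringLength[First[eg]];
kept = StringCases[cleaned, RegularExpression["\\w{" <> ToString[n] <> "}"]]
(* {"1X21X21X"} *)
```

The comma and space that ToString places between elements are not word
characters, so they act as the delimiters between candidate runs.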
_________________

In[1]:=
makeList[strLen_,listLen_]:=
    Table[StringJoin[{"1","2","X"}[[
          Table[Random[Integer,{1,3}],{strLen}]]]],{listLen}];
In[2]:=
SeedRandom[1234];
egList1=makeList[14,1000];
egList2=makeList[30,1000];
egList3=makeList[30,20000];
egList4=makeList[100,20000];

Filter    origList    Time(secs)*
6Alt2     egList4     0.935
9         egList4     1.33
10        egList4     1.035

6Alt2     egList3     0.63
9         egList3     0.59
10        egList3     0.585

6Alt2     egList2     0.06
9         egList2     0.03
10        egList2     0.025

* Average of two runs.  Mathematica was restarted before each run for
each filter.
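
A timing of this kind could be taken with something like the sketch
below.  The helpers timeFilter and averageTime are hypothetical names of
mine, not code used for the figures in this post, and the sketch does
not restart the kernel between runs, so its numbers would not be
directly comparable with the tables.

```mathematica
(* Hypothetical helper: CPU time for one evaluation of filter[list].
   The trailing semicolon discards the (possibly large) result. *)
timeFilter[filter_, list_] := First[Timing[filter[list];]];

(* Hypothetical helper: average over `runs` timings (two by default). *)
averageTime[filter_, list_, runs_:2] :=
   (Plus @@ Table[timeFilter[filter, list], {runs}])/runs;

(* Usage, assuming the filters and test lists of this post are defined:
   {#, averageTime[#, egList3]} & /@ {filter6Alt2, filter9, filter10} *)
```

In older versions Timing reports its result in units of Second, so the
averages carry that unit as well.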

In[2]:=
SeedRandom[5678];
egList1=makeList[14,1000];
egList2=makeList[30,1000];
egList3=makeList[30,20000];
egList4=makeList[100,20000];

Filter    origList    Time(secs)*
6Alt2     egList4     0.935
9         egList4     1.305
10        egList4     1.04

6Alt2     egList3     0.645
9         egList3     0.65
10        egList3     0.58

6Alt2     egList2     0.07
9         egList2     0.03
10        egList2     0.03

* Average of two runs.  Mathematica was restarted before each run for
each filter.

Based upon these limited test results, the three filters differ little
in speed, except that the single string method (filter9) is roughly 25
to 40 percent slower than the other two for problems the size of
egList4.  One thing I have noted in my testing is that it is faster to
use patterns like RegularExpression["1{8}|X{6}|2{6}"] or
"XXXXXX"|"222222"|"11111111" than it is to spell the same runs out
character by character, as in
("X"~~"X"~~"X"~~"X"~~"X"~~"X")|("2"~~"2"~~"2"~~"2"~~"2"~~"2")|
("1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1").
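
For anyone who wants to check the pattern comparison on their own
machine, a minimal sketch follows.  The names patRegex, patLiteral,
patConcat, testList, keep and timePattern are mine, the test data is
generated in the same way as makeList[30,20000], and no timings are
quoted because they will vary with machine and version.

```mathematica
(* The same set of bad runs written three ways.  The parentheses in
   patConcat make the grouping of ~~ and | explicit. *)
patRegex = RegularExpression["1{8}|X{6}|2{6}"];
patLiteral = "XXXXXX" | "222222" | "11111111";
patConcat = ("X"~~"X"~~"X"~~"X"~~"X"~~"X") |
   ("2"~~"2"~~"2"~~"2"~~"2"~~"2") |
   ("1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1"~~"1");

(* Illustrative test data: 20000 random strings of length 30. *)
testList = Table[
   StringJoin[{"1", "2", "X"}[[Table[Random[Integer, {1, 3}], {30}]]]],
   {20000}];

(* Hypothetical helpers: filter with one pattern, and time that filter. *)
keep[pat_] := Select[testList, StringFreeQ[#, pat] &];
timePattern[pat_] := First[Timing[keep[pat];]];

(* CPU time for each of the three pattern forms. *)
timePattern /@ {patRegex, patLiteral, patConcat}

(* Sanity check: all three forms should keep exactly the same strings. *)
SameQ @@ (keep /@ {patRegex, patLiteral, patConcat})
```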
