MathGroup Archive 1998

Re: String patterns


  • To: mathgroup@smc.vnet.net
  • Subject: [mg10943] Re: String patterns
  • From: Daniel Lichtblau <danl@wolfram.com>
  • Date: Sat, 14 Feb 1998 00:53:17 -0500
  • Organization: Wolfram Research, Inc.
  • References: <6brf6t$e78@smc.vnet.net>

MJE wrote:
> 
> Programming challenge:
> 
> Is there an elegant means of doing cryptanalysis in Mathematica as
> opposed to any other language?  I am mainly thinking of
> pattern-matching functions.  In this case, the pattern would be
> dynamic, not predefined.  I am not certain how to create and test
> patterns on the fly.
> 
> The primary task is to count letter, digraph, trigraph, and higher-order
> frequencies.
> 
> Output for the trigraph case might look like this:
> 
> THE    0.01350000
> AND    0.00709421
> ION    0.00559429
> ING    0.00510783
> TIO    0.00466191
> ENT    0.00458083
> RES    0.00417545
>    <...etc....>
> BEP    0.00004054
> 
> The real number represents the fractional occurrence of the trigraph
> among all trigraphs in the sample.  These were computed by a DOS
> utility on a particular sample text.  The word "the" occurred 333 times
> out of 24668 total trigraph sequences, giving an estimated probability
> for this trigraph of 333/24668=0.01350000.
> 
> Trigraphs overlap.  If I parse the following phrase,
> 
>      "I love Mathematica"
> 
> then the first trigraph is "I l" (spaces count), the second is " lo",
> and the third is "lov".
> 
> One must define an "alphabet" with a sorting order.  A good way to do
> this is with a string variable like this:
> 
>      "abcdefghijklmno..."
> 
> How good is Mathematica at this kind of string manipulation and
> searching?
> 
> Mark Evans
> evans@gte.net


To find frequencies of a small set of given trigraphs you might use
StringPosition.

In[23]:= str = "I love Mathematica because it has Mathieu functions, matrix operations, and pattern matching.";

In[24]:= strL = ToLowerCase[str];
General::spell1:
   Possible spelling error: new symbol name "strL"
     is similar to existing symbol "str".

In[26]:= Length[StringPosition[strL, "mat"]]
Out[26]= 5

To check frequencies of all triads that occur in your string you first
might form the triads explicitly, as below.
 
triads = Union[Table[StringTake[strL, {j, j+2}], {j, StringLength[strL]-2}]];

Then you could do

In[51]:= Map[Length[StringPosition[strL,#]]&, triads]
Out[51]= {1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1,
>    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 4, 5, 1,
>    1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,
>    1, 1, 1, 1, 1, 1, 1}

If you are working with large strings and many triads, a more efficient
method might be to initialize a set of function values, one entry per
triad, to zeroes. For example,

In[54]:= Do[freq[triads[[j]]] = 0, {j,Length[triads]}]

In[56]:= ?freq
Global`freq
freq[", a"] = 0
freq["a b"] = 0
...

Then iterate over the string, and for each triad you find increment the
appropriate function value. Takes a bit of coding (not too much) but
should be reasonably fast.
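
A minimal sketch of that single pass, under the assumption that every
triad has first been given a zero entry as in In[54]. It repeats the
earlier definitions so that it runs on its own; the name n is
introduced here only for illustration.

```mathematica
(* Sample text from the session above, lowercased *)
str = "I love Mathematica because it has Mathieu functions, matrix operations, and pattern matching.";
strL = ToLowerCase[str];

(* n = number of overlapping triads in the string (hypothetical name) *)
n = StringLength[strL] - 2;
triads = Union[Table[StringTake[strL, {j, j + 2}], {j, n}]];

(* Start every counter at zero, as in In[54] *)
Clear[freq];
Do[freq[triads[[j]]] = 0, {j, Length[triads]}];

(* Single pass: take the triad at each position and bump its counter *)
Do[
  With[{t = StringTake[strL, {j, j + 2}]},
    freq[t] = freq[t] + 1],
  {j, n}
];

freq["mat"]   (* 5, the same count StringPosition gave in Out[26] *)
```

This walks the string once, instead of scanning the whole string once
per triad as the StringPosition approach does.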


Daniel Lichtblau
Wolfram Research


