MathGroup Archive: December 2002 [00282]

[Date Index] [Thread Index] [Author Index]

Re: Efficient Sorting Algorithm

To: mathgroup at smc.vnet.net
Subject: [mg38326] Re: Efficient Sorting Algorithm
From: bghiggins at ucdavis.edu (Brian Higgins)
Date: Thu, 12 Dec 2002 01:35:43 -0500 (EST)
References: <asn4ng$465$1@smc.vnet.net> <at4che$er3$1@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com

Many thanks for all the suggestions from Allan, Harmut, Daniel. In my
original post I did not specify why I was interested in the sorting
algorithm. The purpose was to generate what are called DNA dot plots
which can be used to compare two DNA sequences. For eaxmple, consider
the following sequences

myDNA1 = "TCTGCTTTCTTCCAAATTGATGCTGGATAGAGGTGTTTATTTCTATTCTCATATTCCTAA
GTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTGCGTCCTGAACCTCTAAGTATGCGTCCTGAACCTCTAAGTATCATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTCGGGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGAAGGCAGAGCTGCAGTGAGCAGAGATCG";

myDNA2 = "ACAAGAAGGCTGCTGCCACCAGCCTGTGAAGCAAGGTTAAGGTGAGAAGGCTGGAGGTGAGATTCTGGGCAGGTAGGTACTGGAAGCCGGGGCAAGGTGCAGAAAGGCAGAAAGTGTTTCTGAAAGAGGGATTAGCCCGTTGTCTTACACATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTGCGTCCTGAACCTCTAAGTATGCGTCCTGAACCTCTAAGTATCATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTTAGTCTGACTTTGCACCTGCTCTGTGATTATGACTATCCCACAGTCTCCTA";

These sequences were "engineered" to have several common blocks of
DNA. Using the following program a plot is generated that shows the
alignment of the DNA blocks ( I have used one of Allan's algorithms
(Matchings7) for searching for common DNA blocks

dotPlot1[dna1_, dna2_, BlockSize_, offset_] := Module[{st, pt, sp, s1,
s2, s, rr},
s1 = MapIndexed[List[#, (First[#2] - 1)offset + 1] &, 
Map[StringJoin[#] &, Partition[Characters[dna1], BlockSize, offset],  
 1]];
s2 = MapIndexed[List[#, (First[#2] - 1)offset + 1] &,  
Map[StringJoin[#] &, Partition[Characters[dna2], BlockSize, offset], 
1]];
s = {s1, s2};   
 st = #[[Ordering[#[[All, 1]]]]] & /@ s;   
 sp = Split[#, #1[[1]] === #2[[1]] &] & /@ st; 
 pt = (Alternatives @@ (Intersection @@ st[[All, All, 1]]));
   rr = Transpose[Cases[#, {{pt, _}, ___}] & /@ sp];
   ListPlot[ Flatten[Apply[Outer[List, ##, 1] &, rr, {1}],    2] /.
{{a_String, x1_}, {b_String, y1_}} -> {x1, y1},  AspectRatio -> 1,
PlotStyle -> {RGBColor[0, 0, 1]}, Frame -> True,       FrameLabel ->
{"DNA 1", "DNA 2"}, RotateLabel -> False, Axes -> False]]

dotPlot1[myDNA1, myDNA2, 6, 1]

One of the reasons Allan's code is so efficient is the use of patterns
in Cases and Intersection. The question then arises how would one
proceed if  the test is not the equality of blocks, but some other
measure. For example 2 blocks of 8 nucleotides are "equal" if at least
6 of the eight bases match , e,g,

newTest:=Count[Characters[#1]-Characters[#2],0]>6&

I have attempted a search using such a test but  my code cannnot
compete with Allan's or  the various variants that have been
suggested. The bottleneck is in Cases or Intersection.

I did find that if my test used StringMatchQ[block1, block2,
SpellingCorrection-> True], I got quite good performance, though this
test is not precise, and I was not able to extend it. Would be nice to
be able to modify the criteria used in SpellingCorrection so that it
gives True for user defined criteria.

Brian



 to Wol"Allan Hayes" <hay at haystack.demon.co.uk> wrote in message news:<at4che$er3$1 at smc.vnet.net>...
> I give below some further speed-ups.
> The improvement, at least on  the data I used, is due to using
> Split[Sort[_]] right at the start to parcel the data - it probably depends
> on there being a lot of repetition of the first term of the entries in the
> data.
> The code for Matchings6 below is faster than Daniel Lichblau's recently
> posted code - though he suggests that this might be speeded up by
> substituting numbers for strings and compiling.
> 
> Data
> 
>     s1 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]],
>             Random[Integer, {1, 2000}]}, {6000}];
>     s2 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]],
>             Random[Integer, {1, 2000}]}, {12000}];
> 
> Get the members of s1 with the same first entry as some member of s2.
> 
> Previous code (preserves order)
> 
>     Matched1[s1_,s2_]:=
>       Cases[ s1,{Alternatives@@Union[s2[[All,1]]],_}]
> 
>     Matched1[s1,s2];//Timing
> 
>         {3.4 Second,Null}
> 
> New code (does not preserve order) - note the use of  s1[[Ordering[s1[[All,
> 1]]]]] to save on ordering with respect to the second entries
> 
>     Matched4[s1_,s2_]:=
>           Cases[Split[s1[[Ordering[s1[[All,1]]]]], #1[[1]]===#2[[1]]&],
>                 {{Alternatives@@Union[s2[[All,1]]],_},___}]
> 
>     Matched4[s1,s2];//Timing
> 
>     {1.6 Second,Null}
> 
> 
> Get the full matchings (this code will work on {s1,s2 ,....,sn}, as well as
> just {s1, s2})
> 
>    Matchings7[s_]:=
>     Module[{st,pt,sp},
>       st = #[[Ordering[#[[All,1]]]]]&/@s;
>       sp=Split[#,#1[[1]]===#2[[1]]&]&/@st;
>       pt= (Alternatives@@(Intersection@@ st[[All,All,1]]));
>       Transpose[Cases[#,{{pt,_},___}]&/@sp]
>       ];
> 
>     (ms7=Matchings7[{s1,s2}]);//Timing
> 
>         {3.46 Second,Null}
> 
> 
> Daniel Lichtblau (gives essentially the same information as Matchings7)
> 
>     myTest[l1_, l2_] :=
>     Module[{s1, s2, m = Length[l1], n = Length[l2], j, k, res = {}, ord },
>         s1 = Sort[l1]; s2 = Sort[l2];
>         For[j = 1; k = 1, j <= m && k <= n, Null,
>              ord = Order[s1[[j,1]], s2[[k,1]]];
>              If[ord == 1, j++; Continue[]];
>              If[ord == -1, k++; Continue[]];
>              res = {res, {s1[[j]], s2[[k]]}};
>              j++; k++;
>         ];
>         Partition[Partition[Flatten[res], 2], 2]
>     ]
> 
>     myTest[s1,s2];//Timing
> 
>         {7.47 Second,Null}
> 
> 
> --
> Allan
> 
> ---------------------
> Allan Hayes
> Mathematica Training and Consulting
> Leicester UK
> www.haystack.demon.co.uk
> hay at haystack.demon.co.uk
> Voice: +44 (0)116 271 4198
> Fax: +44 (0)870 164 0565

Prev by Date: Re: Question on factor group calculations

Next by Date: RE: Efficient Sorting Algorithm ++ SortSplit1 and Merge

Previous by thread: Re: Efficient Sorting Algorithm

Next by thread: Copying/Exporting graphics to other applications