Re: Efficient Sorting Algorithm
- To: mathgroup at smc.vnet.net
- Subject: [mg38326] Re: Efficient Sorting Algorithm
- From: bghiggins at ucdavis.edu (Brian Higgins)
- Date: Thu, 12 Dec 2002 01:35:43 -0500 (EST)
- References: <asn4ng$465$1@smc.vnet.net> <at4che$er3$1@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
Many thanks for all the suggestions from Allan, Harmut, Daniel. In my original post I did not specify why I was interested in the sorting algorithm. The purpose was to generate what are called DNA dot plots which can be used to compare two DNA sequences. For eaxmple, consider the following sequences myDNA1 = "TCTGCTTTCTTCCAAATTGATGCTGGATAGAGGTGTTTATTTCTATTCTCATATTCCTAA GTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTGCGTCCTGAACCTCTAAGTATGCGTCCTGAACCTCTAAGTATCATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTCGGGTGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGAAGGCAGAGCTGCAGTGAGCAGAGATCG"; myDNA2 = "ACAAGAAGGCTGCTGCCACCAGCCTGTGAAGCAAGGTTAAGGTGAGAAGGCTGGAGGTGAGATTCTGGGCAGGTAGGTACTGGAAGCCGGGGCAAGGTGCAGAAAGGCAGAAAGTGTTTCTGAAAGAGGGATTAGCCCGTTGTCTTACACATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTGCGTCCTGAACCTCTAAGTATGCGTCCTGAACCTCTAAGTATCATATTCCTAAGTAAAACAGATAACTGCCTCTCAACTATATCAAGTAGACTAAAATATTGTTAGTCTGACTTTGCACCTGCTCTGTGATTATGACTATCCCACAGTCTCCTA"; These sequences were "engineered" to have several common blocks of DNA. Using the following program a plot is generated that shows the alignment of the DNA blocks ( I have used one of Allan's algorithms (Matchings7) for searching for common DNA blocks dotPlot1[dna1_, dna2_, BlockSize_, offset_] := Module[{st, pt, sp, s1, s2, s, rr}, s1 = MapIndexed[List[#, (First[#2] - 1)offset + 1] &, Map[StringJoin[#] &, Partition[Characters[dna1], BlockSize, offset], 1]]; s2 = MapIndexed[List[#, (First[#2] - 1)offset + 1] &, Map[StringJoin[#] &, Partition[Characters[dna2], BlockSize, offset], 1]]; s = {s1, s2}; st = #[[Ordering[#[[All, 1]]]]] & /@ s; sp = Split[#, #1[[1]] === #2[[1]] &] & /@ st; pt = (Alternatives @@ (Intersection @@ st[[All, All, 1]])); rr = Transpose[Cases[#, {{pt, _}, ___}] & /@ sp]; ListPlot[ Flatten[Apply[Outer[List, ##, 1] &, rr, {1}], 2] /. {{a_String, x1_}, {b_String, y1_}} -> {x1, y1}, AspectRatio -> 1, PlotStyle -> {RGBColor[0, 0, 1]}, Frame -> True, FrameLabel -> {"DNA 1", "DNA 2"}, RotateLabel -> False, Axes -> False]] dotPlot1[myDNA1, myDNA2, 6, 1] One of the reasons Allan's code is so efficient is the use of patterns in Cases and Intersection. The question then arises how would one proceed if the test is not the equality of blocks, but some other measure. For example 2 blocks of 8 nucleotides are "equal" if at least 6 of the eight bases match , e,g, newTest:=Count[Characters[#1]-Characters[#2],0]>6& I have attempted a search using such a test but my code cannnot compete with Allan's or the various variants that have been suggested. The bottleneck is in Cases or Intersection. I did find that if my test used StringMatchQ[block1, block2, SpellingCorrection-> True], I got quite good performance, though this test is not precise, and I was not able to extend it. Would be nice to be able to modify the criteria used in SpellingCorrection so that it gives True for user defined criteria. Brian to Wol"Allan Hayes" <hay at haystack.demon.co.uk> wrote in message news:<at4che$er3$1 at smc.vnet.net>... > I give below some further speed-ups. > The improvement, at least on the data I used, is due to using > Split[Sort[_]] right at the start to parcel the data - it probably depends > on there being a lot of repetition of the first term of the entries in the > data. > The code for Matchings6 below is faster than Daniel Lichblau's recently > posted code - though he suggests that this might be speeded up by > substituting numbers for strings and compiling. > > Data > > s1 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]], > Random[Integer, {1, 2000}]}, {6000}]; > s2 = Table[{FromCharacterCode[Table[Random[Integer, {65, 69}], {4}]], > Random[Integer, {1, 2000}]}, {12000}]; > > Get the members of s1 with the same first entry as some member of s2. > > Previous code (preserves order) > > Matched1[s1_,s2_]:= > Cases[ s1,{Alternatives@@Union[s2[[All,1]]],_}] > > Matched1[s1,s2];//Timing > > {3.4 Second,Null} > > New code (does not preserve order) - note the use of s1[[Ordering[s1[[All, > 1]]]]] to save on ordering with respect to the second entries > > Matched4[s1_,s2_]:= > Cases[Split[s1[[Ordering[s1[[All,1]]]]], #1[[1]]===#2[[1]]&], > {{Alternatives@@Union[s2[[All,1]]],_},___}] > > Matched4[s1,s2];//Timing > > {1.6 Second,Null} > > > Get the full matchings (this code will work on {s1,s2 ,....,sn}, as well as > just {s1, s2}) > > Matchings7[s_]:= > Module[{st,pt,sp}, > st = #[[Ordering[#[[All,1]]]]]&/@s; > sp=Split[#,#1[[1]]===#2[[1]]&]&/@st; > pt= (Alternatives@@(Intersection@@ st[[All,All,1]])); > Transpose[Cases[#,{{pt,_},___}]&/@sp] > ]; > > (ms7=Matchings7[{s1,s2}]);//Timing > > {3.46 Second,Null} > > > Daniel Lichtblau (gives essentially the same information as Matchings7) > > myTest[l1_, l2_] := > Module[{s1, s2, m = Length[l1], n = Length[l2], j, k, res = {}, ord }, > s1 = Sort[l1]; s2 = Sort[l2]; > For[j = 1; k = 1, j <= m && k <= n, Null, > ord = Order[s1[[j,1]], s2[[k,1]]]; > If[ord == 1, j++; Continue[]]; > If[ord == -1, k++; Continue[]]; > res = {res, {s1[[j]], s2[[k]]}}; > j++; k++; > ]; > Partition[Partition[Flatten[res], 2], 2] > ] > > myTest[s1,s2];//Timing > > {7.47 Second,Null} > > > -- > Allan > > --------------------- > Allan Hayes > Mathematica Training and Consulting > Leicester UK > www.haystack.demon.co.uk > hay at haystack.demon.co.uk > Voice: +44 (0)116 271 4198 > Fax: +44 (0)870 164 0565