Re: Protein Sequence Alignment efficiency
- To: mathgroup at smc.vnet.net
- Subject: [mg118932] Re: Protein Sequence Alignment efficiency
- From: leigh pascoe <leigh at evry.inserm.fr>
- Date: Mon, 16 May 2011 07:28:16 -0400 (EDT)
Mathematica uses the Needleman-Wunsch algorithm, common to many bioinformatics programs that carry out sequence alignment. Leigh Le 15/05/2011 13:04, James Stein a =C3=A9crit : > I've forgotten most of what I used to know in this area (the early days of > shotgun sequencing), but I recall that, back in the '90's, there were many > algorithms for aligning DNA and, I presume, proteins. (Most of my work was > with DNA, and aligned protein sequences derived from the aligned DNA). Knuth > and others developed nifty algorithms for sequence alignment, some based on > a metric defining the "distance" between two sequences, others base on the > minimal number of atomic changes needed to convert one sequence to the > other. It is not at all clear to me that Mathematica's sequence alignment > algorithm, whatever it is, would be appropriate for bio nformatics. > > On Fri, May 13, 2011 at 3:26 AM, Matteo Pendleton<znfinger at gmail.com>wrote: > >> I'm trying to do some bioinformatics work in Mathematica and I've run up >> against a bit of a roadblock regarding code efficiency. I'm doing pairwise >> protein sequence alignments and I've written a nice little function that >> takes two sequences and returns the optimal alignment. The trouble is that >> it's slow. The reason >> it's slow is because changing the scoring table to the proper "BLOSUM80" >> slows the operation down horribly. >> >> Assuming: >> >> seqa = >> >> "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARDLETTVVTIYFDYWGQGTLVTVSS"; >> seqb = >> >> "QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGRINPNSGGTNYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYCARDLRRFGGVPYYFDYWGQGTLVTVSS" >> >> While: >> Timing[SequenceAlignment[seqa , seqb , MergeDifferences -> False];] >> >> Out[1]={0., Null} >> >> Changing the scoring table results in this: >> >> Timing[SequenceAlignment[ seqa , seqb , SimilarityRules -> "BLOSUM80" , >> MergeDifferences -> False];] >> >> Out[1]={0.171, Null} >> >> ..and my two strings are very similar. Is there any way to optimize the >> SequenceAlignment function so that it doesn't do this or would it be better >> to create a specialized alignment function based on the underlying linear >> programming so that the scoring table is built in? I'd like to be able to >> run millions of sequences through this function and that's not going to be >> practical if I can only do 5/sec. >> >> Thanks in advance! >>