MathGroup Archive: May 2011 [00286]

[Date Index] [Thread Index] [Author Index]

Re: Protein Sequence Alignment efficiency

To: mathgroup at smc.vnet.net
Subject: [mg118932] Re: Protein Sequence Alignment efficiency
From: leigh pascoe <leigh at evry.inserm.fr>
Date: Mon, 16 May 2011 07:28:16 -0400 (EDT)

Mathematica uses the Needleman-Wunsch algorithm, common to many
bioinformatics programs that carry out sequence alignment.

Leigh

Le 15/05/2011 13:04, James Stein a =C3=A9crit :
> I've forgotten most of what I used to know in this area (the early days of
> shotgun sequencing), but I recall that, back in the '90's, there were many
> algorithms for aligning DNA and, I presume, proteins. (Most of my work was
> with DNA, and aligned protein sequences derived from the aligned DNA). Knuth
> and others developed nifty algorithms for sequence alignment, some based on
> a metric defining the "distance" between two sequences, others base on the
> minimal number of atomic changes needed to convert one sequence to the
> other. It is not at all clear to me that Mathematica's sequence alignment
> algorithm, whatever it is, would be appropriate for bio nformatics.
>
> On Fri, May 13, 2011 at 3:26 AM, Matteo Pendleton<znfinger at gmail.com>wrote:
>
>> I'm trying to do some bioinformatics work in Mathematica and I've run up
>> against a bit of a roadblock regarding code efficiency. I'm doing pairwise
>> protein sequence alignments and I've written a nice little function that
>> takes two sequences and returns the optimal alignment. The trouble is that
>> it's slow. The reason
>> it's slow is because changing the scoring table to the proper "BLOSUM80"
>> slows the operation down horribly.
>>
>> Assuming:
>>
>> seqa =
>>
>> "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARDLETTVVTIYFDYWGQGTLVTVSS";
>> seqb =
>>
>> "QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGRINPNSGGTNYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYCARDLRRFGGVPYYFDYWGQGTLVTVSS"
>>
>> While:
>> Timing[SequenceAlignment[seqa , seqb , MergeDifferences ->  False];]
>>
>> Out[1]={0., Null}
>>
>> Changing the scoring table results in this:
>>
>> Timing[SequenceAlignment[ seqa , seqb , SimilarityRules ->  "BLOSUM80" ,
>> MergeDifferences ->  False];]
>>
>> Out[1]={0.171, Null}
>>
>> ..and my two strings are very similar. Is there any way to optimize the
>> SequenceAlignment function so that it doesn't do this or would it be better
>> to create a specialized alignment function based on the underlying linear
>> programming so that the scoring table is built in? I'd like to be able to
>> run millions of sequences through this function and that's not going to be
>> practical if I can only do 5/sec.
>>
>> Thanks in advance!
>>

Prev by Date: Plot image changes size during animation

Next by Date: TeXForm: slight change in behaviour from version 7 to version 8

Previous by thread: Re: Protein Sequence Alignment efficiency

Next by thread: How to use Manipulate->Snapshot output for further calculations?