MathGroup Archive: May 2011 [00265]

[Date Index] [Thread Index] [Author Index]

Re: Protein Sequence Alignment efficiency

To: mathgroup at smc.vnet.net
Subject: [mg118906] Re: Protein Sequence Alignment efficiency
From: James Stein <mathgroup at stein.org>
Date: Sun, 15 May 2011 07:04:49 -0400 (EDT)

I've forgotten most of what I used to know in this area (the early days of
shotgun sequencing), but I recall that, back in the '90's, there were many
algorithms for aligning DNA and, I presume, proteins. (Most of my work was
with DNA, and aligned protein sequences derived from the aligned DNA). Knuth
and others developed nifty algorithms for sequence alignment, some based on
a metric defining the "distance" between two sequences, others base on the
minimal number of atomic changes needed to convert one sequence to the
other. It is not at all clear to me that Mathematica's sequence alignment
algorithm, whatever it is, would be appropriate for bio nformatics.

On Fri, May 13, 2011 at 3:26 AM, Matteo Pendleton <znfinger at gmail.com>wrote:

> I'm trying to do some bioinformatics work in Mathematica and I've run up
> against a bit of a roadblock regarding code efficiency. I'm doing pairwise
> protein sequence alignments and I've written a nice little function that
> takes two sequences and returns the optimal alignment. The trouble is that
> it's slow. The reason
> it's slow is because changing the scoring table to the proper "BLOSUM80"
> slows the operation down horribly.
>
> Assuming:
>
> seqa =
>
> "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARDLETTVVTIYFDYWGQGTLVTVSS";
> seqb =
>
> "QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGRINPNSGGTNYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYCARDLRRFGGVPYYFDYWGQGTLVTVSS"
>
> While:
> Timing[SequenceAlignment[seqa , seqb , MergeDifferences -> False];]
>
> Out[1]={0., Null}
>
> Changing the scoring table results in this:
>
> Timing[SequenceAlignment[ seqa , seqb , SimilarityRules -> "BLOSUM80" ,
> MergeDifferences -> False];]
>
> Out[1]={0.171, Null}
>
> ..and my two strings are very similar. Is there any way to optimize the
> SequenceAlignment function so that it doesn't do this or would it be better
> to create a specialized alignment function based on the underlying linear
> programming so that the scoring table is built in? I'd like to be able to
> run millions of sequences through this function and that's not going to be
> practical if I can only do 5/sec.
>
> Thanks in advance!
>

Prev by Date: Re: Seeing a solution of a differential equation as it run, using EvaluationMonitor

Next by Date: Re: Maintaining the order of terms when adding symbolic expressions

Previous by thread: Re: Protein Sequence Alignment efficiency

Next by thread: Re: Protein Sequence Alignment efficiency