MathGroup Archive: May 2011 [00215]

[Date Index] [Thread Index] [Author Index]

Protein Sequence Alignment efficiency

To: mathgroup at smc.vnet.net
Subject: [mg118852] Protein Sequence Alignment efficiency
From: Matteo Pendleton <znfinger at gmail.com>
Date: Fri, 13 May 2011 06:26:47 -0400 (EDT)

I'm trying to do some bioinformatics work in Mathematica and I've run up
against a bit of a roadblock regarding code efficiency. I'm doing pairwise
protein sequence alignments and I've written a nice little function that
takes two sequences and returns the optimal alignment. The trouble is that
it's slow. The reason
it's slow is because changing the scoring table to the proper "BLOSUM80"
slows the operation down horribly.

Assuming:

seqa =
"QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARDLETTVVTIYFDYWGQGTLVTVSS";
seqb =
"QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGRINPNSGGTNYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYCARDLRRFGGVPYYFDYWGQGTLVTVSS"

While:
Timing[SequenceAlignment[seqa , seqb , MergeDifferences -> False];]

Out[1]={0., Null}

Changing the scoring table results in this:

Timing[SequenceAlignment[ seqa , seqb , SimilarityRules -> "BLOSUM80" ,
MergeDifferences -> False];]

Out[1]={0.171, Null}

..and my two strings are very similar. Is there any way to optimize the
SequenceAlignment function so that it doesn't do this or would it be better
to create a specialized alignment function based on the underlying linear
programming so that the scoring table is built in? I'd like to be able to
run millions of sequences through this function and that's not going to be
practical if I can only do 5/sec.

Thanks in advance!

Prev by Date: Re: Notebooks windows lost bug

Next by Date: How to use Manipulate->Snapshot output for further calculations?

Previous by thread: Re: Options, OptionsPattern, and OptionsValues

Next by thread: Re: Protein Sequence Alignment efficiency