Protein Sequence Alignment efficiency
- To: mathgroup at smc.vnet.net
- Subject: [mg118852] Protein Sequence Alignment efficiency
- From: Matteo Pendleton <znfinger at gmail.com>
- Date: Fri, 13 May 2011 06:26:47 -0400 (EDT)
I'm trying to do some bioinformatics work in Mathematica and I've run up against a bit of a roadblock regarding code efficiency. I'm doing pairwise protein sequence alignments and I've written a nice little function that takes two sequences and returns the optimal alignment. The trouble is that it's slow. The reason it's slow is because changing the scoring table to the proper "BLOSUM80" slows the operation down horribly. Assuming: seqa = "QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARDLETTVVTIYFDYWGQGTLVTVSS"; seqb = "QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGRINPNSGGTNYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYCARDLRRFGGVPYYFDYWGQGTLVTVSS" While: Timing[SequenceAlignment[seqa , seqb , MergeDifferences -> False];] Out[1]={0., Null} Changing the scoring table results in this: Timing[SequenceAlignment[ seqa , seqb , SimilarityRules -> "BLOSUM80" , MergeDifferences -> False];] Out[1]={0.171, Null} ..and my two strings are very similar. Is there any way to optimize the SequenceAlignment function so that it doesn't do this or would it be better to create a specialized alignment function based on the underlying linear programming so that the scoring table is built in? I'd like to be able to run millions of sequences through this function and that's not going to be practical if I can only do 5/sec. Thanks in advance!