MathGroup Archive: March 2011 [00808]


Importing into Mathematica from URL (PubMed)

To: mathgroup at smc.vnet.net
Subject: [mg117594] Importing into Mathematica from URL (PubMed)
From: Thomas Dowling <thomasgdowling at gmail.com>
Date: Thu, 24 Mar 2011 06:27:17 -0500 (EST)

Hello,

Does anyone know how to import a protein sequence from the PubMed database
into Mathematica, or can anyone advise me as to where I am going wrong in
the following approach?


1. As an example, I'd like to import the data for BSA (bovine serum albumin)
from the following site:


http://www.ncbi.nlm.nih.gov/protein/CAA76847.1


I wish to import all meaningful data from this page, but the bit I am
particularly interested in is the amino acid sequence (in one-letter code),
which is right at the end (between ORIGIN and //):

ORIGIN
        1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
       61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
      121 ernecflshk ddspdlpklk pdpntlcdef kadekkfwgk ylyeiarrhp yfyapellyy
      181 ankyngvfqe ccqaedkgac llpkietmre kvltssarqr lrcasiqkfg eralkawsva
      241 rlsqkfpkae fvevtklvtd ltkvhkecch gdllecaddr adlakyicdn qdtissklke
      301 ccdkplleks hciaevekda ipenlpplta dfaedkdvck nyqeakdafl gsflyeysrr
      361 hpeyavsvll rlakeyeatl eeccakddph acystvfdkl khlvdepqnl ikqncdqfek
      421 lgeygfqnal ivrytrkvpq vstptlvevs rslgkvgtrc ctkpesermp ctedylslil
      481 nrlcvlhekt pvsekvtkcc teslvnrrpc fsaltpdety vpkafdeklf tfhadictlp
      541 dtekqikkqt alvellkhkp kateeqlktv menfvafvdk ccaaddkeac favegpklvv
      601 stqtala
//
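
For reference, once the flat-file text is available as a string, pulling out the residues between ORIGIN and // might look something like the sketch below. This is only a sketch: originSequence is a hypothetical helper name, not a built-in, and it assumes the record text has already been obtained as a plain string, which is exactly the part that fails for me with this site.

```mathematica
(* Hypothetical helper, not a built-in: extract residues from GenBank-style text *)
originSequence[text_String] := Module[{block},
  (* take the text between the first ORIGIN keyword and the next // *)
  block = StringCases[text, "ORIGIN" ~~ Shortest[s__] ~~ "//" :> s, 1];
  If[block === {},
   {},
   (* drop position numbers and whitespace, then split into one-letter codes *)
   Characters[ToUpperCase[
     StringReplace[First[block],
      (DigitCharacter | WhitespaceCharacter) -> ""]]]]]

(* Small sample copied from the start of the ORIGIN block above *)
sample = "ORIGIN\n        1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf\n       61 dehvklvnel\n//";

originSequence[sample] // Short
```

The helper returns an empty list when no ORIGIN ... // block is found, so it can be applied to arbitrary text without raising an error.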

2.  Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1", "Elements"]

gives the following

{Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}

However,

Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1", "Data"]

only gives what looks like a load of rubbish, and there is NO SEQUENCE

3.  Trying the FASTA format (which is supported by Mathematica)

Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta", "Elements"]

gives

{Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}

but

Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta", "Data"]

also gives only rubbish.

(Changing the element to FullData or Plaintext has no effect.)


4. NO SUCH PROBLEMS OCCUR with UniProt (Swiss-Prot).

Import["http://www.uniprot.org/uniprot/P02769", "Elements"]

Import["http://www.uniprot.org/uniprot/P02769", "Data"]

(Note that the sequence is now imported)

Or, BEST, using the FASTA format and importing from this site

Flatten@Characters@StringReplace[
  Import["http://www.uniprot.org/uniprot/P02769.fasta", "Plaintext"],
  Whitespace -> ""] // Short

giving

{M,K,W,V,T,F,I,S,<<592>>,S,T,Q,T,A,L,A}

which is where I'd like to get to.


So my question is the following: What is so unusual about the PubMed site,
and what am I doing wrong in the approach I am taking? It would be a great
advantage to me to be able to import from PubMed in the manner shown above
for UniProt.


Thanks for your help

Tom Dowling
