MathGroup Archive 2011

Re: Importing into Mathematica from URL (PubMed)

  • To: mathgroup at smc.vnet.net
  • Subject: [mg117664] Re: Importing into Mathematica from URL (PubMed)
  • From: Armand Tamzarian <mike.honeychurch at gmail.com>
  • Date: Tue, 29 Mar 2011 06:56:19 -0500 (EST)
  • References: <imfadj$iu7$1@smc.vnet.net>

On Mar 24, 10:37 pm, Thomas Dowling <thomasgdowl... at gmail.com> wrote:
> Hello,
>
> Does anyone know how to import a protein sequence from the PubMed database
> into Mathematica, or
> can anyone advise me as to where I am going wrong in the following approach?
>
> 1. As an example, I'd like to import the data for BSA (bovine serum albumin)
> from the following site:
>
> http://www.ncbi.nlm.nih.gov/protein/CAA76847.1
>
> I wish to import all meaningful data from this page, but the bit I am
> particularly interested in is the amino acid sequence
>
> (in one-letter code) which is right at the end (between ORIGIN and //):
>
> ORIGIN
>         1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
>        61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
>       121 ernecflshk ddspdlpklk pdpntlcdef kadekkfwgk ylyeiarrhp yfyapellyy
>       181 ankyngvfqe ccqaedkgac llpkietmre kvltssarqr lrcasiqkfg eralkawsva
>       241 rlsqkfpkae fvevtklvtd ltkvhkecch gdllecaddr adlakyicdn qdtissklke
>       301 ccdkplleks hciaevekda ipenlpplta dfaedkdvck nyqeakdafl gsflyeysrr
>       361 hpeyavsvll rlakeyeatl eeccakddph acystvfdkl khlvdepqnl ikqncdqfek
>       421 lgeygfqnal ivrytrkvpq vstptlvevs rslgkvgtrc ctkpesermp ctedylslil
>       481 nrlcvlhekt pvsekvtkcc teslvnrrpc fsaltpdety vpkafdeklf tfhadictlp
>       541 dtekqikkqt alvellkhkp kateeqlktv menfvafvdk ccaaddkeac favegpklvv
>       601 stqtala
> //
>
> 2.  Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Elements"]
>
> gives the following
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}
>
> However,
>
> Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Data"];
>
> only gives what looks like a load of rubbish, and there is NO SEQUENCE
>
> 3.  Trying in FASTA format (which is  supported by Mathematica)
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Elements"]
>
> gives
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}
>
> but
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Data"]
>
> also gives only rubbish.
>
>  (Changing the Element to FullData or Plaintext has no effect)
>
> 4. NO SUCH PROBLEMS OCCUR with Uniprot (Swiss-Prot).
>
> Import["http://www.uniprot.org/uniprot/P02769","Elements"]
>
> Import["http://www.uniprot.org/uniprot/P02769","Data"]
>
> (Note that the sequence is now imported)
>
> Or, BEST, using the FASTA format and importing from this site
>
> Flatten@Characters@StringReplace[Import["http://www.uniprot.org/uniprot/P02769.fasta", "Plaintext"], Whitespace -> ""] // Short
>
> giving
>
> {M,K,W,V,T,F,I,S,<<592>>,S,T,Q,T,A,L,A}
>
> which is where I'd like to get to.
>
> So my question is the following: What is so unusual about the PubMed site,
> and what am I doing wrong in the approach I am taking? It would
> be a great advantage to me to be able to import from PubMed in the manner
> shown above for UniProt.
>
> Thanks for your help
>
> Tom Dowling


Import[..., "Data"] looks for data held in HTML tables. On the PubMed
page the sequence is buried/hidden in DIV elements rather than being
"visible" in a table. In the source of the UniProt page, by contrast,
the amino acid sequence is contained in a table, so the algorithms
used within Import are able to "see" it. So you are not doing anything
wrong; the underlying code of the two web pages is simply structured
differently, even though they may look the same or similar in your
browser.
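
Since the "Data" element will not find the sequence, one workaround is to
work on the record text directly. The following is only a minimal sketch,
assuming the flat-file text (the part containing ORIGIN ... //) is already
available as a string, however it was obtained. The name extractSequence is
a hypothetical helper, not a built-in, and the sample string is just the
first two rows quoted in the question.

```mathematica
(* Sample input: the first two rows of the ORIGIN block quoted in the
   question, trimmed for brevity. Illustrative only, not the full record. *)
record = "ORIGIN
        1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
       61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
//";

(* extractSequence is a hypothetical helper name (an assumption, not a
   built-in). It takes the text between ORIGIN and //, removes every
   character that is not a letter (position numbers, spaces, newlines),
   and returns a list of upper-case one-letter codes. It returns {} when
   no ORIGIN ... // block is found. *)
extractSequence[text_String] :=
 Module[{block},
  block = StringCases[text, "ORIGIN" ~~ Shortest[s__] ~~ "//" :> s, 1];
  If[block === {},
   {},
   Characters[
    ToUpperCase[
     StringReplace[First[block], Except[LetterCharacter] -> ""]]]]]

extractSequence[record] // Short
```

The result has the same shape as the list produced from the UniProt FASTA
file in the question, so the rest of an analysis should not need to care
which site the sequence came from.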

Mike

