MathGroup Archive 2011

Re: Importing into Mathematica from URL (PubMed)

  • To: mathgroup at smc.vnet.net
  • Subject: [mg117745] Re: Importing into Mathematica from URL (PubMed)
  • From: telefunkenvf14 <rgorka at gmail.com>
  • Date: Wed, 30 Mar 2011 04:18:14 -0500 (EST)
  • References: <imsgvp$5do$1@smc.vnet.net>

On Mar 29, 6:49 am, "Hans Michel" <hmic... at cox.net> wrote:
> I don't know the field, but the problem you are experiencing has nothing to
> do with Mathematica or with one site providing better data than another.
> This just returned the FASTA text amino acid sequence:
>
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "Text"]
> Or
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "FASTA"]
> for straight to "FASTA" format import.
>
> This returned the GenPept full format:
>
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=gpwithparts&val=3336842&extrafeat=0&maxplex=1", "Text"]
>
> This site needs to run in a browser; what you see in the browser seems to be
> generated by JavaScript. Mathematica only imports the raw source, not the
> final page as rendered by a web browser with its JavaScript engine on.
>
> If you view your original site ("http://www.ncbi.nlm.nih.gov/protein/CAA76847.1")
> through a web browser, the "Send to:" link and drop-down image will give you
> the opportunity to save what you are viewing to a file. I just copied the
> representative URL from the save-to-file result. This did not involve
> Mathematica in any way.
>
> You may also get this in XML format if you play around with the
> viewer.fcgi query strings in the URL.
>
> $Version
> "7.0 for Microsoft Windows (32-bit) (February 18, 2009)"
>
> Hans
>
> -----Original Message-----
> From: Thomas Dowling [mailto:thomasgdowl... at gmail.com]
> Sent: Thursday, March 24, 2011 6:27 AM
> Subject:  Importing into Mathematica from URL (PubMed)
>
> Hello,
>
> Does anyone know how to import a protein sequence from the PubMed database
> into Mathematica, or
> can anyone advise me as to where I am going wrong in the following approach?
>
> 1. As an example, I'd like to import the data for BSA (bovine serum albumin)
> from the following site:
>
> http://www.ncbi.nlm.nih.gov/protein/CAA76847.1
>
> I wish to import all meaningful data from this page, but the bit I am
> particularly interested in is the amino acid sequence
>
> (in one-letter code) which is right at the end (between ORIGIN and //):
>
> ORIGIN
>         1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
>        61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
>       121 ernecflshk ddspdlpklk pdpntlcdef kadekkfwgk ylyeiarrhp yfyapellyy
>       181 ankyngvfqe ccqaedkgac llpkietmre kvltssarqr lrcasiqkfg eralkawsva
>       241 rlsqkfpkae fvevtklvtd ltkvhkecch gdllecaddr adlakyicdn qdtissklke
>       301 ccdkplleks hciaevekda ipenlpplta dfaedkdvck nyqeakdafl gsflyeysrr
>       361 hpeyavsvll rlakeyeatl eeccakddph acystvfdkl khlvdepqnl ikqncdqfek
>       421 lgeygfqnal ivrytrkvpq vstptlvevs rslgkvgtrc ctkpesermp ctedylslil
>       481 nrlcvlhekt pvsekvtkcc teslvnrrpc fsaltpdety vpkafdeklf tfhadictlp
>       541 dtekqikkqt alvellkhkp kateeqlktv menfvafvdk ccaaddkeac favegpklvv
>       601 stqtala
> //
>
> 2.  Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Elements"]
>
> gives the following
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject }
>
> However,
>
> Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Data"];
>
> only gives what looks like a load of rubbish, and there is NO SEQUENCE
>
> 3.  Trying in FASTA format (which is supported by Mathematica)
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Elements"]
>
> gives
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject }
>
> but
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Data"]
>
> also gives only rubbish.
>
>  (Changing the Element to FullData or Plaintext has no effect)
>
> 4. NO SUCH PROBLEMS OCCUR with UniProt (Swiss-Prot).
>
> Import["http://www.uniprot.org/uniprot/P02769","Elements"]
>
> Import["http://www.uniprot.org/uniprot/P02769","Data"]
>
> (Note that the sequence is now imported)
>
> Or, BEST, using the FASTA format and importing from this site
>
> Flatten@Characters@StringReplace[
>   Import["http://www.uniprot.org/uniprot/P02769.fasta", "Plaintext"],
>   Whitespace -> ""] // Short
>
> giving
>
> {M,K,W,V,T,F,I,S,<<592>>,S,T,Q,T,A,L,A}
>
> which is where I'd like to get to.
>
> So my question is the following: What is so unusual about the PubMed site,
> and what am I doing wrong in the approach I am taking? It would
> be a great advantage to me to be able to import from PubMed in the manner
> shown above for UniProt.
>
> Thanks for your help
>
> Tom Dowling

Tom (and others):

I played around with this and discovered an API of sorts for the site.
While I'm not sure of *exactly* what you want to grab from the page,
consider the following:

Import["http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CAA76847.1&rettype=fasta&retmode=text", "Text"]
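
If the goal is the list of one-letter codes from the original question,
the FASTA text that comes back still needs its ">" header line stripped.
Here is a rough sketch of that step; fastaResidues is a name I made up
for illustration (it is not a built-in), and I am assuming the reply is
plain FASTA text with header lines that start with ">".

```mathematica
(* fastaResidues: hypothetical helper, not part of Mathematica. *)
(* Takes FASTA text and returns the residues as a list of characters. *)
fastaResidues[fasta_String] := Module[{lines, seqLines},
  (* one entry per non-empty line *)
  lines = StringSplit[fasta, "\n"];
  (* drop header lines, which begin with ">" *)
  seqLines = Select[lines, ! StringMatchQ[#, ">" ~~ ___] &];
  (* join the sequence lines, remove stray whitespace, split into letters *)
  Characters[StringReplace[StringJoin[seqLines], Whitespace -> ""]]
]

(* Offline check on a small hand-made record *)
fastaResidues[">demo record\nMKWV\nTFIS\n"]
(* {"M", "K", "W", "V", "T", "F", "I", "S"} *)
```

With a live connection, wrapping the Import call above as
fastaResidues[Import[..., "Text"]] // Short should give output in the
same shape as the UniProt example in the original question.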

I discovered this trick by digging for info on 'url encoding',
specifically the page below (I altered its example URL by removing the two
ID numbers they provided, along with the comma separating them, and
then plugging in your reference number, which I take to be the accession number):

http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records

"Downloading Full Records" is a section from a 'book' on E-Utilities
(their description):

(...The E-utilities use a fixed URL syntax that translates a standard
set of input parameters into the values necessary for various NCBI
software components to search for and retrieve the requested data. The
E-utilities are therefore the structured interface to the Entrez
system, which currently includes 38 databases covering a variety of
biomedical data, including nucleotide and protein sequences, gene
records, three-dimensional molecular structures, and the biomedical
literature.)
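
Since the syntax is fixed, the URL can be assembled from a list of
parameters instead of being typed by hand. A minimal sketch follows;
efetchBase and efetchURL are names I invented for illustration, and the
helper assumes the values need no percent-encoding.

```mathematica
(* efetchBase and efetchURL are hypothetical names, not built-ins. *)
efetchBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

(* Build "base?key1=value1&key2=value2..." from a list of string rules. *)
(* Assumption: keys and values are already URL-safe. *)
efetchURL[params : {__Rule}] :=
  efetchBase <> "?" <>
   StringJoin[Riffle[Map[First[#] <> "=" <> Last[#] &, params], "&"]]

(* Reproduces the URL used in the Import call earlier in this message *)
efetchURL[{"db" -> "nuccore", "id" -> "CAA76847.1",
   "rettype" -> "fasta", "retmode" -> "text"}]
(* "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CAA76847.1&rettype=fasta&retmode=text" *)
```

The result can be passed straight to Import[..., "Text"], and changing a
single rule (say "rettype") is enough to ask for a different format.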

Hope this helps. Most data sites have some sort of URL encoding API
these days, a good thing for Mathematica users. :D

-RG

