Re: Importing into Mathematica from URL (PubMed)
- To: mathgroup at smc.vnet.net
- Subject: [mg117745] Re: Importing into Mathematica from URL (PubMed)
- From: telefunkenvf14 <rgorka at gmail.com>
- Date: Wed, 30 Mar 2011 04:18:14 -0500 (EST)
- References: <imsgvp$5do$1@smc.vnet.net>
On Mar 29, 6:49 am, "Hans Michel" <hmic... at cox.net> wrote:
> I don't know the field, but the problem you are experiencing has nothing to
> do with Mathematica or with one site providing better data than another.
>
> This just returned the FASTA text amino acid sequence:
>
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "Text"]
>
> Or
>
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=fasta&val=3336842&extrafeat=0&maxplex=1", "FASTA"]
>
> for straight-to-"FASTA" format import.
>
> This returned the GenPept full format:
>
> Import["http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal&sendto=on&log$=seqview&db=protein&dopt=gpwithparts&val=3336842&extrafeat=0&maxplex=1", "Text"]
>
> This site needs to run in a browser; what you see in the browser seems to be
> code that is generated through JavaScript. Mathematica imports only the raw
> source, not the final source as rendered by a web browser with its
> JavaScript engine on.
>
> If you view your original site through a web browser,
> "http://www.ncbi.nlm.nih.gov/protein/CAA76847.1", the "Send to:" link and
> drop-down image will give you the opportunity to save what you are viewing to
> a file. I just copied the representative URL from the save-to-file result.
> This did not involve Mathematica in any way.
>
> You may also get this in XML format if you play around with the
> viewer.fcgi? query strings.
>
> $Version
> "7.0 for Microsoft Windows (32-bit) (February 18, 2009)"
>
> Hans
>
> -----Original Message-----
> From: Thomas Dowling [mailto:thomasgdowl... at gmail.com]
> Sent: Thursday, March 24, 2011 6:27 AM
> Subject: Importing into Mathematica from URL (PubMed)
>
> Hello,
>
> Does anyone know how to import a protein sequence from the PubMed database
> into Mathematica, or can anyone advise me as to where I am going wrong in
> the following approach?
>
> 1. As an example, I'd like to import the data for BSA (bovine serum albumin)
> from the following site:
>
> http://www.ncbi.nlm.nih.gov/protein/CAA76847.1
>
> I wish to import all meaningful data from this page, but the bit I am
> particularly interested in is the amino acid sequence (in one-letter code),
> which is right at the end (between ORIGIN and //):
>
> ORIGIN
>         1 mkwvtfisll llfssaysrg vfrrdthkse iahrfkdlge ehfkglvlia fsqylqqcpf
>        61 dehvklvnel tefaktcvad eshagceksl htlfgdelck vaslretygd madccekqep
>       121 ernecflshk ddspdlpklk pdpntlcdef kadekkfwgk ylyeiarrhp yfyapellyy
>       181 ankyngvfqe ccqaedkgac llpkietmre kvltssarqr lrcasiqkfg eralkawsva
>       241 rlsqkfpkae fvevtklvtd ltkvhkecch gdllecaddr adlakyicdn qdtissklke
>       301 ccdkplleks hciaevekda ipenlpplta dfaedkdvck nyqeakdafl gsflyeysrr
>       361 hpeyavsvll rlakeyeatl eeccakddph acystvfdkl khlvdepqnl ikqncdqfek
>       421 lgeygfqnal ivrytrkvpq vstptlvevs rslgkvgtrc ctkpesermp ctedylslil
>       481 nrlcvlhekt pvsekvtkcc teslvnrrpc fsaltpdety vpkafdeklf tfhadictlp
>       541 dtekqikkqt alvellkhkp kateeqlktv menfvafvdk ccaaddkeac favegpklvv
>       601 stqtala
> //
>
> 2. Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Elements"]
>
> gives the following:
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}
>
> However,
>
> Import["http://www.ncbi.nlm.nih.gov/protein/CAA76847.1","Data"];
>
> only gives what looks like a load of rubbish, and there is NO SEQUENCE.
>
> 3. Trying in FASTA format (which is supported by Mathematica):
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Elements"]
>
> gives
>
> {Data,FullData,Hyperlinks,Images,ImageURLs,Plaintext,Source,Title,XMLObject}
>
> but
>
> Import["http://www.ncbi.nlm.nih.gov/protein/3336842?report=fasta","Data"]
>
> also gives only rubbish.
>
> (Changing the element to FullData or Plaintext has no effect.)
>
> 4. NO SUCH PROBLEMS OCCUR with UniProt (Swiss-Prot).
>
> Import["http://www.uniprot.org/uniprot/P02769","Elements"]
>
> Import["http://www.uniprot.org/uniprot/P02769","Data"]
>
> (Note that the sequence is now imported.)
>
> Or, BEST, using the FASTA format and importing from this site:
>
> Flatten@Characters@StringReplace[Import["http://www.uniprot.org/uniprot/P02769.fasta", "Plaintext"], Whitespace -> ""]//Short
>
> giving
>
> {M,K,W,V,T,F,I,S,<<592>>,S,T,Q,T,A,L,A}
>
> which is where I'd like to get to.
>
> So my question is the following: What is so unusual about the PubMed site,
> and what am I doing wrong in the approach I am taking? It would
> be a great advantage to me to be able to import from PubMed in the manner
> shown above for UniProt.
>
> Thanks for your help
>
> Tom Dowling

Tom (and others):

I played around with this and discovered an API of sorts for the site.
While I'm not sure of *exactly* what you want to grab from the page,
consider the following:

Import["http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=CAA76847.1&&rettype=fasta&retmode=text","Text"]

I discovered this trick by digging for info on 'url encoding',
specifically the following (which I altered by removing the two
numbers they provided, along with the comma separating them, and then
plugging in your ref number, or whatever the hell it means!):

http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records

"Downloading Full Records" is a section from a 'book' on E-Utilities
(their description):

(...The E-utilities use a fixed URL syntax that translates a standard
set of input parameters into the values necessary for various NCBI
software components to search for and retrieve the requested data. The
E-utilities are therefore the structured interface to the Entrez
system, which currently includes 38 databases covering a variety of
biomedical data, including nucleotide and protein sequences, gene
records, three-dimensional molecular structures, and the biomedical
literature.)

Hope this helps. Most data sites have some sort of URL encoding API
these days, a good thing for Mathematica users. :D

-RG
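
For anyone who wants to go from the efetch URL to the list of one-letter codes Tom was after, here is a rough sketch. The helper names efetchURL and fastaResidues are made up for illustration, not part of Mathematica or of E-utilities. The "db" -> "protein" setting is also an assumption on my part (CAA76847.1 is a protein record; the Import above used nuccore), so swap it back if that is what works for you.

```mathematica
(* Base URL taken from the Import call in the post above. *)
efetchBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

(* efetchURL: hypothetical helper that joins "key" -> "value" rules
   into a query string and appends it to the base URL. *)
efetchURL[params_List] :=
  efetchBase <> "?" <>
   StringJoin[Riffle[(First[#] <> "=" <> Last[#]) & /@ params, "&"]];

(* fastaResidues: hypothetical helper that drops the ">" header line(s),
   removes all whitespace, and splits the rest into single characters. *)
fastaResidues[text_String] :=
  Module[{lines, body},
   lines = StringSplit[text, "\n"];
   body = Select[lines, ! StringMatchQ[#, ">" ~~ ___] &];
   Characters[StringReplace[StringJoin[body], Whitespace -> ""]]];

(* Offline check on a made-up two-line record. *)
fastaResidues[">demo record\nMKWV\nTFIS"]
(* {"M", "K", "W", "V", "T", "F", "I", "S"} *)

(* Build the request; "db" -> "protein" is an assumption, see above. *)
url = efetchURL[{"db" -> "protein", "id" -> "CAA76847.1",
    "rettype" -> "fasta", "retmode" -> "text"}];

(* Needs network access; uncomment to fetch and view the residue list. *)
(* Short[fastaResidues[Import[url, "Text"]]] *)
```

Building the query from rules just makes it easier to change one parameter (say, rettype) without retyping the whole URL. Import[url, "FASTA"] may be all you need if you only want the sequence string.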