Re: Import html
- To: mathgroup at smc.vnet.net
- Subject: [mg108470] Re: Import html
- From: "Hans Michel" <hmichel at cox.net>
- Date: Fri, 19 Mar 2010 02:47:45 -0500 (EST)
- References: <hnsrtt$5ks$1@smc.vnet.net>
This worked:

In[2]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"XHTML", "Hyperlinks"}]

Out[2]= {http://www.paginegialle.it,
http://www.paginegialle.it,
http://www.paginegialle.it/index_numero.html,
http://www.paginegialle.it/cat/pagine_gialle_naviga.html,
http://www.paginegialle.it/index_video.html,
/ascensoriromamir.a.m/segnala,
/pg/cgi/vcf.cgi?cc=758980621&cl=1,
http://www.paginegialle.it/ascensoriromamir.a.m,
mailto:fra.mirabella at tiscali.it,
http://www.paginegialle.it/ascensoriromamir.a.m/mappa,
http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,
http://www.paginegialle.it/ascensoriromamir.a.m/percorso,
/ascensoriromamir.a.m,
/ascensoriromamir.a.m/contatto,,
http://www.paginebianche.it/,
http://www.tuttocitta.it/,
http://www.paginegiallevisual.it/,
http://www.paginegiallenav.it/,
http://www.892424.it/,
http://www.seat.it,
http://www.europages.it/,
http://www.seatconvoi.it/,
http://www.convoimagazineseat.it/,
http://www.seatcorporateuniversity.it/,
http://www.giallopromo.it/,
http://www.kompassitalia.it/,
http://www.consodata.it/,
http://www.Lineaffari.com/,
http://www.118000.fr/,
http://www.11880.com/,
http://www.thomsonlocal.com/,
http://www.11811.es/,
http://www.alberghieturismo.it/,
http://www.jobville.it/,
http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,
http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,
http://www.paginegialle.it/pg/extra/privacy.html,
http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

In[3]:= $Version

Out[3]= 7.0 for Microsoft Windows (32-bit) (November 10, 2008)

Since the extension of this file was not .htm or .html, and it included an SGML DOCTYPE declaration, I don't think the imported file was routed to the correct parser. Apparently the current link can be successfully parsed using {"HTML","XMLObject"}.
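Since automatic format detection can misroute pages like this one, a possible workaround when processing many pages is to name the format explicitly and try a second format if the first fails. The following is only a minimal sketch under that assumption: importHyperlinks is a hypothetical helper name (not a built-in), and the choice and order of the fallback formats is an assumption rather than something tested across versions.

```mathematica
(* Hypothetical helper, not a built-in: try each explicit format in turn
   and return the first hyperlink list obtained, or $Failed if none works. *)
importHyperlinks[url_String, formats_List : {"HTML", "XHTML"}] :=
 Module[{result = $Failed},
  Do[
   (* Check yields $Failed if Import issues a message; Quiet hides it. *)
   result = Quiet@Check[Import[url, {fmt, "Hyperlinks"}], $Failed];
   If[ListQ[result], Break[]],
   {fmt, formats}];
  If[ListQ[result], result, $Failed]]

(* Example call on the page from this thread *)
importHyperlinks["http://www.paginegialle.it/ascensoriromamir.a.m"]
```

Passing the formats as a list keeps the fallback order easy to change, for instance importHyperlinks[url, {"XHTML", "HTML"}].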
Please note that it helps to tell the application/function what the file is, as in the following:

In[4]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML", "Hyperlinks"}]

Out[4]= {http://www.paginegialle.it,
http://www.paginegialle.it,
http://www.paginegialle.it/index_numero.html,
http://www.paginegialle.it/cat/pagine_gialle_naviga.html,
http://www.paginegialle.it/index_video.html,
/ascensoriromamir.a.m/segnala,
/pg/cgi/vcf.cgi?cc=758980621&cl=1,
http://www.paginegialle.it/ascensoriromamir.a.m,
mailto:fra.mirabella at tiscali.it,
http://www.paginegialle.it/ascensoriromamir.a.m/mappa,
http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,
http://www.paginegialle.it/ascensoriromamir.a.m/percorso,
/ascensoriromamir.a.m,
/ascensoriromamir.a.m/contatto,,
http://www.paginebianche.it/,
http://www.tuttocitta.it/,
http://www.paginegiallevisual.it/,
http://www.paginegiallenav.it/,
http://www.892424.it/,
http://www.seat.it,
http://www.europages.it/,
http://www.seatconvoi.it/,
http://www.convoimagazineseat.it/,
http://www.seatcorporateuniversity.it/,
http://www.giallopromo.it/,
http://www.kompassitalia.it/,
http://www.consodata.it/,
http://www.Lineaffari.com/,
http://www.118000.fr/,
http://www.11880.com/,
http://www.thomsonlocal.com/,
http://www.11811.es/,
http://www.alberghieturismo.it/,
http://www.jobville.it/,
http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,
http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,
http://www.paginegialle.it/pg/extra/privacy.html,
http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

Hans

"Scipione Dal Ferro" <scipionedalferro at yahoo.it> wrote in message
news:hnsrtt$5ks$1 at smc.vnet.net...
> Hi there,
>
> I use Import to parse the hyperlinks of many similar html pages without
> any problem, but for few pages (as for the example in the subject) it
> fails.
> More in detail, here the example with the result:
>
> In[1]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m",
> "Hyperlinks"]
>
> Read::readt: Invalid input found when reading <!DOCTYPE html PUBLIC
> "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> from
> C:\Users\scipione.dalferro\AppData\Local\Temp\mFA3E.tmp\ascensoriromamir.a.m.
> >>
>
> Out[1]= $Failed
>
> The error messages states there's an invalid input; anyway the page can be
> opened with a browser correctly.
>
> I tried changing the Element to "Source" or other, but with the same
> result.
> Similar pages work correctly, as this one for example:
>
> In[2]:=Import["http://www.paginegialle.it/esis", "Hyperlinks"]
>
> Hope u can help me to understand this issue.
>
> Thanks,
> Scipione
>