MathGroup Archive 2010

Re: Import html

  • To: mathgroup at smc.vnet.net
  • Subject: [mg108470] Re: Import html
  • From: "Hans Michel" <hmichel at cox.net>
  • Date: Fri, 19 Mar 2010 02:47:45 -0500 (EST)
  • References: <hnsrtt$5ks$1@smc.vnet.net>

This worked

In[2]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"XHTML", 
"Hyperlinks"}]
Out[2]= 
{http://www.paginegialle.it,
 http://www.paginegialle.it,
 http://www.paginegialle.it/index_numero.html,
 http://www.paginegialle.it/cat/pagine_gialle_naviga.html,
 http://www.paginegialle.it/index_video.html,
 /ascensoriromamir.a.m/segnala,
 /pg/cgi/vcf.cgi?cc=758980621&cl=1,
 http://www.paginegialle.it/ascensoriromamir.a.m,
 mailto:fra.mirabella at tiscali.it,
 http://www.paginegialle.it/ascensoriromamir.a.m/mappa,
 http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,
 http://www.paginegialle.it/ascensoriromamir.a.m/percorso,
 /ascensoriromamir.a.m,
 /ascensoriromamir.a.m/contatto,,
 http://www.paginebianche.it/,
 http://www.tuttocitta.it/,
 http://www.paginegiallevisual.it/,
 http://www.paginegiallenav.it/,
 http://www.892424.it/,
 http://www.seat.it,
 http://www.europages.it/,
 http://www.seatconvoi.it/,
 http://www.convoimagazineseat.it/,
 http://www.seatcorporateuniversity.it/,
 http://www.giallopromo.it/,
 http://www.kompassitalia.it/,
 http://www.consodata.it/,
 http://www.Lineaffari.com/,
 http://www.118000.fr/,
 http://www.11880.com/,
 http://www.thomsonlocal.com/,
 http://www.11811.es/,
 http://www.alberghieturismo.it/,
 http://www.jobville.it/,
 http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,
 http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,
 http://www.paginegialle.it/pg/extra/privacy.html,
 http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

In[3]:= $Version
Out[3]= 7.0 for Microsoft Windows (32-bit) (November 10, 2008)

Since the extension of this file was not .htm or .html, and the file 
included an SGML DOCTYPE declaration, I don't think the imported file was 
routed to the correct parser. Apparently the current link can also be parsed 
successfully using {"HTML","XMLObject"}.
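
As a rough sketch of what the XMLObject route might look like (not tested 
against this page, and assuming the anchor tags come through as XMLElement 
expressions with a plain "href" attribute), the links can be collected with 
Cases. The names xml and links are only illustrative.

```mathematica
(* Import the page as a symbolic XML tree, forcing the HTML parser. *)
xml = Import["http://www.paginegialle.it/ascensoriromamir.a.m",
   {"HTML", "XMLObject"}];

(* Collect the href value of every <a> element at any depth.
   The tag name is compared case-insensitively as a precaution. *)
links = Cases[xml,
   XMLElement[tag_String /; ToLowerCase[tag] === "a",
     {___, "href" -> href_String, ___}, _] :> href,
   Infinity];
```

This should give roughly the same information as the "Hyperlinks" element, 
but the full XMLObject is also there if other attributes are needed.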

Note that it helps to tell Import explicitly what format the file is in, 
as in the following:

In[4]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML", 
"Hyperlinks"}]
Out[4]= 
{http://www.paginegialle.it,
 http://www.paginegialle.it,
 http://www.paginegialle.it/index_numero.html,
 http://www.paginegialle.it/cat/pagine_gialle_naviga.html,
 http://www.paginegialle.it/index_video.html,
 /ascensoriromamir.a.m/segnala,
 /pg/cgi/vcf.cgi?cc=758980621&cl=1,
 http://www.paginegialle.it/ascensoriromamir.a.m,
 mailto:fra.mirabella at tiscali.it,
 http://www.paginegialle.it/ascensoriromamir.a.m/mappa,
 http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,
 http://www.paginegialle.it/ascensoriromamir.a.m/percorso,
 /ascensoriromamir.a.m,
 /ascensoriromamir.a.m/contatto,,
 http://www.paginebianche.it/,
 http://www.tuttocitta.it/,
 http://www.paginegiallevisual.it/,
 http://www.paginegiallenav.it/,
 http://www.892424.it/,
 http://www.seat.it,
 http://www.europages.it/,
 http://www.seatconvoi.it/,
 http://www.convoimagazineseat.it/,
 http://www.seatcorporateuniversity.it/,
 http://www.giallopromo.it/,
 http://www.kompassitalia.it/,
 http://www.consodata.it/,
 http://www.Lineaffari.com/,
 http://www.118000.fr/,
 http://www.11880.com/,
 http://www.thomsonlocal.com/,
 http://www.11811.es/,
 http://www.alberghieturismo.it/,
 http://www.jobville.it/,
 http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,
 http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,
 http://www.paginegialle.it/pg/extra/privacy.html,
 http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}
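
If many pages have to be processed and it is not known in advance which 
ones confuse the automatic format detection, one option is to try the 
explicit formats in turn. The following is only a sketch under that 
assumption: importHyperlinks is a hypothetical helper name, not a built-in, 
and the default format order is a guess based on the two calls shown in this 
post.

```mathematica
(* importHyperlinks is a hypothetical helper, not a built-in function.
   It tries each format in order and returns the first list of links. *)
importHyperlinks[url_String, formats_List : {"HTML", "XHTML"}] :=
 Module[{result = $Failed},
  Do[
   (* Quiet suppresses messages such as Read::readt from failed attempts. *)
   result = Quiet[Import[url, {fmt, "Hyperlinks"}]];
   If[ListQ[result], Break[]],
   {fmt, formats}];
  (* Return $Failed if none of the formats produced a list. *)
  If[ListQ[result], result, $Failed]]

importHyperlinks["http://www.paginegialle.it/ascensoriromamir.a.m"]
```

Naming the format explicitly skips the guess Import would otherwise make 
about the file type, which is where the original call seems to go wrong.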

Hans

"Scipione Dal Ferro" <scipionedalferro at yahoo.it> wrote in message 
news:hnsrtt$5ks$1 at smc.vnet.net...
> Hi there,
>
> I use Import to parse the hyperlinks of many similar html pages without 
> any problem, but for a few pages (such as the example in the subject) it 
> fails.
> In more detail, here is the example with the result:
>
> In[1]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", 
> "Hyperlinks"]
>
> Read::readt: Invalid input found when reading <!DOCTYPE html PUBLIC 
> "-//W3C//DTD XHTML 1.0 Transitional//EN" 
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> from 
> C:\Users\scipione.dalferro\AppData\Local\Temp\mFA3E.tmp\ascensoriromamir.a.m. 
>  >>
>
> Out[1]= $Failed
>
> The error message states that there is invalid input; however, the page 
> can be opened correctly with a browser.
>
> I tried changing the element to "Source" and others, but with the same 
> result.
> Similar pages work correctly, such as this one:
>
> In[2]:= Import["http://www.paginegialle.it/esis", "Hyperlinks"]
>
> I hope you can help me understand this issue.
>
> Thanks,
> Scipione
> 


