MathGroup Archive: September 2010 [00628]

Re: Determining a valid URL

To: mathgroup at smc.vnet.net
Subject: [mg112741] Re: Determining a valid URL
From: "Hans Michel" <hmichel at cox.net>
Date: Wed, 29 Sep 2010 04:12:24 -0400 (EDT)

To better understand the behavior of Import, use the options it makes
available. Import supports so many formats that, if you already know the
format you will be dealing with, it is best to state it explicitly using the
properties and options of Import.

For example, you can use Import to get data from an Excel spreadsheet
over HTTP.

So look up $ImportFormats to see which formats your version supports.

Use Import["MyURL", "YourSupportedImportFormatExplicitlyStated"]
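As a rough sketch of what that looks like for a web page (the URL is a
placeholder assumed for illustration, not one from this thread):

```mathematica
url = "http://www.example.com/";  (* placeholder URL, assumed for illustration *)

(* Confirm that the format you intend to name is supported here *)
MemberQ[$ImportFormats, "HTML"]

(* Name the format and the element explicitly instead of letting Import guess *)
src = Import[url, {"HTML", "Source"}];
txt = Import[url, {"HTML", "Plaintext"}];
```

Stating both the format and the element makes it easier to see which step
fails when the result is $Failed.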

When importing HTML, if the web data served up is not well-formed XHTML (or
HTML), the XMLObject will be Null, so "Plaintext", which is derived from a
successful XMLObject, will also fail. "Source", by contrast, is just the
stream as it is consumed by Mathematica.

When the XMLObject fails, you are on your own for parsing the plain text.
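One way to act on this is to try "Plaintext" first and fall back to "Source"
when it fails. The following is only a sketch; fetchPage is a hypothetical
helper name, and it assumes that a successful import of either element
returns a string:

```mathematica
(* Hypothetical helper: prefer "Plaintext", fall back to the raw "Source" *)
fetchPage[url_String] :=
 Module[{txt, src},
  txt = Quiet[Import[url, {"HTML", "Plaintext"}]];
  If[StringQ[txt],
   {"Plaintext", txt},
   (* "Plaintext" failed, so ask for the unparsed stream instead *)
   src = Quiet[Import[url, {"HTML", "Source"}]];
   If[StringQ[src], {"Source", src}, $Failed]]]
```

The first element of the result says which import succeeded, so the caller
knows whether it still has to strip the markup itself. A result of $Failed
means that even the raw stream could not be retrieved.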

The plaintext is from the XMLObject's innerText property and should always
be checked. Sometimes you only get what could be parsed into the XMLObject. 

You should consider using J/Link to call a Java HTML DOM parser and get
the innerText from the DOM.
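A minimal sketch of that idea, assuming Swing's HTMLEditorKit (part of the
standard Java runtime) as the parser; it is a lenient HTML parser rather
than a full DOM, and the text it returns may differ from a browser's
innerText. The name htmlInnerText is hypothetical:

```mathematica
Needs["JLink`"];
InstallJava[];

(* Hypothetical helper: extract the text of an HTML string using Swing's
   HTMLEditorKit, which tolerates markup that is not well formed *)
htmlInnerText[html_String] :=
 JavaBlock[
  Module[{kit, doc, reader},
   kit = JavaNew["javax.swing.text.html.HTMLEditorKit"];
   doc = kit@createDefaultDocument[];
   (* Ignore charset directives in <meta> tags, which can abort the parse *)
   doc@putProperty[MakeJavaObject["IgnoreCharsetDirective"],
    MakeJavaObject[True]];
   reader = JavaNew["java.io.StringReader", html];
   kit@read[reader, doc, 0];
   (* Return the document's text content as a Mathematica string *)
   doc@getText[0, doc@getLength[]]]]
```

It could be fed the raw stream, for example
htmlInnerText[Import[url, {"HTML", "Source"}]], where url is whatever
address you are checking.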

Mathematica is making use of its XML capabilities to try to parse HTML.

Hans
-----Original Message-----
From: Mark Coleman [mailto:markspcoleman at gmail.com] 
Sent: Tuesday, September 28, 2010 5:02 AM
To: mathgroup at smc.vnet.net
Subject: [mg112741] [mg112707] Determining a valid URL

Greetings,

I am exploring the use of Mathematica (v7) to import information from web
sites. One step in my procedure is to validate that a given web
address is valid. One simple way that seems to work most of the time
is to use the Import["my url"] command. An invalid web site (e.g.,
misspelled URL, etc) returns a $Failed. But I've found instances where
valid web sites also return a $Failed if I issue the Import command
with no argument. For these sites, Import seems to work if I use
"Source".

I'm trying to understand this behavior better, so I can write a more
general URL checker (note: I am seeking to extract "Plaintext" from
these sites, and using "Plaintext" as the second argument in the
Import statement also yields a $Failed!)

Best,

Mark
