Re: Determining a valid URL
- To: mathgroup at smc.vnet.net
- Subject: [mg112741] Re: Determining a valid URL
- From: "Hans Michel" <hmichel at cox.net>
- Date: Wed, 29 Sep 2010 04:12:24 -0400 (EDT)
To better understand the behavior of Import, you should use all of its available options. Import supports so many formats that, if you already know which format you will be dealing with, it is best to state it explicitly using the format and element arguments of Import. For example, you can use Import to get data from an Excel spreadsheet over HTTP. So look up $ImportFormats and use

Import["MyURL", "YourSupportedImportFormatExplicitlyStated"]

When importing HTML, if the web data served up is not well-formed XHTML (or HTML), then the XMLObject will be Null, so "Plaintext", which is derived from a successful XMLObject, will also fail. "Source", by contrast, is just the stream as it is consumed by Mathematica. When the XMLObject fails, you are on your own for parsing the plain text.

The plain text comes from the XMLObject's innerText property and should always be checked. Sometimes you only get what could be parsed into the XMLObject. You should consider using J/Link to call a Java HTML DOM parser and get the innerText from the DOM. Mathematica is making use of its XML capabilities to try to parse HTML.

Hans

-----Original Message-----
From: Mark Coleman [mailto:markspcoleman at gmail.com]
Sent: Tuesday, September 28, 2010 5:02 AM
To: mathgroup at smc.vnet.net
Subject: [mg112741] [mg112707] Determining a valid URL

Greetings,

I am exploring the use of Mathematica (v7) to import information from web sites. One step in my procedure is to check that a given web address is valid. One simple way that seems to work most of the time is to use the Import["my url"] command. An invalid web site (e.g., a misspelled URL) returns $Failed. But I've found instances where valid web sites also return $Failed if I issue the Import command with no format argument. For these sites, Import seems to work if I use "Source".
I'm trying to understand this behavior better so that I can write a more general URL checker. (Note: I am seeking to extract "Plaintext" from these sites, and using "Plaintext" as the second argument in the Import statement also yields $Failed!)

Best,
Mark
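
The advice in the reply (state the format explicitly, fall back on "Source" when "Plaintext" fails) could be sketched roughly as follows. This is only an illustration, not code from the thread: the names classifyImport and checkURL are hypothetical, "MyURL" is the placeholder used above, and the sketch assumes the page should be treated as HTML and that a failed import does not return a string.

```mathematica
(* Hypothetical helper names. Only Import, Quiet, StringQ, StringLength,
   Which, Module and $ImportFormats are built in. *)

(* Pure classification step, kept separate so it can be checked
   without touching the network. *)
classifyImport[source_, text_] := Which[
  ! StringQ[source], "unreachable",                   (* even the raw stream failed *)
  StringQ[text] && StringLength[text] > 0, "parsed",  (* "Plaintext" succeeded *)
  True, "source only"                                 (* reachable, but not parsed *)
]

(* Assumption: the URL serves HTML, so the format is stated explicitly. *)
checkURL[url_String] := Module[{source},
  source = Quiet[Import[url, {"HTML", "Source"}]];
  classifyImport[
    source,
    (* Only ask for "Plaintext" when the source could be fetched at all. *)
    If[StringQ[source], Quiet[Import[url, {"HTML", "Plaintext"}]], $Failed]
  ]
]

(* Usage, with the placeholder URL from the reply: *)
(* MemberQ[$ImportFormats, "HTML"] *)
(* checkURL["MyURL"] *)
```

A result of "source only" would correspond to the case described above, where the site is reachable but the plain text has to be extracted some other way, for example with a Java DOM parser through J/Link.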