MathGroup Archive: June 2010 [00238]


Re: Using mathematica to read website

To: mathgroup at smc.vnet.net
Subject: [mg110365] Re: Using mathematica to read website
From: "Hans Michel" <hmichel at cox.net>
Date: Tue, 15 Jun 2010 02:30:28 -0400 (EDT)
References: <hv23ul$5hl$1@smc.vnet.net>

Try

ReadList[
  StringToStream[
    StringReplace[Import["http://www.bloomberg.com/", "Source"],
      RegularExpression["<(.|\\n)*?>"] -> " "]],
  Word, WordSeparators -> {" ", "\t", "\n"}]

This particular page does not parse well as Plaintext. Even the XMLObject is 
missing the body element, so the Data, FullData, Hyperlinks, and Plaintext 
elements come back blank or empty.

In[1]:= Import["http://www.bloomberg.com/", "Elements"]
Out[1]= {Data,FullData,Hyperlinks,Plaintext,Source,Title,XMLObject}

So read the source, strip the tags with a brute-force regex, then stream the 
result and parse it into a list of words.
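
The same steps can be tried offline on a small inline string before pointing 
them at a live page. This is only a sketch: the html string below is a made-up 
stand-in for what Import[url, "Source"] would return, and the names html, 
stripped, stream and words are illustrative, not from the original one-liner.

```mathematica
(* Hypothetical stand-in for Import[url, "Source"], so no network access is needed *)
html = "<html><body><h1>Market News</h1>\n<p>Stocks <b>rise</b> today</p></body></html>";

(* Step 1: replace every tag with a space; (.|\n) lets a tag span several lines *)
stripped = StringReplace[html, RegularExpression["<(.|\\n)*?>"] -> " "];

(* Step 2: open the stripped text as a stream and read it word by word *)
stream = StringToStream[stripped];
words = ReadList[stream, Word, WordSeparators -> {" ", "\t", "\n"}];

(* Step 3: release the string stream once reading is done *)
Close[stream];

words
```

On this sample the result should be the five words Market, News, Stocks, rise, 
today. Note that the regex removes only the tags themselves, so on a real page 
the text inside script and style elements would still end up in the word list.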

Hans
"kevin" <kevin999koshy at gmail.com> wrote in message 
news:hv23ul$5hl$1 at smc.vnet.net...
> Hi Guys,
>
>      Is there any way to use mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
>
