MathGroup Archive 2010

Re: Using mathematica to read website

  • To: mathgroup at smc.vnet.net
  • Subject: [mg110365] Re: Using mathematica to read website
  • From: "Hans Michel" <hmichel at cox.net>
  • Date: Tue, 15 Jun 2010 02:30:28 -0400 (EDT)
  • References: <hv23ul$5hl$1@smc.vnet.net>

Try

ReadList[StringToStream[
  StringReplace[Import["http://www.bloomberg.com/", "Source"],
   RegularExpression["<(.|\\n)*?>"] -> " "]], Word,
 WordSeparators -> {" ", "\t", "\n"}]

This particular page does not parse well as Plaintext. Even the XMLObject is 
missing the body element, so the Data, FullData, Hyperlinks, and Plaintext 
elements all come back blank or empty.

In[1]:= Import["http://www.bloomberg.com/","Elements"]
Out[1]= {Data,FullData,Hyperlinks,Plaintext,Source,Title,XMLObject}

So read the source, use a brute-force regex to strip the tags, then stream 
the result and parse it into a list of words.
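
If it helps to see the steps separately, here is a rough sketch of the same idea that can be tried without hitting the network. The names stripTags, wordsFromHtml and the sample string are made up for illustration, and I have swapped the ReadList/StringToStream step for StringSplit, which splits on runs of whitespace. Treat it as one possible variant, not the only way to do it.

```mathematica
(* Hypothetical helper: replace every <...> tag with a space,
   using the same brute-force regex as in the one-liner above *)
stripTags[html_String] :=
  StringReplace[html, RegularExpression["<(.|\\n)*?>"] -> " "]

(* Hypothetical helper: split the tag-free text on whitespace.
   This stands in for ReadList[StringToStream[...], Word, ...] *)
wordsFromHtml[html_String] := StringSplit[stripTags[html]]

(* Small hand-written sample so the sketch runs offline *)
sample = "<html><body><p>Hello <b>world</b></p>\n<a href=\"x\">link text</a></body></html>";

wordsFromHtml[sample]

(* For a live page, pass the source in instead:
   wordsFromHtml[Import["http://www.bloomberg.com/", "Source"]] *)
```

Note that this only removes the tags themselves. Anything between script or style tags is ordinary text as far as the regex is concerned, so it will still show up in the word list.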

Hans
"kevin" <kevin999koshy at gmail.com> wrote in message 
news:hv23ul$5hl$1 at smc.vnet.net...
> Hi Guys,
>
>      Is there any way to use mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
> 


