Re: Using mathematica to read website
- To: mathgroup at smc.vnet.net
- Subject: [mg110365] Re: Using mathematica to read website
- From: "Hans Michel" <hmichel at cox.net>
- Date: Tue, 15 Jun 2010 02:30:28 -0400 (EDT)
- References: <hv23ul$5hl$1@smc.vnet.net>
Try

  ReadList[
    StringToStream[
      StringReplace[Import["http://www.bloomberg.com/", "Source"],
        RegularExpression["<(.|\\n)*?>"] -> " "]],
    Word, WordSeparators -> {" ", "\t", "\n"}]

This particular page does not parse well as Plaintext. Even the XMLObject is missing the body element, so Data, FullData, Hyperlinks, and Plaintext all come back blank or empty.

In[1]:= Import["http://www.bloomberg.com/","Elements"]
Out[1]= {Data,FullData,Hyperlinks,Plaintext,Source,Title,XMLObject}

So read the source, use a brute-force regex to strip the tags, then stream the result and parse it into a list of words.

Hans

"kevin" <kevin999koshy at gmail.com> wrote in message news:hv23ul$5hl$1 at smc.vnet.net...
> Hi Guys,
>
> Is there any way to use mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
>
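
For anyone who wants to try the approach from the reply without hitting the network, here is a minimal sketch that runs the same three steps (strip tags with the regex, wrap the text in a stream, read it word by word) on a small literal string. The sample HTML and the variable names html, text, stream, and words are made up for illustration; they are not from the original post.

```mathematica
(* Hypothetical sample page standing in for Import[url, "Source"] *)
html = "<html><body><h1>Market News</h1>\n<p>Stocks rose\ttoday.</p></body></html>";

(* Step 1: replace every tag with a space; (.|\\n) lets a tag span lines *)
text = StringReplace[html, RegularExpression["<(.|\\n)*?>"] -> " "];

(* Step 2: wrap the remaining text in an input stream *)
stream = StringToStream[text];

(* Step 3: read words, treating space, tab, and newline as separators *)
words = ReadList[stream, Word, WordSeparators -> {" ", "\t", "\n"}];

(* Release the stream once the words have been read *)
Close[stream];

words
```

Note that the regex removes only the tags themselves. Anything sitting between <script> or <style> tags is kept as text, so on a real page the word list will probably include JavaScript and CSS fragments unless those blocks are removed first.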