MathGroup Archive: June 2010 [00248]

[Date Index] [Thread Index] [Author Index]
Re: Using mathematica to read website
To: mathgroup at smc.vnet.net
Subject: [mg110367] Re: Using mathematica to read website
From: Leonid Shifrin <lshifr at gmail.com>
Date: Tue, 15 Jun 2010 02:30:50 -0400 (EDT)
Hi Kevin,

I think that the notion of "words" is not sharply defined in this context,
since html is not a plain text file - you have words in main text, in links,
image descriptions and many other places.

I have written a simplistic HTML parser a couple of years ago for my own
purposes. I intended (and still do) to polish it, convert into one or
several packages and make available to anybody interested - but never had
time to do it. Your post prompted me to speed things up in this regard, and
as a first step, here is the code as it is now - may be you or someone else
will find it of some use, since it will still take me some time to bring it
to the proper package format.

-------------------------------------------------------------------------------------

ClearAll[listSplit, reconstructIntervals, groupElements,
  groupPositions, processPosList, groupElementsNested, getAllUsedTags,
   refineTags, getTagTitle, getPairedTags, makeTagReplaceRules,
  getTagNames, makeTagHashRules, tagSplit, splitText, preparse,
  openCloseEnumerate, getOpenCloseForm, makeTagDepthList,
  oneStepParse, tagProcess, openCloseProcess, documentParse,
  refineParsed, parseText, removeLeaves, attribContainer, postProcess];

listSplit[x_List, lengthlist_List, headlist_List] :=
  MapThread[#1 @@ Take[x, #2] &, {headlist,
    Transpose[{Most[#] + 1, Rest[#]} &[
      FoldList[Plus, 0, lengthlist]]]}];

reconstructIntervals[listlen_Integer, ints_List] :=
  Module[{missed, startint, lastint},
   startint  = If[ints[[1, 1]] == 1, {}, {1, ints[[1, 1]] - 1}];
   lastint =
    If[ints[[-1, -1]] == listlen, {}, {ints[[-1, -1]] + 1, listlen}];
   missed =
    Map[If[#[[2, 1]] - #[[1, 2]] >
        1, {#[[1, 2]] + 1, #[[2, 1]] - 1}, {}] &,
     Partition[ints, 2, 1]];
   missed = Join[missed, {lastint}];
   Prepend[Flatten[Transpose[{ints, missed}], 1], startint]];

groupElements[lst_List, poslist_List, headlist_List] /;
   And[OrderedQ[Flatten[Sort[poslist]]],
    Length[headlist] == Length[poslist]] :=
  Module[{totalheadlist, allints, llist},
   totalheadlist =
    Append[
     Flatten[
      Transpose[
       {Array[Sequence &, {Length[headlist]}], headlist}],
      1],
     Sequence];
   allints = reconstructIntervals[Length[lst], poslist];
   llist = Map[If[# === {}, 0, 1 - Subtract @@ #] &, allints];
   listSplit[lst, llist, totalheadlist]];

(* To work on general heads, we need this *)

groupElements[h_[x__], poslist_List, headlist_List] :=
  h[Sequence @@ groupElements[{x}, poslist, headlist]];

(* If we have a single head *)
groupElements[expr_, poslist_List, head_] :=
  groupElements[expr, poslist, Table[head, {Length[poslist]}]];


groupPositions[plist_List] :=
  Reap[Sow[Last[#], {Most[#]}] & /@ plist, _, List][[2]];


processPosList[{openlist_List, closelist_List}] :=
  Module[{opengroup, closegroup, poslist},
   {opengroup, closegroup} = groupPositions /@ {openlist, closelist} ;
   poslist =
    Transpose[Transpose[Sort[#]] & /@ {opengroup, closegroup}];
   If[UnsameQ @@ poslist[[1]],
    Return[(Print["Unmatched lists!", {openlist, closelist}]; {})],
    poslist =
     Transpose[{poslist[[1, 1]], Transpose /@ Transpose[poslist[[2]]]}]
    ]
   ];

groupElementsNested[nested_, {openposlist_List, closeposlist_List},
    head_] /; Head[head] =!= List :=
  Fold[Function[{x, y},
    MapAt[groupElements[#, y[[2]], head] &, x, {y[[1]]}]], nested,
   Sort[processPosList[{openposlist, closeposlist}],
    Length[#2[[1]]] < Length[#1[[1]]] &]];


(* Find all the tags in the document*)
getAllUsedTags[text_String] :=
  Module[{htmlTagsposlist, result, chars = Characters[text], x},
   htmlTagsposlist =
    StringPosition[text,
     ShortestMatch["<" ~~ x__ ~~ Whitespace | ">"], Overlaps -> True];
   result =
    Union[Map[ToLowerCase,
       StringJoin @@@
        Map[Take[chars, {#[[1]], #[[2]] - 1}] &,
         htmlTagsposlist]] /. {"<br/" :> "<br"}]];

(* get rid of tags with non-alphabetic characters *)
refineTags[tags_List] :=
  Module[{alphabet = Characters["abcdefghijklmnopqrstuvwxyz/"]},
   DeleteCases[tags,
    x_ /; ! MemberQ[alphabet, StringTake[x, {2, 2}]]]];

getTagTitle[tag_String] :=
  If[StringTake[tag, {2, 2}] === "/", StringDrop[tag, {2}], tag];

(* Find all paired tags *)
getPairedTags[tags_List] :=
  Reverse /@
   Select[Reap[Sow[#, getTagTitle[#]] & /@ tags, _, #2 &][[2]],
    Length[#] == 2 &];

(* Prepare string replacement rules to be used when we tokenize the \
string *)
makeTagReplaceRules[pairedtags_List, unpairedtags_List] :=
  Sort[Join[
    Rule @@@
     Map[{#, {StringDrop[#, 1], "Unpaired", "Open"}} &, unpairedtags],
     Rule @@@
     Flatten[Transpose[{#, {{StringDrop[#[[1]], 1],
            "Open"}, {StringDrop[#[[2]], 2], "Close"}}}] & /@
       pairedtags, 1], {Rule[">", {">", "UnpairedClose"}]}],
   StringLength[#1[[2, 1]]] > StringLength[#2[[2, 1]]] &];

getTagNames[pairedtags_List, unpairedtags_List] :=
  StringDrop[#, 1] & /@ Join[Transpose[pairedtags][[1]], unpairedtags];

makeTagHashRules[tagnames_List] :=
  Dispatch[MapThread[Rule, {#, Range[Length[#]]} &[tagnames]]];

(* Tokenize *)
tagSplit[text_String, {tagrules__Rule}] :=
  DeleteCases[StringSplit[text, {tagrules}],
   x_ /; StringMatchQ[x, Whitespace | "" ~~ ">" ~~ Whitespace | ""]];

splitText[text_String, pairedtags_List, unpairedtags_List] :=
  tagSplit[text, makeTagReplaceRules[pairedtags, unpairedtags]];

(* Insert containers for attributes as a pre-parsing stage *)
preparse[text_] :=
  Module[{step1},
   With[{pos = Position[text, {_, "Open"} | {_, _, "Open"}, Infinity]},
    step1 =
     ReplacePart[text, attribContainer /@ Extract[text, pos + 1],
      pos + 1, List /@ Range[Length[pos]]]]];

(* insert tag depth for each open/close tag sublist in the tokenized \
list *)
openCloseEnumerate[splittext_List, pairedtags_List,
   unpairedtags_List] :=
  Module[{tagnames, taghashrules, tagtitlecounters,
    unpairedstack = {}, temptag},
   tagnames = getTagNames[pairedtags, unpairedtags];
   taghashrules = makeTagHashRules[tagnames];
   tagtitlecounters = Table[0, {Length[tagnames]}];
   Map[Switch[#, {x_, "Open"}, {#[[1]],
       "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {x_,
       "Close"}, {#[[1]], "Close",
       tagtitlecounters[[#[[1]] /. taghashrules]]--},
      {x_, "Unpaired", "Open"},
      AppendTo[unpairedstack, #]; {#[[1]],
       "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {"/>" |
        ">", "UnpairedClose"},
      If[Length[unpairedstack] > 0,
       temptag = unpairedstack[[-1]];
       unpairedstack = Most[unpairedstack];
       {temptag[[1]], "Close",
        tagtitlecounters[[temptag[[1]] /. taghashrules]]--},
       (* else *)
       #
       ] ,
      _, #] &, splittext]];

getOpenCloseForm[text_String, pairedtags_List, unpairedtags_List] :=
  openCloseEnumerate[
   preparse@splitText[text, pairedtags, unpairedtags], pairedtags,
   unpairedtags];

(* Create a list of all used tags and their maximal depth *)
makeTagDepthList[opencloseform_List, pairedtags_List,
   unpairedtags_List] :=
  DeleteCases[{#,
      Max[Cases[
        opencloseform, {#, "Open" | "Close", x_Integer} :> x]]} & /@
    getTagNames[pairedtags, unpairedtags], {x_, -Infinity}];

(* Parse a single depth level of a given tag *)
oneStepParse[parsed_, depth_Integer, tag_String, head_] :=
  Module[{plist =
     Position[parsed, {tag, #, depth}, Infinity] & /@ {"Open",
       "Close"}}, groupElementsNested[parsed, plist, head]];

(* Parse given tag, all levels *)
tagProcess[parseme_, {tag_String, maxdepth_Integer}] :=
  Module[{hd = ToExpression[ tag <> "Container"], result},
   With[{hd1 = hd, ourtag = tag},
    hd1[{ourtag, "Open", n_}, x__, {ourtag, "Close", n_}] := hd1[x];
    result =
     Fold[oneStepParse[#1, #2, ourtag, hd1] &, parseme,
      Range[maxdepth, 1, -1]]];
   Clear[hd];
   result];

(* parse all tags *)
openCloseProcess[opencloseform_List, pairedtags_List,
   unpairedtags_List] :=  Fold[tagProcess, opencloseform,
   makeTagDepthList[opencloseform, pairedtags, unpairedtags]];

(* A few higher - level functions to combine steps *)

documentParse[text_String, pairedtags_List, unpairedtags_List] :=
  openCloseProcess[
   getOpenCloseForm[text, pairedtags, unpairedtags],
   pairedtags, unpairedtags];

refineParsed[parsed_] :=
  If[# === {}, #, First@#] &@Cases[parsed, _htmlContainer];

parseText[text_String] :=
  Module[{tags, paired, unpaired, parsed},
   tags = refineTags@getAllUsedTags@text;
   paired =  getPairedTags@tags;
   unpaired = Complement[tags, Flatten@paired];
   parsed =
    refineParsed@
      documentParse[text, paired, unpaired] /. {">",
       "UnpairedClose"} :> ">";
   {parsed, paired, unpaired}];

(* Keep the page skeleton only *)
removeLeaves[parsed_] := DeleteCases[parsed, _, {-1}];

(* Some optional post-processing - but leads to information loss *)
postProcess[parsed_] := DeleteCases[parsed, ">" | "", Infinity];

-------------------------------------------------------------------------------------

Some tidbits on the implementation: I developed a groupElementsNested
function which can be useful in many contexts, and allows to wrap elements
inside possibly nested expressions in additional wrappers, in reasonably
efficient manner, based on lists of their positions. The parser is based on
this functionality, and is breadth-first rather than depth-first. This
allows it to parse individual html tags and depth levels, if needed - for
example, we could instruct it to parse only <div> tags, etc. This also means
that it will not break completely on malformed html - may be it will just
not parse all tags or levels.

The parser computes used html tags dynamically from the document, and is
completely heuristic in that no strict html format rules are built into it.
So, it is not at all industrial strength or production quality. However, I
used the parser for my purposes and it worked reasonably well for me (in
terms of correctness and efficiency).

What it gives you is a symbolic tree representation of the parsed html
document, with different tags transformed into ...Container[content], for
example divContainer[aContainer["Some text"]]. The container names mirror
the html tag names, except for attribContainer which is for tag attributes.
Also, without post-processing, no information is lost - I was able to
reconstruct back the original html files from this symbolic representation,
I think in all cases I looked at.

Given this structure, one can use Mathematica rules and patterns  to further
process the document to one's needs. For example, here we will reconstruct
the text in the wikipedia page about parsers, reasonably well IMO:

page = Import["http://en.wikipedia.org/wiki/Parser";, "text"];
parsed  = postProcess@parseText[page];
StringJoin @@
 Cases[Cases[parsed, _pContainer, Infinity] /. _attribContainer :>
    Sequence[], _String, Infinity]

The code was (and is) intended to go under GPL, so you can treat it as such.
Once again, it has not been polished and does certainly contain some number
of bugs. If you discover any, please let me know.

Hope this helps.

Regards,
Leonid






On Sun, Jun 13, 2010 at 1:13 AM, kevin <kevin999koshy at gmail.com> wrote:

> Hi Guys,
>
>      Is there any way to use mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
>
>
Prev by Date: Re: difficulty using FindRoot
Next by Date: Re: DiscretePlot
Previous by thread: Re: Using mathematica to read website
Next by thread: Re: Find value of unknown const that causes integral to