Re: Using mathematica to read website
- To: mathgroup at smc.vnet.net
- Subject: [mg110367] Re: Using mathematica to read website
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Tue, 15 Jun 2010 02:30:50 -0400 (EDT)
Hi Kevin, I think that the notion of "words" is not sharply defined in this context, since html is not a plain text file - you have words in main text, in links, image descriptions and many other places. I have written a simplistic HTML parser a couple of years ago for my own purposes. I intended (and still do) to polish it, convert into one or several packages and make available to anybody interested - but never had time to do it. Your post prompted me to speed things up in this regard, and as a first step, here is the code as it is now - may be you or someone else will find it of some use, since it will still take me some time to bring it to the proper package format. ------------------------------------------------------------------------------------- ClearAll[listSplit, reconstructIntervals, groupElements, groupPositions, processPosList, groupElementsNested, getAllUsedTags, refineTags, getTagTitle, getPairedTags, makeTagReplaceRules, getTagNames, makeTagHashRules, tagSplit, splitText, preparse, openCloseEnumerate, getOpenCloseForm, makeTagDepthList, oneStepParse, tagProcess, openCloseProcess, documentParse, refineParsed, parseText, removeLeaves, attribContainer, postProcess]; listSplit[x_List, lengthlist_List, headlist_List] := MapThread[#1 @@ Take[x, #2] &, {headlist, Transpose[{Most[#] + 1, Rest[#]} &[ FoldList[Plus, 0, lengthlist]]]}]; reconstructIntervals[listlen_Integer, ints_List] := Module[{missed, startint, lastint}, startint = If[ints[[1, 1]] == 1, {}, {1, ints[[1, 1]] - 1}]; lastint = If[ints[[-1, -1]] == listlen, {}, {ints[[-1, -1]] + 1, listlen}]; missed = Map[If[#[[2, 1]] - #[[1, 2]] > 1, {#[[1, 2]] + 1, #[[2, 1]] - 1}, {}] &, Partition[ints, 2, 1]]; missed = Join[missed, {lastint}]; Prepend[Flatten[Transpose[{ints, missed}], 1], startint]]; groupElements[lst_List, poslist_List, headlist_List] /; And[OrderedQ[Flatten[Sort[poslist]]], Length[headlist] == Length[poslist]] := Module[{totalheadlist, allints, llist}, totalheadlist = Append[ Flatten[ Transpose[ {Array[Sequence &, {Length[headlist]}], headlist}], 1], Sequence]; allints = reconstructIntervals[Length[lst], poslist]; llist = Map[If[# === {}, 0, 1 - Subtract @@ #] &, allints]; listSplit[lst, llist, totalheadlist]]; (* To work on general heads, we need this *) groupElements[h_[x__], poslist_List, headlist_List] := h[Sequence @@ groupElements[{x}, poslist, headlist]]; (* If we have a single head *) groupElements[expr_, poslist_List, head_] := groupElements[expr, poslist, Table[head, {Length[poslist]}]]; groupPositions[plist_List] := Reap[Sow[Last[#], {Most[#]}] & /@ plist, _, List][[2]]; processPosList[{openlist_List, closelist_List}] := Module[{opengroup, closegroup, poslist}, {opengroup, closegroup} = groupPositions /@ {openlist, closelist} ; poslist = Transpose[Transpose[Sort[#]] & /@ {opengroup, closegroup}]; If[UnsameQ @@ poslist[[1]], Return[(Print["Unmatched lists!", {openlist, closelist}]; {})], poslist = Transpose[{poslist[[1, 1]], Transpose /@ Transpose[poslist[[2]]]}] ] ]; groupElementsNested[nested_, {openposlist_List, closeposlist_List}, head_] /; Head[head] =!= List := Fold[Function[{x, y}, MapAt[groupElements[#, y[[2]], head] &, x, {y[[1]]}]], nested, Sort[processPosList[{openposlist, closeposlist}], Length[#2[[1]]] < Length[#1[[1]]] &]]; (* Find all the tags in the document*) getAllUsedTags[text_String] := Module[{htmlTagsposlist, result, chars = Characters[text], x}, htmlTagsposlist = StringPosition[text, ShortestMatch["<" ~~ x__ ~~ Whitespace | ">"], Overlaps -> True]; result = Union[Map[ToLowerCase, StringJoin @@@ Map[Take[chars, {#[[1]], #[[2]] - 1}] &, htmlTagsposlist]] /. {"<br/" :> "<br"}]]; (* get rid of tags with non-alphabetic characters *) refineTags[tags_List] := Module[{alphabet = Characters["abcdefghijklmnopqrstuvwxyz/"]}, DeleteCases[tags, x_ /; ! MemberQ[alphabet, StringTake[x, {2, 2}]]]]; getTagTitle[tag_String] := If[StringTake[tag, {2, 2}] === "/", StringDrop[tag, {2}], tag]; (* Find all paired tags *) getPairedTags[tags_List] := Reverse /@ Select[Reap[Sow[#, getTagTitle[#]] & /@ tags, _, #2 &][[2]], Length[#] == 2 &]; (* Prepare string replacement rules to be used when we tokenize the \ string *) makeTagReplaceRules[pairedtags_List, unpairedtags_List] := Sort[Join[ Rule @@@ Map[{#, {StringDrop[#, 1], "Unpaired", "Open"}} &, unpairedtags], Rule @@@ Flatten[Transpose[{#, {{StringDrop[#[[1]], 1], "Open"}, {StringDrop[#[[2]], 2], "Close"}}}] & /@ pairedtags, 1], {Rule[">", {">", "UnpairedClose"}]}], StringLength[#1[[2, 1]]] > StringLength[#2[[2, 1]]] &]; getTagNames[pairedtags_List, unpairedtags_List] := StringDrop[#, 1] & /@ Join[Transpose[pairedtags][[1]], unpairedtags]; makeTagHashRules[tagnames_List] := Dispatch[MapThread[Rule, {#, Range[Length[#]]} &[tagnames]]]; (* Tokenize *) tagSplit[text_String, {tagrules__Rule}] := DeleteCases[StringSplit[text, {tagrules}], x_ /; StringMatchQ[x, Whitespace | "" ~~ ">" ~~ Whitespace | ""]]; splitText[text_String, pairedtags_List, unpairedtags_List] := tagSplit[text, makeTagReplaceRules[pairedtags, unpairedtags]]; (* Insert containers for attributes as a pre-parsing stage *) preparse[text_] := Module[{step1}, With[{pos = Position[text, {_, "Open"} | {_, _, "Open"}, Infinity]}, step1 = ReplacePart[text, attribContainer /@ Extract[text, pos + 1], pos + 1, List /@ Range[Length[pos]]]]]; (* insert tag depth for each open/close tag sublist in the tokenized \ list *) openCloseEnumerate[splittext_List, pairedtags_List, unpairedtags_List] := Module[{tagnames, taghashrules, tagtitlecounters, unpairedstack = {}, temptag}, tagnames = getTagNames[pairedtags, unpairedtags]; taghashrules = makeTagHashRules[tagnames]; tagtitlecounters = Table[0, {Length[tagnames]}]; Map[Switch[#, {x_, "Open"}, {#[[1]], "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {x_, "Close"}, {#[[1]], "Close", tagtitlecounters[[#[[1]] /. taghashrules]]--}, {x_, "Unpaired", "Open"}, AppendTo[unpairedstack, #]; {#[[1]], "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]}, {"/>" | ">", "UnpairedClose"}, If[Length[unpairedstack] > 0, temptag = unpairedstack[[-1]]; unpairedstack = Most[unpairedstack]; {temptag[[1]], "Close", tagtitlecounters[[temptag[[1]] /. taghashrules]]--}, (* else *) # ] , _, #] &, splittext]]; getOpenCloseForm[text_String, pairedtags_List, unpairedtags_List] := openCloseEnumerate[ preparse@splitText[text, pairedtags, unpairedtags], pairedtags, unpairedtags]; (* Create a list of all used tags and their maximal depth *) makeTagDepthList[opencloseform_List, pairedtags_List, unpairedtags_List] := DeleteCases[{#, Max[Cases[ opencloseform, {#, "Open" | "Close", x_Integer} :> x]]} & /@ getTagNames[pairedtags, unpairedtags], {x_, -Infinity}]; (* Parse a single depth level of a given tag *) oneStepParse[parsed_, depth_Integer, tag_String, head_] := Module[{plist = Position[parsed, {tag, #, depth}, Infinity] & /@ {"Open", "Close"}}, groupElementsNested[parsed, plist, head]]; (* Parse given tag, all levels *) tagProcess[parseme_, {tag_String, maxdepth_Integer}] := Module[{hd = ToExpression[ tag <> "Container"], result}, With[{hd1 = hd, ourtag = tag}, hd1[{ourtag, "Open", n_}, x__, {ourtag, "Close", n_}] := hd1[x]; result = Fold[oneStepParse[#1, #2, ourtag, hd1] &, parseme, Range[maxdepth, 1, -1]]]; Clear[hd]; result]; (* parse all tags *) openCloseProcess[opencloseform_List, pairedtags_List, unpairedtags_List] := Fold[tagProcess, opencloseform, makeTagDepthList[opencloseform, pairedtags, unpairedtags]]; (* A few higher - level functions to combine steps *) documentParse[text_String, pairedtags_List, unpairedtags_List] := openCloseProcess[ getOpenCloseForm[text, pairedtags, unpairedtags], pairedtags, unpairedtags]; refineParsed[parsed_] := If[# === {}, #, First@#] &@Cases[parsed, _htmlContainer]; parseText[text_String] := Module[{tags, paired, unpaired, parsed}, tags = refineTags@getAllUsedTags@text; paired = getPairedTags@tags; unpaired = Complement[tags, Flatten@paired]; parsed = refineParsed@ documentParse[text, paired, unpaired] /. {">", "UnpairedClose"} :> ">"; {parsed, paired, unpaired}]; (* Keep the page skeleton only *) removeLeaves[parsed_] := DeleteCases[parsed, _, {-1}]; (* Some optional post-processing - but leads to information loss *) postProcess[parsed_] := DeleteCases[parsed, ">" | "", Infinity]; ------------------------------------------------------------------------------------- Some tidbits on the implementation: I developed a groupElementsNested function which can be useful in many contexts, and allows to wrap elements inside possibly nested expressions in additional wrappers, in reasonably efficient manner, based on lists of their positions. The parser is based on this functionality, and is breadth-first rather than depth-first. This allows it to parse individual html tags and depth levels, if needed - for example, we could instruct it to parse only <div> tags, etc. This also means that it will not break completely on malformed html - may be it will just not parse all tags or levels. The parser computes used html tags dynamically from the document, and is completely heuristic in that no strict html format rules are built into it. So, it is not at all industrial strength or production quality. However, I used the parser for my purposes and it worked reasonably well for me (in terms of correctness and efficiency). What it gives you is a symbolic tree representation of the parsed html document, with different tags transformed into ...Container[content], for example divContainer[aContainer["Some text"]]. The container names mirror the html tag names, except for attribContainer which is for tag attributes. Also, without post-processing, no information is lost - I was able to reconstruct back the original html files from this symbolic representation, I think in all cases I looked at. Given this structure, one can use Mathematica rules and patterns to further process the document to one's needs. For example, here we will reconstruct the text in the wikipedia page about parsers, reasonably well IMO: page = Import["http://en.wikipedia.org/wiki/Parser", "text"]; parsed = postProcess@parseText[page]; StringJoin @@ Cases[Cases[parsed, _pContainer, Infinity] /. _attribContainer :> Sequence[], _String, Infinity] The code was (and is) intended to go under GPL, so you can treat it as such. Once again, it has not been polished and does certainly contain some number of bugs. If you discover any, please let me know. Hope this helps. Regards, Leonid On Sun, Jun 13, 2010 at 1:13 AM, kevin <kevin999koshy at gmail.com> wrote: > Hi Guys, > > Is there any way to use mathematica to read all the words of a > website, say www.bloomberg.com? Thanks in advance. > > Best, > Kevin > >