MathGroup Archive 2004

Re: Scriptable Mathematica tools for auto-editing text?

  • To: mathgroup at smc.vnet.net
  • Subject: [mg48667] Re: Scriptable Mathematica tools for auto-editing text?
  • From: "John Jowett" <John.Jowett at cern.ch>
  • Date: Thu, 10 Jun 2004 02:42:59 -0400 (EDT)
  • Organization: CERN
  • References: <c9haop$1ar$1@smc.vnet.net>
  • Sender: owner-wri-mathgroup at wolfram.com

The standard Mathematica functions provide a very powerful toolkit for this
kind of thing.  There are various approaches depending on what you want to
do.  It is worth reading the section "Files and Streams" in the Mathematica
book and also being familiar with the Import, ReadList and Export functions.
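
As a minimal sketch of those three functions working together (the file name
demo.txt and its contents are made up for illustration), the following writes
a small text file to the temporary directory and reads it back in two ways:

```mathematica
(* Hypothetical scratch file in the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "demo.txt"}];

(* Write a three-line string out as plain text *)
Export[file, "one\ntwo\nthree", "Text"];

(* Read the file back as a list of lines, two ways *)
viaReadList = ReadList[file, String];
viaImport = Import[file, "Lines"];

(* Clean up the scratch file *)
DeleteFile[file];
```

Both viaReadList and viaImport should then hold the three lines as a list of
strings.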

It is often simplest just to convert a text file into a list of single
characters and then apply rules.  This can be done as a single stream or on
a line-by-line basis if the line structure of the file is important (use
Import[...,"Lines"] in that case).
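
A small sketch of both variants, assuming a made-up sample string in place of
real file contents (in practice the text would come from Import) and using
tab-to-comma replacement as an arbitrary example rule:

```mathematica
(* Sample text standing in for the contents of a file *)
text = "alpha\tbeta\ngamma\tdelta";

(* Single stream: split into characters, replace each tab, rejoin *)
asStream = StringJoin[Characters[text] /. "\t" -> ","];

(* Line by line: split into lines first, then treat each line separately *)
lines = StringSplit[text, "\n"];
perLine = Map[StringJoin[Characters[#] /. "\t" -> ","] &, lines];
```

The single-stream form keeps the newlines inside one string, whereas the
line-by-line form gives one edited string per line.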

As a worked example of this technique, here is a way to extract links from
an HTML file.  It requires some knowledge of rules and patterns in
Mathematica.  First do it step by step (use any HTML file to see how it
works):

Import the file as a single string:

htmlCode = Import["somepage.html", "Text"]

Break this long string into characters:

htmlChars = Characters[htmlCode]

Look for a pattern delimiting a sequence of characters that should be a link
(there may be other cases but let's take the common href="...." syntax for
this example) and put the contents into a wrapper (called htmlLink here):

htmlLinks = htmlChars //. {before___, "h", "r", "e", "f", "=", "\"", link__,
"\"", after___} :> {before, htmlLink[StringJoin[link]], after}

Throw away everything that did not get wrapped:

htmlLinks = Cases[htmlLinks, htmlLink[_]]

The wrapper has served its purpose and can now be removed:

htmlLinks /. htmlLink[l_] :> l
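
To try the steps without a file at hand, the same sequence can be run on a
short string.  This is only a sketch: sampleHTML and the other sample* names
are invented for the illustration.

```mathematica
(* Made-up HTML fragment containing two links *)
sampleHTML = "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>";

(* Break the string into single characters *)
sampleChars = Characters[sampleHTML];

(* Wrap the contents of each href="..." in htmlLink *)
sampleWrapped = sampleChars //. {before___, "h", "r", "e", "f", "=", "\"",
     link__, "\"", after___} :> {before, htmlLink[StringJoin[link]], after};

(* Keep only the wrapped items, then strip the wrapper *)
sampleLinks = Cases[sampleWrapped, htmlLink[_]] /. htmlLink[l_] :> l
(* {"a.html", "b.html"} *)
```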

Finally, here is the whole process wrapped up in a function without the
intermediate storage variables.  This could be upgraded in various ways.

linksInHTMLFile::usage = "linksInHTMLFile[file] returns a list of all the links in an HTML file."

linksInHTMLFile[filename_String] := Module[{htmlLink},
  Union[
    Cases[
      (* wrap the contents of every href="..." in htmlLink *)
      Characters[Import[filename, "Text"]] //.
        {before___, "h", "r", "e", "f", "=", "\"", link__, "\"", after___} :>
          {before, htmlLink[StringJoin[link]], after},
      htmlLink[_]
    ] /. htmlLink[l_] :> l  (* strip the wrapper again *)
  ]
]

The Union removes duplicates and sorts the links.
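
One possible upgrade, offered as a sketch and not as part of the method
above: more recent versions of Mathematica have string patterns, so the
same extraction can be done directly on the string with StringCases instead
of on a list of characters.  The name linksInHTMLString is hypothetical, and
the function takes the HTML text itself, not a file name.

```mathematica
(* Hypothetical variant: html is the page source as a single string *)
linksInHTMLString[html_String] :=
  Union[StringCases[html, "href=\"" ~~ Shortest[link__] ~~ "\"" :> link]]

linksInHTMLString[
  "<a href=\"b.html\">B</a> <a href=\"a.html\">A</a> <a href=\"b.html\">B again</a>"]
(* {"a.html", "b.html"} *)
```

For a file, the argument could be supplied as Import[filename, "Text"].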

John Jowett

"AES/newspost" <siegman at stanford.edu> wrote in message
news:c9haop$1ar$1 at smc.vnet.net...
> Anyone know of a Mathematica package or set of tools that can be used to
> operate on a text file and do, scriptably, the kinds of things one could
> do by hand using one of the more powerful text editors, like BBEdit,
> emacs or QUED/M? -- a set of modules that would do Search and Replace
> with some grep capabilities, pull out all the links in an HTML document,
> and so on.
>
> I'm sure I could cobble up an amateur effort myself to do this, but has
> someone else already done it better?
>


