Re: Scriptable Mathematica tools for auto-editing text?
- To: mathgroup at smc.vnet.net
- Subject: [mg48667] Re: Scriptable Mathematica tools for auto-editing text?
- From: "John Jowett" <John.Jowett at cern.ch>
- Date: Thu, 10 Jun 2004 02:42:59 -0400 (EDT)
- Organization: CERN
- References: <c9haop$1ar$1@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
The standard Mathematica functions provide a very powerful toolkit for this kind of thing. It is worth reading the section "Files and Streams" in the Mathematica book and also becoming familiar with the Import, ReadList and Export functions.

There are various approaches, depending on what you want to do. It is often simplest just to convert a text file into a list of single characters and then apply rules. This can be done on the whole file as a single stream, or on a line-by-line basis if the line structure of the file is important (use Import[...,"Lines"] in that case).

As a worked example of this technique, here is a way to extract links from an HTML file. It requires some knowledge of rules and patterns in Mathematica. First do it step-by-step (use some HTML file to see how it works).

Import the file:

    htmlCode = Import["somepage.html", "Text"]

Break this long string into characters:

    htmlChars = Characters[htmlCode]

Look for a pattern delimiting a sequence of characters that should be a link (there may be other cases, but let's take the common href="...." syntax for this example) and put the contents into a wrapper (called htmlLink here):

    htmlLinks = htmlChars //.
        {before___, "h", "r", "e", "f", "=", "\"", link__, "\"", after___} :>
            {before, htmlLink[StringJoin[link]], after}

Throw away everything that did not get wrapped:

    htmlLinks = Cases[htmlLinks, htmlLink[_]]

The wrapper has served its purpose and can now be removed:

    htmlLinks /. htmlLink[l_] :> l

Finally, here is the whole process wrapped up in a function without the intermediate storage variables. This could be upgraded in various ways.

    linksInHTMLFile::usage = "linksInHTMLFile[file] returns a list of all the links in an HTML file."
    linksInHTMLFile[filename_String] :=
      Module[{htmlLink},
        Union[
          Cases[
            Characters[Import[filename, "Text"]] //.
              {before___, "h", "r", "e", "f", "=", "\"", link__, "\"", after___} :>
                {before, htmlLink[StringJoin[link]], after},
            htmlLink[_]] /. htmlLink[l_] :> l]]

The Union removes duplicates and sorts the links.

John Jowett

"AES/newspost" <siegman at stanford.edu> wrote in message news:c9haop$1ar$1 at smc.vnet.net...
> Anyone know of a Mathematica package or set of tools that can be used to
> operate on a text file and do, scriptably, the kinds of things one could
> do by hand using one of the more powerful text editors, like BBEdit,
> emacs or QUED/M? -- a set of modules that would do Search and Replace
> with some grep capabilities, pull out all the links in an HTML document,
> and so on.
>
> I'm sure I could cobble up an amateur effort myself to do this, but has
> someone else already done it better?
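
The line-by-line variant of the character-list technique mentioned in the reply can be sketched in the same way. The following is only an illustration under stated assumptions: the helper name stripTags and the sample lines are hypothetical and do not come from the post, and the list of strings stands in for the result of Import["somepage.html", "Lines"]. The rule deletes anything of the form <...> from each line, relying on the pattern matcher trying the shortest sequences first.

```mathematica
(* Hypothetical helper: remove anything of the form <...> from one line
   by repeatedly rewriting its list of characters. *)
stripTags[line_String] :=
  StringJoin[
    Characters[line] //.
      {before___, "<", ___, ">", after___} :> {before, after}]

(* Stand-in for Import["somepage.html", "Lines"], so that the sketch
   does not depend on a file being present. *)
lines = {
   "<p>See the <a href=\"http://example.org/\">example</a> page.</p>",
   "<hr>",
   "Plain text line"};

(* Apply the helper to each line separately, preserving line structure. *)
stripTags /@ lines
(* should give {"See the example page.", "", "Plain text line"} *)
```

Working one line at a time keeps each character list short, which matters because repeated rule application on a very long list can be slow. A tag that spans a line break, or a ">" inside an attribute value, would not be handled by this simple rule.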