Re: Using Mathematica for text mining

*To*: mathgroup at smc.vnet.net
*Subject*: [mg116288] Re: Using Mathematica for text mining
*From*: Cameron Christiansen <cam at byu.edu>
*Date*: Thu, 10 Feb 2011 05:20:25 -0500 (EST)

Thank you for the response. That works well for clustering the words of a single document together; however, I'd like to cluster entire documents together based on the words they contain. Is that possible?

On Wed, Feb 9, 2011 at 5:00 AM, Hans Michel <hmichel at cox.net> wrote:
> I think you may need to use StringSplit[]
>
> I think your variable "documents" is a list of Strings (huge strings).
>
> I went to http://xml.coverpages.org/bosakShakespeare200.html
>
> Downloaded the zip file which contains xml (TEI) versions of the plays of
> Shakespeare. Saved it locally.
>
> document =
>   Import["D:\\Downloads\\junk\\shaks200\\as_you.xml", "Plaintext"]
>
> docsplit = StringSplit[document]
>
> Framed[Column[#]] & /@ FindClusters[Take[docsplit, 2000], 20]
>
> Returned something. Not an expert on this function.
>
> These were rough and quick tests. StringSplit would need better rules to
> remove ".,!:;..."
>
> Hans
>
> -----Original Message-----
> From: Cameron Christiansen [mailto:cam at byu.edu]
> Sent: Wednesday, February 09, 2011 1:10 AM
> To: mathgroup at smc.vnet.net
> Subject: [mg116266] Using Mathematica for text mining
>
> I've been playing around with Mathematica looking to see its text mining
> capabilities. I've mostly tried to use FindClusters. However I seem to have
> a disconnect on what I'm trying to do and what the system is trying to do.
> One thing I'd like to do is document clustering. I have a number of files
> each representing a document. I'd then like to have the documents clustered
> together based on similarities in word usage. This is how I've approached
> it thus far:
>
> dir = SetDirectory["Projects/data/19960820"];
> files = FileNames["*.xml"]
> documents = Import[#, "Plaintext"] & /@ files
>
> However, whenever I run operations on it, weird things happen. For example
> when I perform FindClusters on it:
>
> Framed[Column[#]] & /@ FindClusters[documents, 20]
>
> I get the result:
> {<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>,
> <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>}
>
> Can anyone give me pointers? I'm pretty new to Mathematica and would love
> the help. Also, does anyone have resources / information on text mining in
> Mathematica?
>
> Thanks.
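
One possible way to approach the document-level clustering asked about at the top of this message is to turn each document into a word-count vector over a shared vocabulary and cluster those vectors rather than the raw strings. As far as I understand it, FindClusters compares strings by an edit-distance measure by default, which says little about shared vocabulary between long texts. The following is only a minimal sketch under stated assumptions and was not posted in the thread: the toy strings in docs, the labels in names, and the helper names tokenize and tf are all hypothetical stand-ins (docs plays the role of the imported "documents" list), and the choice of two clusters and of CosineDistance is illustrative.

```mathematica
(* Hypothetical toy corpus standing in for the imported files *)
docs = {
   "the cat sat on the mat",
   "a cat and a dog sat together",
   "stocks fell as markets closed lower",
   "markets rallied and stocks rose"};

(* Hypothetical labels; file names could be used here instead *)
names = {"doc1", "doc2", "doc3", "doc4"};

(* Lowercase a document and split it on runs of non-letter characters *)
tokenize[s_String] := StringSplit[ToLowerCase[s], Except[LetterCharacter] ..];
tokens = tokenize /@ docs;

(* Shared, sorted vocabulary across all documents *)
vocab = Union @@ tokens;

(* One term-frequency vector per document, aligned to vocab *)
tf[words_List] := N[Count[words, #] & /@ vocab];
vectors = tf /@ tokens;

(* Cluster the vectors but return the document labels, not the vectors *)
clusters = FindClusters[vectors -> names, 2, DistanceFunction -> CosineDistance]
```

With real data, docs would be replaced by the list returned from Import and names by the corresponding file names. Raw counts are the simplest weighting; removing very common words or rescaling the counts would likely matter on real text, but that is beyond this sketch.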