Re: Using Mathematica for text mining
- To: mathgroup at smc.vnet.net
- Subject: [mg116304] Re: Using Mathematica for text mining
- From: "Hans Michel" <hmichel at cox.net>
- Date: Thu, 10 Feb 2011 05:23:35 -0500 (EST)
I think you may need to use StringSplit[]. Your variable "documents" appears to be a list of strings (one huge string per file).

I went to http://xml.coverpages.org/bosakShakespeare200.html and downloaded the zip file, which contains XML (TEI) versions of the plays of Shakespeare. I saved it locally.

    document = Import["D:\\Downloads\\junk\\shaks200\\as_you.xml", "Plaintext"]
    docsplit = StringSplit[document]
    Framed[Column[#]] & /@ FindClusters[Take[docsplit, 2000], 20]

That returned something. I'm not an expert on this function, and this was a rough and quick test. StringSplit would need better rules to remove punctuation such as ".,!:;...".

Hans

-----Original Message-----
From: Cameron Christiansen [mailto:cam at byu.edu]
Sent: Wednesday, February 09, 2011 1:10 AM
To: mathgroup at smc.vnet.net
Subject: [mg116304] [mg116266] Using Mathematica for text mining

I've been playing around with Mathematica to see what its text mining capabilities are. I've mostly tried to use FindClusters. However, there seems to be a disconnect between what I'm trying to do and what the system is trying to do.

One thing I'd like to do is document clustering. I have a number of files, each representing a document. I'd like to have the documents clustered together based on similarities in word usage. This is how I've approached it so far:

    dir = SetDirectory["Projects/data/19960820"];
    files = FileNames["*.xml"]
    documents = Import[#, "Plaintext"] & /@ files

However, whenever I run operations on it, weird things happen. For example, when I run FindClusters on it:

    Framed[Column[#]] & /@ FindClusters[documents, 20]

I get the result:

    {<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>,
     <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>}

Can anyone give me pointers? I'm pretty new to Mathematica and would love the help. Also, does anyone have resources or information on text mining in Mathematica?

Thanks.
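
On the StringSplit punctuation rules mentioned in the reply, here is a minimal sketch of one way to do it. It is only an illustration, not a tested recipe for the Shakespeare files: the sample string and the names text, words, and clusters are made up here, and the choice of 3 clusters is arbitrary. Splitting on runs of non-letter characters removes ".,!:;" together with the whitespace, though it also breaks "world's" into "world" and "s".

```mathematica
(* Hypothetical sample text standing in for an imported document *)
text = "All the world's a stage, and all the men and women merely players; they have their exits and their entrances.";

(* Split on any run of non-letter characters; punctuation and whitespace are dropped *)
words = StringSplit[ToLowerCase[text], Except[LetterCharacter] ..];

(* Remove repeated words, then group the remaining words into 3 clusters *)
clusters = FindClusters[DeleteDuplicates[words], 3];

(* Display each cluster in its own frame, as in the thread *)
Framed[Column[#]] & /@ clusters
```

Note that this still clusters individual words by string similarity. Grouping whole documents by word usage would need a further step, for example turning each document into a vector of word counts before calling FindClusters.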