Using Mathematica for text mining
- To: mathgroup at smc.vnet.net
- Subject: [mg116266] Using Mathematica for text mining
- From: Cameron Christiansen <cam at byu.edu>
- Date: Wed, 9 Feb 2011 02:09:41 -0500 (EST)
I've been experimenting with Mathematica to explore its text mining
capabilities. So far I've mostly tried FindClusters, but there seems to be
a disconnect between what I'm trying to do and what the system is doing.
One thing I'd like to do is document clustering. I have a number of files
each representing a document. I'd then like to have the documents clustered
together based on similarities in word usage. This is how I've approached it
thus far:
dir = SetDirectory["Projects/data/19960820"];
files = FileNames["*.xml"]
documents = Import[#, "Plaintext"] & /@ files
However, whenever I run operations on the resulting list, I get unexpected
results. For example, when I run FindClusters on it:
Framed[Column[#]] & /@ FindClusters[documents, 20]
I get the result:
{<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>,
<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>}
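To make the goal concrete, here is a rough sketch of the kind of thing I am
after, on toy data rather than my real files. The strings in docs, the labels
in names, and the helper countVector are all made up for illustration. I am
also assuming that word-count vectors compared with CosineDistance are a
reasonable stand-in for "similar word usage", which may well not be the best
choice:

```mathematica
(* Hypothetical toy documents standing in for the imported files *)
docs = {"the cat sat on the mat", "a cat and a dog",
   "stocks fell on monday", "stocks rose on tuesday"};
(* Hypothetical labels, so the clusters show names instead of full texts *)
names = {"d1", "d2", "d3", "d4"};

(* Lowercase each document and split it on runs of non-word characters *)
words = StringSplit[ToLowerCase[#], Except[WordCharacter] ..] & /@ docs;

(* Vocabulary: every distinct word seen in any document *)
vocab = Union @@ words;

(* Hypothetical helper: one count per vocabulary word, as a numeric vector *)
countVector[w_List] := N[Count[w, #] & /@ vocab];
vectors = countVector /@ words;

(* Cluster the numeric vectors, but report the labels (assumed 2 clusters) *)
clusters = FindClusters[vectors -> names, 2,
   DistanceFunction -> CosineDistance]
```

The idea is that FindClusters would work on the numeric vectors and only hand
back the short labels, instead of clustering the raw strings directly.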
Can anyone give me pointers? I'm pretty new to Mathematica and would love
the help. Also, does anyone have resources / information on text mining in
Mathematica?
Thanks.