MathGroup Archive 2011

Re: Using Mathematica for text mining

  • To: mathgroup at smc.vnet.net
  • Subject: [mg116304] Re: Using Mathematica for text mining
  • From: "Hans Michel" <hmichel at cox.net>
  • Date: Thu, 10 Feb 2011 05:23:35 -0500 (EST)

I think you need to use StringSplit[].

Your variable "documents" is probably a list of strings, one huge string per
file, so FindClusters treats each whole file as a single item.

I went to http://xml.coverpages.org/bosakShakespeare200.html and downloaded
the zip file, which contains XML (TEI) versions of the plays of Shakespeare,
then saved it locally.

document = 
 Import["D:\\Downloads\\junk\\shaks200\\as_you.xml", "Plaintext"]

docsplit = StringSplit[document]

Framed[Column[#]] & /@ FindClusters[Take[docsplit, 2000], 20]

This returned a set of clusters, though I am not an expert on this function.
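
Note that the call above clusters individual words, and for strings FindClusters (as far as I know) compares spelling by default, so it groups similar-looking words rather than similar documents. To cluster whole documents by word usage, one option is to turn each document into a vector of word counts and cluster those vectors. A minimal sketch, assuming four made-up sample strings in place of the imported files; the names docs, names, vocab and termVector are my own, not built-ins:

```mathematica
(* Made-up sample data standing in for the imported files (assumption) *)
docs = {
   "the cat sat on the mat",
   "the cat ate the rat",
   "stocks fell as markets closed",
   "markets rallied and stocks rose"};
names = {"doc1", "doc2", "doc3", "doc4"};

(* Split each document into lowercase words on whitespace *)
tokens = StringSplit[ToLowerCase[#]] & /@ docs;

(* Vocabulary: every distinct word across all documents, sorted *)
vocab = Union @@ tokens;

(* Hypothetical helper: count how often each vocabulary word occurs *)
termVector[words_List] := N[Count[words, #] & /@ vocab]
vectors = termVector /@ tokens;

(* Cluster the count vectors, but return the document names *)
clusters = FindClusters[vectors -> names, 2,
   DistanceFunction -> CosineDistance]
```

With real data, docs would be the imported "documents" list and names the "files" list from the original question. Raw counts favour long documents and very common words, so some weighting or stop-word removal would probably be needed before the clusters mean much.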

This was a rough, quick test. StringSplit would need better rules to
remove punctuation such as ".,!:;..."
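
One possible rule, as a sketch rather than a finished tokenizer: split on any run of characters that are not word characters, so punctuation and whitespace both act as delimiters. The name tokenize is hypothetical, chosen here for illustration:

```mathematica
(* Hypothetical helper: lowercase the text, then split on runs of
   non-word characters (punctuation, whitespace, etc.) *)
tokenize[text_String] :=
  StringSplit[ToLowerCase[text], Except[WordCharacter] ..]

(* Small check on a sample sentence *)
words = tokenize["To be, or not to be: that is the question."]
```

This also splits at apostrophes, so "it's" becomes "it" and "s"; the pattern would need adjusting if contractions matter. It can be applied to the imported text as tokenize[document].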

Hans

-----Original Message-----
From: Cameron Christiansen [mailto:cam at byu.edu] 
Sent: Wednesday, February 09, 2011 1:10 AM
To: mathgroup at smc.vnet.net
Subject: [mg116304] [mg116266] Using Mathematica for text mining

I've been playing around with Mathematica to explore its text mining
capabilities. I've mostly tried to use FindClusters. However, there seems to
be a disconnect between what I'm trying to do and what the system is doing.
One thing I'd like to do is document clustering. I have a number of files,
each representing a document. I'd like to have the documents clustered
together based on similarities in word usage. This is how I've approached it
thus far:

dir = SetDirectory["Projects/data/19960820"];
files = FileNames["*.xml"]
documents = Import[#, "Plaintext"] & /@ files

However, whenever I run operations on it, weird things happen. For example
when I perform FindClusters on it:

Framed[Column[#]] & /@ FindClusters[documents, 20]

I get the result:
{<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>,
<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>}

Can anyone give me pointers? I'm pretty new to Mathematica and would love
the help. Also, does anyone have resources / information on text mining in
Mathematica?

Thanks.


