MathGroup Archive 2011

Re: Using Mathematica for text mining

  • To: mathgroup at smc.vnet.net
  • Subject: [mg116288] Re: Using Mathematica for text mining
  • From: Cameron Christiansen <cam at byu.edu>
  • Date: Thu, 10 Feb 2011 05:20:25 -0500 (EST)

Thank you for the response. That works well for clustering the words within
a single document, but I'd like to cluster entire documents together based
on the words they contain. Is that possible?
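
One possible way to do this, sketched under assumptions rather than taken
from the thread: represent each document as a vector of word counts over a
shared vocabulary, then pass those vectors to FindClusters with a distance
that compares word usage. The toy corpus, the helper names tokenize and
termVector, the choice of CosineDistance, and the cluster count of 2 are all
illustrative assumptions.

```mathematica
(* Hypothetical toy corpus standing in for the imported "documents" list *)
documents = {
   "the cat sat on the mat",
   "the cat chased the mouse",
   "stocks fell as markets closed lower",
   "markets rallied and stocks rose"
   };

(* Lowercase each document and split on runs of non-word characters *)
tokenize[doc_String] :=
  StringSplit[ToLowerCase[doc], Except[WordCharacter] ..];

words = tokenize /@ documents;

(* Shared vocabulary across all documents *)
vocabulary = Union @@ words;

(* One term-count vector per document, aligned with the vocabulary *)
termVector[w_List] := N[Count[w, #] & /@ vocabulary];
vectors = termVector /@ words;

(* Cluster the vectors, but report document indices as the labels *)
clusters = FindClusters[vectors -> Range[Length[documents]], 2,
   DistanceFunction -> CosineDistance]
```

Labeling the vectors with Range[Length[documents]] makes FindClusters return
document indices instead of the raw vectors; using the list of file names as
labels would report each cluster by file instead.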

On Wed, Feb 9, 2011 at 5:00 AM, Hans Michel <hmichel at cox.net> wrote:

> I think you may need to use StringSplit[]
>
> I think your variable "documents" is a list of Strings (huge strings).
>
> I went to http://xml.coverpages.org/bosakShakespeare200.html
>
> Downloaded the zip file which contains xml (TEI) versions of the plays of
> Shakespeare. Saved it locally.
>
> document =
>  Import["D:\\Downloads\\junk\\shaks200\\as_you.xml", "Plaintext"]
>
> docsplit = StringSplit[document]
>
> Framed[Column[#]] & /@ FindClusters[Take[docsplit, 2000], 20]
>
> Returned something. Not an expert on this function.
>
> This was a rough and quick test. StringSplit would need better rules to
> remove punctuation such as ".,!:;..."
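>
> A minimal sketch of such a rule, assuming it is acceptable to split on any
> run of characters that are not letters or digits. The sample sentence is a
> made-up placeholder, not text from the downloaded file.
>
> ```mathematica
> (* Placeholder text standing in for the imported document *)
> sample = "All the world's a stage, and all the men and women merely players!";
>
> (* Treat every run of non-word characters as a delimiter *)
> StringSplit[ToLowerCase[sample], Except[WordCharacter] ..]
> ```
>
> One limitation of this rule: an apostrophe counts as a delimiter, so
> "world's" is split into "world" and "s".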
>
> Hans
>
> -----Original Message-----
> From: Cameron Christiansen [mailto:cam at byu.edu]
> Sent: Wednesday, February 09, 2011 1:10 AM
> To: mathgroup at smc.vnet.net
> Subject: [mg116266] Using Mathematica for text mining
>
> I've been experimenting with Mathematica to explore its text mining
> capabilities, mostly with FindClusters. However, there seems to be a
> disconnect between what I'm trying to do and what the system is doing.
> One thing I'd like to do is document clustering. I have a number of files,
> each representing a document, and I'd like to cluster the documents
> based on similarities in word usage. This is how I've approached it
> thus far:
>
> dir = SetDirectory["Projects/data/19960820"];
> files = FileNames["*.xml"]
> documents = Import[#, "Plaintext"] & /@ files
>
> However, whenever I run operations on the result, I get unexpected output.
> For example, when I run FindClusters on it:
>
> Framed[Column[#]] & /@ FindClusters[documents, 20]
>
> I get the result:
> {<<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>,
> <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>, <<1>>}
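>
> A quick way to inspect a result like this without printing the full text
> of every document is to look at sizes rather than contents. The sketch
> below assumes documents is a list of strings; the toy strings and the
> cluster count of 2 are placeholders.
>
> ```mathematica
> (* Placeholder strings standing in for the imported documents *)
> documents = {"the cat sat on the mat", "the cat sat on a mat",
>    "stocks fell sharply today"};
>
> Length[documents]        (* how many documents were read *)
> StringQ /@ documents     (* whether each import is a single string *)
>
> clusters = FindClusters[documents, 2];
> Length /@ clusters       (* number of documents in each cluster *)
> ```
>
> If as many clusters are requested as there are documents, each cluster can
> hold at most one item, which may be what the 20 single-element entries
> above reflect.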
>
> Can anyone give me pointers? I'm pretty new to Mathematica and would love
> the help. Also, does anyone have resources / information on text mining in
> Mathematica?
>
> Thanks.
>
>

