MathGroup Archive: February 2011 [00246]

Re: Using Mathematica for text mining

To: mathgroup at smc.vnet.net
Subject: [mg116300] Re: Using Mathematica for text mining
From: "Sjoerd C. de Vries" <sjoerd.c.devries at gmail.com>
Date: Thu, 10 Feb 2011 05:22:48 -0500 (EST)
References: <iitejg$ji2$1@smc.vnet.net>

Hi Cameron,

My guess is that Mathematica succeeds in finding appropriate clusters
and now outputs them in their entirety, in this case as nested lists
containing the actual texts of the documents. This probably results in
very long output, which Mathematica presents in skeleton form.
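
To check whether the clustering itself worked without printing the full texts, one can look at the shape of the result instead of its contents. A minimal sketch, using four made-up strings (toyTexts is a hypothetical stand-in for the real documents, which would be far longer):

```mathematica
(* toy strings standing in for full documents; strings are compared
   with the default string distance *)
toyTexts = {"aa", "ab", "zz", "zy"};
toyClusters = FindClusters[toyTexts, 2];

(* number of clusters found *)
Length[toyClusters]

(* number of documents in each cluster *)
Length /@ toyClusters

(* abbreviated view of the result, roughly two lines long *)
Short[toyClusters, 2]
```

With real documents, Length and Short keep the output manageable even when each cluster element is an entire text.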

I'd suggest using the second syntax form of FindClusters and giving
each document a more or less descriptive label. The labels are then
used in the output instead of the documents themselves. I guess the
document names will do.
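
As a small illustration of the labelled form, here is a sketch on toy numbers. The names data and labels are made up for this example; the elements on the left of the rule are what gets clustered, and the labels on the right are what is returned.

```mathematica
(* toy data: four numbers, each paired with a hypothetical label *)
data = {1, 2, 10, 11};
labels = {"a", "b", "c", "d"};

(* cluster the numbers into 2 groups, but return the labels *)
byLabel = FindClusters[data -> labels, 2]

(* same request written as a list of element -> label rules *)
byRules = FindClusters[Thread[data -> labels], 2]
```

Both calls should return lists of labels rather than the numbers themselves, which is what keeps the output short when the elements are whole documents.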

Another thing is that we need a distance measure for comparing
documents. The default for text (edit distance) doesn't make much
sense here given the large dissimilarity of the texts (at least, I
assume that this is the case here). Instead, I'll represent each of my
sample documents by its character distribution and let FindClusters
compare those numeric vectors (you may come up with a measure of your
own and pass it through the DistanceFunction option).
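
For reference, a sketch of how a distance measure can be supplied through DistanceFunction. The vectors in freqs and the names in docNames are invented for illustration only; the option accepts either a built-in distance or any function of two arguments.

```mathematica
(* hypothetical character-frequency vectors for four short "documents" *)
freqs = {{0.50, 0.30, 0.20}, {0.48, 0.32, 0.20},
         {0.10, 0.20, 0.70}, {0.12, 0.18, 0.70}};
docNames = {"doc1", "doc2", "doc3", "doc4"};

(* a built-in distance *)
viaBuiltin = FindClusters[freqs -> docNames, 2,
   DistanceFunction -> ManhattanDistance]

(* a hand-written distance: sum of absolute differences *)
viaCustom = FindClusters[freqs -> docNames, 2,
   DistanceFunction -> (Total[Abs[#1 - #2]] &)]
```

The hand-written function here computes the same quantity as ManhattanDistance; it is only meant to show where a measure of your own would go.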

(* Titles of sample documents included in Mathematica. I excluded
texts with non-Latin alphabets and multiple variants of the UN
declaration *)
titles = ExampleData[
    "Text"][[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17,
     18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 50}]];

(* load documents *)
documents = ExampleData[#] & /@ titles;

(* count the various characters used *)
charTally = Tally /@ ToCharacterCode /@ ToLowerCase /@ documents;

(* convert counts to relative frequencies (each row sums to 1) and
convert the tallies to a table with one row per document and one
column per character code *)
charCountsTable =
  Normal[(#/Total[#] // N) & /@
    SparseArray[
     Flatten[MapIndexed[{#2[[1]], #1[[1]]} -> #1[[2]] & ,
       charTally, {2}], 2]]];

(* now find the clusters using the titles as labels *)
FindClusters[Thread[Rule[charCountsTable, titles[[All, 2]]]], 5] //
TableForm

Out[121]//TableForm= TableForm[{{
  "AeneidEnglish", "BeowulfModern", "MagnaCarta",
   "OnTheNatureOfThingsEnglish", "ShakespearesSonnets",
   "ToBeOrNotToBe"}, {
  "AeneidLatin", "DonQuixoteISpanish", "LesFleursDuMal",
   "LoremIpsum"}, {
  "AliceInWonderland", "CodeOfHammurabiEnglish", "DonQuixoteIEnglish",
    "FriendsRomansCountrymen", "GenesisKJV", "PlatoMenoEnglish",
   "Prufrock"}, {
  "BeowulfOldEnglish", "GettysburgAddress", "Hamlet", "JFKInaugural",
   "PrideAndPrejudice", "TheRaven"}, {
  "DeclarationOfIndependence", "FaustI", "FederalistTen",
   "OriginOfSpecies", "USConstitution"}}]

Note that the Latin, Spanish and French texts (and the pseudo-Latin
LoremIpsum) are all grouped in the second cluster.
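
The table construction in the code above is fairly dense, so here is the same sequence of steps applied to two toy strings, as a sketch. The names toyDocs, toyTally, toyRules and toyTable are made up for this example.

```mathematica
(* two toy strings standing in for documents *)
toyDocs = {"abba", "abc"};

(* per document: list of {character code, count} pairs *)
toyTally = Tally /@ ToCharacterCode /@ ToLowerCase /@ toyDocs;

(* rules of the form {document index, character code} -> count *)
toyRules = Flatten[
   MapIndexed[{#2[[1]], #1[[1]]} -> #1[[2]] &, toyTally, {2}], 1];

(* one row per document, one column per character code;
   each row is divided by its total so that it sums to 1 *)
toyTable = Normal[(#/Total[#] // N) & /@ SparseArray[toyRules]];

(* look only at the columns for "a", "b", "c" (codes 97 to 99) *)
toyTable[[All, 97 ;; 99]]
```

Each row is then a fixed-length numeric vector, which is what lets FindClusters compare documents of very different lengths.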

Cheers -- Sjoerd

On Feb 9, 8:09 am, Cameron Christiansen <c... at byu.edu> wrote:
> I've been playing around with Mathematica looking to see its text mining
> capabilities. I've mostly tried to use FindClusters. However I seem to have
> a disconnect on what I'm trying to do and what the system is trying to do
