Re: Using Mathematica for text mining
- To: mathgroup at smc.vnet.net
- Subject: [mg116300] Re: Using Mathematica for text mining
- From: "Sjoerd C. de Vries" <sjoerd.c.devries at gmail.com>
- Date: Thu, 10 Feb 2011 05:22:48 -0500 (EST)
- References: <iitejg$ji2$1@smc.vnet.net>
Hi Cameron,

My guess is that Mathematica succeeds in finding appropriate clusters and then outputs them in their entirety, in this case as nested lists containing the actual texts of the documents. That probably produces very long output, which Mathematica presents in skeleton form.

I'd suggest using the second syntax form of FindClusters (a list of rules, as in the code below) and giving each document a more or less descriptive label. The labels are then used in the output instead of the documents themselves. I guess the document names will do.

Another thing is that we need a distance measure for comparing documents. The default for text (edit distance) doesn't make much sense here, given how dissimilar the texts are (at least, I assume that is the case here). I'll use the character distribution of each of my sample documents to measure similarity (you may come up with a measure of your own and supply it through the DistanceFunction option).

(* Titles of sample documents included in Mathematica.
   I excluded texts with non-Latin alphabets and multiple variants of the UN declaration *)
titles = ExampleData["Text"][[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
     15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 50}]];

(* load documents *)
documents = ExampleData[#] & /@ titles;

(* count the various characters used *)
charTally = Tally /@ ToCharacterCode /@ ToLowerCase /@ documents;

(* convert counts to fractions of each document's total and convert the tallies to a table *)
charCountsTable =
  Normal[(#/Total[#] // N) & /@
    SparseArray[
     Flatten[MapIndexed[{#2[[1]], #1[[1]]} -> #1[[2]] &, charTally, {2}], 2]]];

(* now find the clusters using the titles as labels *)
FindClusters[Thread[Rule[charCountsTable, titles[[All, 2]]]], 5] // TableForm

Out[121]//TableForm=
TableForm[{
  {"AeneidEnglish", "BeowulfModern", "MagnaCarta", "OnTheNatureOfThingsEnglish", "ShakespearesSonnets", "ToBeOrNotToBe"},
  {"AeneidLatin", "DonQuixoteISpanish", "LesFleursDuMal", "LoremIpsum"},
  {"AliceInWonderland", "CodeOfHammurabiEnglish", "DonQuixoteIEnglish", "FriendsRomansCountrymen", "GenesisKJV", "PlatoMenoEnglish", "Prufrock"},
  {"BeowulfOldEnglish", "GettysburgAddress", "Hamlet", "JFKInaugural", "PrideAndPrejudice", "TheRaven"},
  {"DeclarationOfIndependence", "FaustI", "FederalistTen", "OriginOfSpecies", "USConstitution"}}]

Note that the second cluster contains only non-English texts.

Cheers -- Sjoerd

On Feb 9, 8:09 am, Cameron Christiansen <c... at byu.edu> wrote:
> I've been playing around with Mathematica looking to see its text mining
> capabilities. I've mostly tried to use FindClusters. However I seem to have
> a disconnect on what I'm trying to do and what the system is trying to do
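
As a footnote to the remark about the DistanceFunction option in the reply above, here is a minimal sketch of how a different measure could be plugged in. The frequency vectors and the labels docA to docD are made up purely for illustration and do not come from the sample texts. CosineDistance is a built-in function, and the second call uses a hand-written pure function (a sum of absolute differences) as one possible measure of your own.

```mathematica
(* Toy frequency vectors with labels; all values are invented for illustration *)
freqs = {
   {0.50, 0.30, 0.20} -> "docA",
   {0.48, 0.32, 0.20} -> "docB",
   {0.10, 0.20, 0.70} -> "docC",
   {0.12, 0.18, 0.70} -> "docD"};

(* Cluster into 2 groups, comparing the vectors with the built-in CosineDistance *)
FindClusters[freqs, 2, DistanceFunction -> CosineDistance]

(* Same call with a custom measure: the sum of absolute differences
   between the two vectors, written as a pure function of two arguments *)
FindClusters[freqs, 2, DistanceFunction -> (Total[Abs[#1 - #2]] &)]
```

The same option should be usable in the call on the real data, for example by appending DistanceFunction -> CosineDistance after the cluster count 5. Whether that gives a more useful grouping than the default measure is something to try out rather than assume.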