MathGroup Archive 2011

Re: Count Occurrence of words in a long text

  • To: mathgroup at smc.vnet.net
  • Subject: [mg118988] Re: Count Occurrence of words in a long text
  • From: Murray Eisenberg <murray at math.umass.edu>
  • Date: Thu, 19 May 2011 07:40:47 -0400 (EDT)

Ah...I forgot about Tally, which several posters used; that eliminates 
the need for Union and Count.
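
For concreteness, here is a minimal sketch of what a Tally-based variant of the quoted wordCounts might look like. It is a reconstruction under the same tokenizing assumption (WordCharacter ..), not any particular poster's code, and the name wordCountsTally is invented here for illustration.

```mathematica
(* Sketch only: wordCountsTally is a hypothetical name, not from the thread. *)
(* Tally returns {word, count} pairs directly, so Union and Count are not needed. *)
wordCountsTally[txt_String] :=
  Reverse@SortBy[
    Tally[StringCases[ToLowerCase[txt], WordCharacter ..]],
    Last]

(* Small usage example: "the" occurs 3 times, so it is listed first. *)
wordCountsTally["the cat and the hat and the bat"]
```

Tally makes one pass over the word list, whereas mapping Count over the unique words rescans the whole list once per distinct word, which matters for a long text.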

But none of the posters addressed the issue that my response raised as 
an aside: how to deal with the trailing "s" in possessives. Then there's 
the additional question of how to handle contractions--indeed, how to 
distinguish a contraction such as "here's" from a possessive such as 
"world's".

That is not so much an issue of coding as an issue of deducing semantics 
from textual syntax, or at least of using a dictionary look-up to 
detect contractions.
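
As a rough illustration of the look-up idea, the sketch below first keeps an apostrophe and the letters after it attached to the word, then checks each token against a list of known contractions. Everything named here (wordsWithApostrophes, knownContractions, classifyToken) is hypothetical, the list is deliberately tiny, and only the straight apostrophe ' is handled. It is a heuristic, not a solution to the semantic problem: a form such as "alice's" can be a possessive or a contraction depending on context, and a list cannot decide that.

```mathematica
(* Tokenize, keeping an apostrophe plus following letters with the word. *)
(* Assumption: only the straight apostrophe ' occurs in the text. *)
wordsWithApostrophes[txt_String] :=
  StringCases[ToLowerCase[txt],
    (WordCharacter ..) ~~ (("'" ~~ (LetterCharacter ..)) | "")]

(* Hypothetical look-up list; a usable one would be far longer. *)
knownContractions = {"here's", "there's", "it's", "that's",
   "what's", "who's", "he's", "she's", "let's"};

(* Label one lowercase token. The three labels are illustrative only. *)
classifyToken[w_String] := Which[
  MemberQ[knownContractions, w], "contraction",
  StringMatchQ[w, __ ~~ "'s"], "possessive",
  True, "plain"]

tokens = wordsWithApostrophes["Here's the world's end"]
(* {"here's", "the", "world's", "end"} *)

{#, classifyToken[#]} & /@ tokens
(* {{"here's", "contraction"}, {"the", "plain"},
    {"world's", "possessive"}, {"end", "plain"}} *)
```

Contractions that do not end in 's, such as "don't", fall through to "plain" in this sketch; whether to count them as single words is a separate decision.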

Do any indexers or concordance authors read this list and have a nice solution?

On 5/18/2011 7:18 AM, Murray Eisenberg wrote:
> Here's one approach, which I've encapsulated in a Module for convenience:
>
>     wordCounts[txt_] :=
>       Module[{words,unique,counts},
>         words=StringCases[ToLowerCase[txt],WordCharacter..];
>         unique=Union[words];
>         counts=Count[words,#]&/@unique;
>         Reverse@SortBy[Transpose[{unique,counts}],Last]
>     ]
>
>     (* example *)
>     story = ExampleData[{"Text", "AliceInWonderland"}];
>     wordCounts[story]
>
> {{"the", 632}, {"and", 338}, {"a", 278}, {"to", 252}, {"she",
>     242}, {"of", 199},...
>
> If you want a nice table printout, just use TableForm:
>
>      wordCounts[story] // TableForm
>
> There's at least one anomaly: the "s" at the end of possessives is split
> off as a separate word.
>
> On 5/17/2011 7:47 AM, Yako wrote:
>> Hello,
>>
>> First of all I am pretty new to Mathematica, so excuse me if this has
>> a simple answer.
>>
>> What I need is to count how many times each word appears in a
>> text. I know how to do this in other languages, but I am trying to
>> achieve it with Mathematica.
>>
>> Can someone point me in the right direction?
>>
>> Thanks!
>>
>

-- 
Murray Eisenberg                     murray at math.umass.edu
Mathematics & Statistics Dept.
Lederle Graduate Research Tower      phone 413 549-1020 (H)
University of Massachusetts                413 545-2859 (W)
710 North Pleasant Street            fax   413 545-1801
Amherst, MA 01003-9305

