MathGroup Archive: February 2006 [00453]


Re: Statistical Analysis & Pattern Matching

To: mathgroup at smc.vnet.net
Subject: [mg64540] Re: Statistical Analysis & Pattern Matching
From: Bill Rowe <readnewsciv at earthlink.net>
Date: Mon, 20 Feb 2006 22:31:22 -0500 (EST)
Sender: owner-wri-mathgroup at wolfram.com

On 2/20/06 at 6:29 AM, virtualadepts at gmail.com wrote:

>If I have a set of random data, which could be the result of rolling a
>6-sided die 1,000 times, and the die is weighted to roll 2 numbers
>more often than the others, how do I analyze the data to determine
>which numbers it favors without knowing in advance that it favors any
>of the numbers?

For this particular case, a reasonably efficient method is to generate a table showing each number and the number of times it occurs, i.e.,

{First@#,Length@#}&/@Split[Sort@data]
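As a rough sketch of how such a table might be read, the following simulates a loaded die and expresses each count as a number of standard deviations away from the count a fair die would give. The weights are made up for illustration, and RandomChoice assumes a Mathematica version that provides it.

```mathematica
(* Hypothetical loaded die: faces 3 and 5 get twice the weight of the others *)
data = RandomChoice[{1, 1, 2, 1, 2, 1} -> Range[6], 1000];

(* Table of {face, count} pairs, built as described above *)
counts = {First@#, Length@#} & /@ Split[Sort@data];

(* For a fair die each count is Binomial[n, 1/6] *)
n = Length[data];
mean = n/6;
sd = Sqrt[n (1/6) (5/6)];

(* {face, count, deviation from the fair-die mean in standard deviations} *)
{First@#, Last@#, N[(Last@# - mean)/sd]} & /@ counts
```

Faces whose deviation is several standard deviations above zero are the candidates for being favored. The exact numbers change from run to run because the data is random.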

>Considering I am looking at random data it is impossible to say if
>the die favors any number for sure, but I can assume that it
>favors a number and check to see which numbers it would favor if it
>did.

>That is an example of the type of problem I want to solve but I can
>think of others.  How about an algorithm that generates random
>numbers between 1 and 1,000,000.  Let's say I have a database of 10
>million numbers it has generated, and want to determine what
>numbers it favors.

This is clearly too many values for a simple table to be useful. In fact, the range of the data is so large that even a histogram plot would not be useful. So, you would need to compute a meaningful summary statistic. Which statistic depends on exactly what you want to know.

>This is not the same question as asking if it is random data, because
>for our purposes it is random.  This is just asking if it is more
>likely to produce certain numbers.

Yes, they are different questions. But depending on exactly what information you want, the test used may be the same. The examples you have given above suggest you are always starting with a set of numbers assumed to be drawn from a uniform distribution. So, if I wanted to determine whether at least one number occurred more often than expected, but did not need to identify that number, I would likely compute either a chi-squared test statistic or the Kolmogorov-Smirnov test statistic.
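For the die example, a chi-squared test might be sketched as follows. This is only an illustration: the loaded-die weights are hypothetical, the code assumes every face appears at least once (practically certain with 1,000 rolls), and ChiSquareDistribution assumes a Mathematica version where it is built in.

```mathematica
(* Hypothetical loaded die, as a stand-in for real data *)
data = RandomChoice[{1, 1, 2, 1, 2, 1} -> Range[6], 1000];

(* Observed count for each face; assumes all six faces occur *)
observed = Length /@ Split[Sort@data];

(* Expected count per face under the uniform (fair die) hypothesis *)
expected = Length[data]/6;

(* Pearson chi-squared statistic: sum of (observed - expected)^2/expected *)
chi2 = N@Total[(observed - expected)^2/expected];

(* p-value from the chi-squared distribution with 6 - 1 = 5 degrees of freedom *)
pValue = 1 - CDF[ChiSquareDistribution[5], chi2];
{chi2, pValue}
```

A small p-value says only that the counts are unlikely under a fair die. It does not say which faces are responsible.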

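For the problem of 10 million random integers between 1 and 1 million, the expected number of times each value should occur is 10. The chi-squared statistic is the sum of (observed - expected)^2/expected over all 1 million possible values, so a value that never occurs contributes (0 - 10)^2/10 = 10. So, a chi-squared statistic could be computed as

(Total[(10-Length@#)^2&/@Split[Sort@data]] + 
100 Length@Complement[Range[10^6],data])/10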
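With this many categories, one way to judge the size of the statistic is the normal approximation to the chi-squared distribution, which has mean equal to the degrees of freedom and variance equal to twice that. The sketch below uses hypothetical, smaller sizes (10^4 possible values and 10^5 draws, so the expected count is still 10) to keep it quick, and uses Tally, which assumes a version of Mathematica that provides it.

```mathematica
(* Hypothetical sizes, smaller than the 10^6 values and 10^7 draws discussed above *)
k = 10^4; n = 10 k;
data = RandomInteger[{1, k}, n];

expected = n/k;                   (* 10 occurrences per value *)
observed = Last /@ Tally[data];   (* counts of the values that did occur *)
missing = k - Length[observed];   (* values that never occurred *)

(* Each missing value contributes (0 - expected)^2/expected = expected *)
chi2 = Total[(observed - expected)^2]/expected + missing expected;

(* Under uniformity, chi2 is roughly normal with mean k - 1 and variance 2 (k - 1) *)
z = N[(chi2 - (k - 1))/Sqrt[2 (k - 1)]]
```

A z value far above zero suggests some values occur too often or too rarely. A z value far below zero suggests the counts are more even than chance would produce.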

On the other hand, if you want to identify the number that occurs most often, then

#1[[Ordering[Last/@#1, -1]]]&@
  ({First@#, Length@#}&/@Split[Sort@data])

will give you the value that occurs most often and the number of times that value occurs. This could be compared with percentage points of the binomial distribution to see if the number of times the value occurred is significantly more (or less) than expected.
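As a sketch of that comparison: under the uniform hypothesis the count for any one value is binomial with 10^7 trials and success probability 10^-6. The observed maximum count below is a made-up number, and SurvivalFunction assumes a Mathematica version that provides it. Because the maximum is taken over 10^6 values, the tail probability is also multiplied by 10^6 to give a crude (Bonferroni) bound.

```mathematica
n = 10^7;             (* number of draws *)
k = 10^6;             (* number of possible values *)
maxCount = 30;        (* hypothetical count of the most frequent value *)

dist = BinomialDistribution[n, N[1/k]];

(* P(count >= maxCount) for one prespecified value *)
tail = SurvivalFunction[dist, maxCount - 1];

(* Crude bound on P(some value reaches maxCount), allowing for all k values *)
bound = Min[1, k tail];
{tail, bound}
```

If the bound is small, a count that large is unlikely to arise by chance for any value at all.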

Or, if you thought there might be several values that occurred more often than they should, the n most frequent values with their counts can be extracted with

#1[[Ordering[Last/@#1, -n]]]&@
  ({First@#, Length@#}&/@Split[Sort@data])
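A possible alternative, assuming a Mathematica version that has Tally and SortBy, builds the same {value, count} pairs without sorting the data first and lists the most frequent value first. The sample data is made up so that the result can be checked by eye.

```mathematica
data = {3, 1, 3, 2, 3, 2, 5};   (* tiny hypothetical sample *)
n = 2;

(* Tally gives {value, count} pairs; sorting by negated count puts the largest first *)
Take[SortBy[Tally[data], -Last[#] &], n]
(* → {{3, 3}, {2, 2}} *)
```

The Ordering version above returns the same pairs in increasing order of count, here {{2, 2}, {3, 3}}.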

Alternatively, it may be more useful to find all values that occur more often than a given threshold. This could be done as:

Cases[{First@#, Length@#}&/@Split[Sort@data],{_,_?(#>threshold&)}]
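One way to pick the threshold, offered only as a sketch, is to take a high quantile of the binomial distribution that each count follows under the uniform hypothesis. The false-alarm level below is an arbitrary choice, and the random data is only a stand-in for the real data set.

```mathematica
n = 10^7; k = 10^6;
data = RandomInteger[{1, k}, n];   (* stand-in for the real data *)

(* Each count is Binomial[n, 1/k] if the generator is uniform *)
dist = BinomialDistribution[n, N[1/k]];

(* Smallest count whose CDF reaches 1 - 1/(100 k), so about 0.01 false alarms are expected over all k values *)
threshold = Quantile[dist, 1 - 1/(100 k)];

Cases[{First@#, Length@#} & /@ Split[Sort@data], {_, _?(# > threshold &)}]
```

For uniform data the result is usually the empty list. Values returned for real data would be the ones worth a closer look.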

>Lets say for this example that the machine is programmed to never
>produce the same number twice, until it has randomly generated
>every other possible number.  Is there a way to predict this is
>happening by looking at the data?  Normally the gambler's fallacy
>isn't a useful idea, but in this case it would help you know in
>advance what the machine will generate because that is how it is
>programmed.

Possibly. For example, a popular method to generate random numbers is a linear congruential generator. With values created using this method, it is possible to compute the parameters of the generator from a sample shorter than the period of the generator. But this problem has no general solution. If the algorithm used is well designed, it will not be possible to predict the sequence without first having the output from the entire period. And with a suitably large period, the storage requirements are high enough that there is no practical way to create a program that can predict the next number output from the random number generator being studied.
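To illustrate the linear congruential case, here is a minimal sketch under strong assumptions: the generator has the form x -> Mod[a x + c, m], its raw outputs are observed directly, and the modulus m is known and prime. The parameter values are made up for the example. When the modulus is unknown or the outputs are truncated, more work is needed.

```mathematica
(* Hypothetical generator parameters, chosen only for illustration *)
m = 2^31 - 1; a = 16807; c = 12345;

(* Three consecutive outputs starting from an arbitrary seed *)
{x0, x1, x2} = NestList[Mod[a # + c, m] &, 42, 2];

(* Since x2 - x1 == a (x1 - x0) (mod m), a follows from a modular inverse *)
aEst = Mod[(x2 - x1) PowerMod[x1 - x0, -1, m], m];
cEst = Mod[x1 - aEst x0, m];

(* The recovered parameters match and predict the next output *)
{aEst == a, cEst == c, Mod[aEst x2 + cEst, m] == Mod[a x2 + c, m]}
(* → {True, True, True} *)
```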

>How would I solve these problems using Mathematica?

Solving the last problem you posed requires a lot more discussion than is appropriate for this medium. Hopefully, the comments I made above give you some ideas for solving the other problems you posed.
--
To reply via email subtract one hundred and four
