MathGroup Archive: May 2000 [00088]

[Date Index] [Thread Index] [Author Index]
Re: Random Geometric (long)
To: mathgroup at smc.vnet.net
Subject: [mg23392] Re: [mg23340] Random Geometric (long)
From: Daniel Lichtblau <danl at wolfram.com>
Date: Fri, 5 May 2000 02:07:30 -0400 (EDT)
References: <200005040659.CAA17403@smc.vnet.net>
Sender: owner-wri-mathgroup at wolfram.com
Bernd Brandt wrote:
> 
> Dear All,
> 
> With the random function, it is possible to take random numbers from a distribution.
> I tried the GeometricDistribution and found something `strange', i think.
> 
> The Plot of the frequency of numbers shows a lower number for 23:
> 
> tg = Table[Random[GeometricDistribution[0.1]], {i, 1, 10000}];
> 
> Table[{i, Count[tg, i]}, {i, 21, 24}]
> {{21, 95}, {22, 98}, {23, 48}, {24, 93}}
> 
> {{21, 1102}, {22, 1015}, {23, 540}, {24, 877} for 100.000 Random numbers.
> 
> Is this indeed something strange for a Random generator?
> It still occurs with more Random numbers, and remains with repeated
> evaluation of the  input cell (the counts change, but 23 remains much lower).
> 
> Regards,
> Bernd

I'll explain as best I can what is going wrong, and how to work around
it. First let's explicitly replicate the problem. I'll use RandomArray
for convenience.

In[1]:= <<Statistics`

In[2]:= tg1 = RandomArray[GeometricDistribution[0.1], 10000];

In[3]:= InputForm[Table[{i,Count[tg1,i]}, {i,0,70}]]
Out[3]//InputForm= 
{{0, 990}, {1, 894}, {2, 794}, {3, 771}, {4, 636}, {5, 570}, {6, 540}, 
 {7, 479}, {8, 397}, {9, 389}, {10, 341}, {11, 305}, {12, 271}, {13,
274}, 
 {14, 243}, {15, 220}, {16, 215}, {17, 166}, {18, 176}, {19, 124}, {20,
117}, 
 {21, 97}, {22, 109}, {23, 49}, {24, 109}, {25, 79}, {26, 78}, {27, 56}, 
 {28, 48}, {29, 39}, {30, 48}, {31, 43}, {32, 33}, {33, 41}, {34, 22}, 
 {35, 23}, {36, 23}, {37, 11}, {38, 19}, {39, 18}, {40, 15}, {41, 16}, 
 {42, 8}, {43, 14}, {44, 14}, {45, 10}, {46, 3}, {47, 5}, {48, 10}, {49,
5}, 
 {50, 3}, {51, 7}, {52, 3}, {53, 5}, {54, 4}, {55, 6}, {56, 3}, {57, 0}, 
 {58, 1}, {59, 0}, {60, 0}, {61, 1}, {62, 0}, {63, 3}, {64, 0}, {65, 1}, 
 {66, 0}, {67, 1}, {68, 1}, {69, 2}, {70, 0}}

10x larger test:

In[4]:= tg1 = RandomArray[GeometricDistribution[0.1], 100000];

In[5]:= InputForm[Table[{i,Count[tg1,i]}, {i,0,70}]]
Out[5]//InputForm= 
{{0, 10113}, {1, 9173}, {2, 7937}, {3, 7239}, {4, 6600}, {5, 5812}, 
 {6, 5360}, {7, 4739}, {8, 4156}, {9, 3952}, {10, 3476}, {11, 3105}, 
 {12, 2817}, {13, 2566}, {14, 2358}, {15, 2026}, {16, 1857}, {17, 1639}, 
 {18, 1473}, {19, 1376}, {20, 1200}, {21, 1116}, {22, 1034}, {23, 500}, 
 {24, 863}, {25, 745}, {26, 714}, {27, 617}, {28, 565}, {29, 506}, {30,
481}, 
 {31, 396}, {32, 377}, {33, 345}, {34, 291}, {35, 272}, {36, 263}, {37,
230}, 
 {38, 177}, {39, 153}, {40, 131}, {41, 134}, {42, 128}, {43, 118}, {44,
98}, 
 {45, 82}, {46, 74}, {47, 58}, {48, 51}, {49, 59}, {50, 40}, {51, 42}, 
 {52, 40}, {53, 36}, {54, 28}, {55, 22}, {56, 26}, {57, 21}, {58, 19}, 
 {59, 19}, {60, 12}, {61, 13}, {62, 13}, {63, 11}, {64, 6}, {65, 9},
{66, 8}, 
 {67, 9}, {68, 7}, {69, 5}, {70, 4}}

Clearly, and reliably, something bad is happenning at i==23. To get some
feel for why this happened we need to have a look at the implementation
of this distribution (which is itself quite correct, by the way). It is
found in the Statistics directory, DiscreteDistributions.m file of the
Mathematica distribution. I show the relevant code below.

geometric = Compile[{{p, _Real}},
  Module[{i=0}, While[Random[] > p, i++];  i]]

GeometricDistribution/: Random[GeometricDistribution[p_]] :=
geometric[p]

GeometricDistribution/: RandomArray[GeometricDistribution[p_], dim_] :=
  Module[{m, array},
    m = If[VectorQ[dim], Apply[Times, dim], dim];
    array = Table[geometric[p], {m}];
    If[VectorQ[dim] && Length[dim] > 1,
           Fold[Partition[#1, #2]&, array, Reverse[Drop[dim, 1]] ],
           array  ]
  ] /; (IntegerQ[dim] && dim > 0) || VectorQ[dim, (IntegerQ[#] && # >
0)&]

The problem, surprisingly, is in Random[]. Before trying to say what
goes wrong I'll say a bit about the two Mathematica random number
generators and how they are used internally. 

(i) There is a cellular automaton (CA) generator, written by Stephen
Wolfram. It generates pseudorandom bit sequences up to length 30, hence
can generate integers up to 2^30-1.

(ii) There is a subtract-with-borrow (SWB) generator, based on work by
Marsaglia and Zaman. It falls in the general class of linear
congruential pseudorandom generators. Ours will generate 31 bits at a
time.

Here is how these are used. For generating small integers we use CA. If
the integer must be less than n, the number of bits generated is
Ceiling[Log[2,n]] and if the number is too large we discard it and try
again.

For large integers, we string together as many 31-bit entities as
needed, all generated with SWB, and then take whatever xtra bits are
needed from CA.

For bignum reals we use CA exclusively.

For machine reals (the case of interest in this note) we use SWB
exclusively. In particular we generate one such, divide by the largest
machine int of the same size and cast to machine double (so we have a
number in the range [0,1)), generate another, add them, and divide
again. The reason we need two such is to fill out the (typically) 53
bits of a machine double. I am simplifying a bit, the code is smart
enough to generate more if the native double precision structure so
requires. But I am aware of no platform for which this is the case.

Around August of '99 we first saw a draft of some benchmark statistical
testing under development at NIST. Among other things there were several
random number generator tests. These tests worked with large arrays of
pseudorandom 31 bit integers and machine doubles, and, as per above,
both of these sets are generated entirely with SWB in Mathematica. The
particular battery of tests is known by the name DIEHARD, and it was
developed by George Marsaglia. Somewhere in his web site we even found
some information to the effect that SWB generators are not going to pass
all of these tests. As it happened, Mathematica passed 13 of 15, which
is quite good, by the way. Not being good enough for us, however, we
looked further (that's how I know about the web site information) and we
found that the CA generator passes everything. Flawlessly. We tested
this by rigging a generator based on CA, that would generate appropriate
arrays for the test suite, which we then ran ourselves (I think it is
available from Marsaglia's site as well).

The specific tests that fail are known as the Birthday Spacings test
(based on a generalization of the so-called "birthday paradox") and the
Craps test. Though I am not certain, I believe the failure seen above is
related to that latter.

Since last August we learned of a couple of other failures, both in SWB.
At least one other was clearly, like that above, in the "tail" region of
the distribution; this is good news in that it would be more serious
were it elsewhere. The other failure only is apparent in very large big
integers, and I do not know exactly how to categorize it.

A reasonable question is, why do we use SWB? I do not know. Truth is,
the decision to do so predates my arrival at WRI (9/91) and also
predates our use of source-control software, so I can find no history.
version 2, my best guess is that it was installed because it is a bit
faster for generating arrays of doubles. But that is just speculation.

Another question is why we keep this SWB random number generator? The
answer, so far as I can discern, is that we want to maintain
replicability of results across releases. I am uncertain as to which
concern should get priority. Perhaps the best solution would be to have
a random number generator that is carved in store, others that can
change with improvements in technology, and a Method option to determine
which to use. Offhand I am not at all sure what direction we will take
with this.

The main issue is how one might, when necessary, work around the defects
in SWB. As noted above, CA never seems to fail, either on DIEHARD tests
or any other we've thrown its way. So one can simply withe a random
number generator for whatever class of numbers, using CA rather than
SWB. For example, to generate random machine doubles via CA, one can do

((Random[Integer,2^30-1]/2^30.)+
	Random[Integer,2^30-1])/2^30.

This works because, as noted above, CA is used to generate integers of
bit length 30 or less. AN alternative might be to generate a bignum and
coerce to machine precision, but that would not mesh well with Compile
(more on this below).

So how might we plug this into the various distributions in our add-on
statistics packages? Here is what I did for the geometric distribution.

Clear[Statistics`DiscreteDistributions`Private`geometric]

Statistics`DiscreteDistributions`Private`geometric =
  Compile[{{p,_Real}}, Module[{i=0},
    While[((Random[Integer,2^30-1]/2^30.)+
      Random[Integer,2^30-1])/2^30. > p, i++]; i]];

(I did it this way, rather than, say, defining a function myRandom[], so
that Compile would not require any function evaluations. Possibly there
is a cleaner way.)

So let's test this.

In[8]:= tg2 = RandomArray[GeometricDistribution[0.1], 10000];

In[9]:= InputForm[Table[{i,Count[tg2,i]}, {i,0,70}]]
Out[9]//InputForm= 
{{0, 1010}, {1, 863}, {2, 800}, {3, 686}, {4, 679}, {5, 585}, {6, 532}, 
 {7, 519}, {8, 436}, {9, 382}, {10, 339}, {11, 313}, {12, 269}, {13,
244}, 
 {14, 222}, {15, 187}, {16, 180}, {17, 169}, {18, 173}, {19, 126}, {20,
126}, 
 {21, 86}, {22, 106}, {23, 103}, {24, 77}, {25, 77}, {26, 64}, {27, 54}, 
 {28, 58}, {29, 49}, {30, 64}, {31, 46}, {32, 37}, {33, 35}, {34, 20}, 
 {35, 31}, {36, 19}, {37, 20}, {38, 27}, {39, 25}, {40, 30}, {41, 26}, 
 {42, 7}, {43, 14}, {44, 11}, {45, 6}, {46, 9}, {47, 8}, {48, 4}, {49,
9}, 
 {50, 4}, {51, 1}, {52, 7}, {53, 1}, {54, 1}, {55, 4}, {56, 1}, {57, 3}, 
 {58, 1}, {59, 1}, {60, 1}, {61, 1}, {62, 0}, {63, 0}, {64, 0}, {65, 3}, 
 {66, 1}, {67, 1}, {68, 1}, {69, 0}, {70, 2}}

10x larger test:

In[12]:= tg2 = RandomArray[GeometricDistribution[0.1], 100000];

In[13]:= InputForm[Table[{i,Count[tg2,i]}, {i,0,70}]]
Out[13]//InputForm= 
{{0, 9993}, {1, 9165}, {2, 8106}, {3, 7204}, {4, 6617}, {5, 5814}, {6,
5286}, 
 {7, 4722}, {8, 4240}, {9, 3876}, {10, 3531}, {11, 3197}, {12, 2858}, 
 {13, 2587}, {14, 2276}, {15, 1986}, {16, 1844}, {17, 1733}, {18, 1496}, 
 {19, 1367}, {20, 1170}, {21, 1109}, {22, 972}, {23, 855}, {24, 813}, 
 {25, 736}, {26, 669}, {27, 563}, {28, 573}, {29, 474}, {30, 409}, {31,
410}, 
 {32, 325}, {33, 293}, {34, 258}, {35, 260}, {36, 232}, {37, 187}, {38,
178}, 
 {39, 143}, {40, 168}, {41, 139}, {42, 120}, {43, 109}, {44, 110}, {45,
92}, 
 {46, 73}, {47, 62}, {48, 67}, {49, 51}, {50, 41}, {51, 35}, {52, 43}, 
 {53, 33}, {54, 36}, {55, 31}, {56, 20}, {57, 23}, {58, 24}, {59, 16}, 
 {60, 17}, {61, 20}, {62, 15}, {63, 23}, {64, 13}, {65, 11}, {66, 8}, 
 {67, 7}, {68, 5}, {69, 2}, {70, 10}}

While eyeballing the result is no substitute for a statistical check,
clearly this is looking alot healthier.


Daniel Lichtblau
Wolfram Research
(not speaking for my employer)
References:
- Random Geometric
  - From: bernd@bio.vu.nl (Bernd Brandt)
Prev by Date: results of pair creation contest
Next by Date: RE: Color Fill areas in 2D graphic
Previous by thread: Random Geometric
Next by thread: The most efficient Fibonacci algorithim?