Re: Random Geometric (long)

*To*: mathgroup at smc.vnet.net*Subject*: [mg23392] Re: [mg23340] Random Geometric (long)*From*: Daniel Lichtblau <danl at wolfram.com>*Date*: Fri, 5 May 2000 02:07:30 -0400 (EDT)*References*: <200005040659.CAA17403@smc.vnet.net>*Sender*: owner-wri-mathgroup at wolfram.com

Bernd Brandt wrote: > > Dear All, > > With the random function, it is possible to take random numbers from a distribution. > I tried the GeometricDistribution and found something `strange', i think. > > The Plot of the frequency of numbers shows a lower number for 23: > > tg = Table[Random[GeometricDistribution[0.1]], {i, 1, 10000}]; > > Table[{i, Count[tg, i]}, {i, 21, 24}] > {{21, 95}, {22, 98}, {23, 48}, {24, 93}} > > {{21, 1102}, {22, 1015}, {23, 540}, {24, 877} for 100.000 Random numbers. > > Is this indeed something strange for a Random generator? > It still occurs with more Random numbers, and remains with repeated > evaluation of the input cell (the counts change, but 23 remains much lower). > > Regards, > Bernd I'll explain as best I can what is going wrong, and how to work around it. First let's explicitly replicate the problem. I'll use RandomArray for convenience. In[1]:= <<Statistics` In[2]:= tg1 = RandomArray[GeometricDistribution[0.1], 10000]; In[3]:= InputForm[Table[{i,Count[tg1,i]}, {i,0,70}]] Out[3]//InputForm= {{0, 990}, {1, 894}, {2, 794}, {3, 771}, {4, 636}, {5, 570}, {6, 540}, {7, 479}, {8, 397}, {9, 389}, {10, 341}, {11, 305}, {12, 271}, {13, 274}, {14, 243}, {15, 220}, {16, 215}, {17, 166}, {18, 176}, {19, 124}, {20, 117}, {21, 97}, {22, 109}, {23, 49}, {24, 109}, {25, 79}, {26, 78}, {27, 56}, {28, 48}, {29, 39}, {30, 48}, {31, 43}, {32, 33}, {33, 41}, {34, 22}, {35, 23}, {36, 23}, {37, 11}, {38, 19}, {39, 18}, {40, 15}, {41, 16}, {42, 8}, {43, 14}, {44, 14}, {45, 10}, {46, 3}, {47, 5}, {48, 10}, {49, 5}, {50, 3}, {51, 7}, {52, 3}, {53, 5}, {54, 4}, {55, 6}, {56, 3}, {57, 0}, {58, 1}, {59, 0}, {60, 0}, {61, 1}, {62, 0}, {63, 3}, {64, 0}, {65, 1}, {66, 0}, {67, 1}, {68, 1}, {69, 2}, {70, 0}} 10x larger test: In[4]:= tg1 = RandomArray[GeometricDistribution[0.1], 100000]; In[5]:= InputForm[Table[{i,Count[tg1,i]}, {i,0,70}]] Out[5]//InputForm= {{0, 10113}, {1, 9173}, {2, 7937}, {3, 7239}, {4, 6600}, {5, 5812}, {6, 5360}, {7, 4739}, {8, 4156}, {9, 3952}, {10, 3476}, {11, 3105}, {12, 2817}, {13, 2566}, {14, 2358}, {15, 2026}, {16, 1857}, {17, 1639}, {18, 1473}, {19, 1376}, {20, 1200}, {21, 1116}, {22, 1034}, {23, 500}, {24, 863}, {25, 745}, {26, 714}, {27, 617}, {28, 565}, {29, 506}, {30, 481}, {31, 396}, {32, 377}, {33, 345}, {34, 291}, {35, 272}, {36, 263}, {37, 230}, {38, 177}, {39, 153}, {40, 131}, {41, 134}, {42, 128}, {43, 118}, {44, 98}, {45, 82}, {46, 74}, {47, 58}, {48, 51}, {49, 59}, {50, 40}, {51, 42}, {52, 40}, {53, 36}, {54, 28}, {55, 22}, {56, 26}, {57, 21}, {58, 19}, {59, 19}, {60, 12}, {61, 13}, {62, 13}, {63, 11}, {64, 6}, {65, 9}, {66, 8}, {67, 9}, {68, 7}, {69, 5}, {70, 4}} Clearly, and reliably, something bad is happenning at i==23. To get some feel for why this happened we need to have a look at the implementation of this distribution (which is itself quite correct, by the way). It is found in the Statistics directory, DiscreteDistributions.m file of the Mathematica distribution. I show the relevant code below. geometric = Compile[{{p, _Real}}, Module[{i=0}, While[Random[] > p, i++]; i]] GeometricDistribution/: Random[GeometricDistribution[p_]] := geometric[p] GeometricDistribution/: RandomArray[GeometricDistribution[p_], dim_] := Module[{m, array}, m = If[VectorQ[dim], Apply[Times, dim], dim]; array = Table[geometric[p], {m}]; If[VectorQ[dim] && Length[dim] > 1, Fold[Partition[#1, #2]&, array, Reverse[Drop[dim, 1]] ], array ] ] /; (IntegerQ[dim] && dim > 0) || VectorQ[dim, (IntegerQ[#] && # > 0)&] The problem, surprisingly, is in Random[]. Before trying to say what goes wrong I'll say a bit about the two Mathematica random number generators and how they are used internally. (i) There is a cellular automaton (CA) generator, written by Stephen Wolfram. It generates pseudorandom bit sequences up to length 30, hence can generate integers up to 2^30-1. (ii) There is a subtract-with-borrow (SWB) generator, based on work by Marsaglia and Zaman. It falls in the general class of linear congruential pseudorandom generators. Ours will generate 31 bits at a time. Here is how these are used. For generating small integers we use CA. If the integer must be less than n, the number of bits generated is Ceiling[Log[2,n]] and if the number is too large we discard it and try again. For large integers, we string together as many 31-bit entities as needed, all generated with SWB, and then take whatever xtra bits are needed from CA. For bignum reals we use CA exclusively. For machine reals (the case of interest in this note) we use SWB exclusively. In particular we generate one such, divide by the largest machine int of the same size and cast to machine double (so we have a number in the range [0,1)), generate another, add them, and divide again. The reason we need two such is to fill out the (typically) 53 bits of a machine double. I am simplifying a bit, the code is smart enough to generate more if the native double precision structure so requires. But I am aware of no platform for which this is the case. Around August of '99 we first saw a draft of some benchmark statistical testing under development at NIST. Among other things there were several random number generator tests. These tests worked with large arrays of pseudorandom 31 bit integers and machine doubles, and, as per above, both of these sets are generated entirely with SWB in Mathematica. The particular battery of tests is known by the name DIEHARD, and it was developed by George Marsaglia. Somewhere in his web site we even found some information to the effect that SWB generators are not going to pass all of these tests. As it happened, Mathematica passed 13 of 15, which is quite good, by the way. Not being good enough for us, however, we looked further (that's how I know about the web site information) and we found that the CA generator passes everything. Flawlessly. We tested this by rigging a generator based on CA, that would generate appropriate arrays for the test suite, which we then ran ourselves (I think it is available from Marsaglia's site as well). The specific tests that fail are known as the Birthday Spacings test (based on a generalization of the so-called "birthday paradox") and the Craps test. Though I am not certain, I believe the failure seen above is related to that latter. Since last August we learned of a couple of other failures, both in SWB. At least one other was clearly, like that above, in the "tail" region of the distribution; this is good news in that it would be more serious were it elsewhere. The other failure only is apparent in very large big integers, and I do not know exactly how to categorize it. A reasonable question is, why do we use SWB? I do not know. Truth is, the decision to do so predates my arrival at WRI (9/91) and also predates our use of source-control software, so I can find no history. version 2, my best guess is that it was installed because it is a bit faster for generating arrays of doubles. But that is just speculation. Another question is why we keep this SWB random number generator? The answer, so far as I can discern, is that we want to maintain replicability of results across releases. I am uncertain as to which concern should get priority. Perhaps the best solution would be to have a random number generator that is carved in store, others that can change with improvements in technology, and a Method option to determine which to use. Offhand I am not at all sure what direction we will take with this. The main issue is how one might, when necessary, work around the defects in SWB. As noted above, CA never seems to fail, either on DIEHARD tests or any other we've thrown its way. So one can simply withe a random number generator for whatever class of numbers, using CA rather than SWB. For example, to generate random machine doubles via CA, one can do ((Random[Integer,2^30-1]/2^30.)+ Random[Integer,2^30-1])/2^30. This works because, as noted above, CA is used to generate integers of bit length 30 or less. AN alternative might be to generate a bignum and coerce to machine precision, but that would not mesh well with Compile (more on this below). So how might we plug this into the various distributions in our add-on statistics packages? Here is what I did for the geometric distribution. Clear[Statistics`DiscreteDistributions`Private`geometric] Statistics`DiscreteDistributions`Private`geometric = Compile[{{p,_Real}}, Module[{i=0}, While[((Random[Integer,2^30-1]/2^30.)+ Random[Integer,2^30-1])/2^30. > p, i++]; i]]; (I did it this way, rather than, say, defining a function myRandom[], so that Compile would not require any function evaluations. Possibly there is a cleaner way.) So let's test this. In[8]:= tg2 = RandomArray[GeometricDistribution[0.1], 10000]; In[9]:= InputForm[Table[{i,Count[tg2,i]}, {i,0,70}]] Out[9]//InputForm= {{0, 1010}, {1, 863}, {2, 800}, {3, 686}, {4, 679}, {5, 585}, {6, 532}, {7, 519}, {8, 436}, {9, 382}, {10, 339}, {11, 313}, {12, 269}, {13, 244}, {14, 222}, {15, 187}, {16, 180}, {17, 169}, {18, 173}, {19, 126}, {20, 126}, {21, 86}, {22, 106}, {23, 103}, {24, 77}, {25, 77}, {26, 64}, {27, 54}, {28, 58}, {29, 49}, {30, 64}, {31, 46}, {32, 37}, {33, 35}, {34, 20}, {35, 31}, {36, 19}, {37, 20}, {38, 27}, {39, 25}, {40, 30}, {41, 26}, {42, 7}, {43, 14}, {44, 11}, {45, 6}, {46, 9}, {47, 8}, {48, 4}, {49, 9}, {50, 4}, {51, 1}, {52, 7}, {53, 1}, {54, 1}, {55, 4}, {56, 1}, {57, 3}, {58, 1}, {59, 1}, {60, 1}, {61, 1}, {62, 0}, {63, 0}, {64, 0}, {65, 3}, {66, 1}, {67, 1}, {68, 1}, {69, 0}, {70, 2}} 10x larger test: In[12]:= tg2 = RandomArray[GeometricDistribution[0.1], 100000]; In[13]:= InputForm[Table[{i,Count[tg2,i]}, {i,0,70}]] Out[13]//InputForm= {{0, 9993}, {1, 9165}, {2, 8106}, {3, 7204}, {4, 6617}, {5, 5814}, {6, 5286}, {7, 4722}, {8, 4240}, {9, 3876}, {10, 3531}, {11, 3197}, {12, 2858}, {13, 2587}, {14, 2276}, {15, 1986}, {16, 1844}, {17, 1733}, {18, 1496}, {19, 1367}, {20, 1170}, {21, 1109}, {22, 972}, {23, 855}, {24, 813}, {25, 736}, {26, 669}, {27, 563}, {28, 573}, {29, 474}, {30, 409}, {31, 410}, {32, 325}, {33, 293}, {34, 258}, {35, 260}, {36, 232}, {37, 187}, {38, 178}, {39, 143}, {40, 168}, {41, 139}, {42, 120}, {43, 109}, {44, 110}, {45, 92}, {46, 73}, {47, 62}, {48, 67}, {49, 51}, {50, 41}, {51, 35}, {52, 43}, {53, 33}, {54, 36}, {55, 31}, {56, 20}, {57, 23}, {58, 24}, {59, 16}, {60, 17}, {61, 20}, {62, 15}, {63, 23}, {64, 13}, {65, 11}, {66, 8}, {67, 7}, {68, 5}, {69, 2}, {70, 10}} While eyeballing the result is no substitute for a statistical check, clearly this is looking alot healthier. Daniel Lichtblau Wolfram Research (not speaking for my employer)

**References**:**Random Geometric***From:*bernd@bio.vu.nl (Bernd Brandt)