Re: Warning -- another Random[] failure
- To: mathgroup at smc.vnet.net
- Subject: [mg42051] Re: [mg42014] Warning -- another Random[] failure
- From: Bobby Treat <drmajorbob at mailblocks.com>
- Date: Tue, 17 Jun 2003 05:43:46 -0400 (EDT)
- Sender: owner-wri-mathgroup at wolfram.com
This gets rid of most of the problem: F[1, n2_] := Table[1, {n2 + 1}]; F[n1_, n2_] /; n1 > n2 := F[n2, n1] F[n1_, n2_] /; n1 <= n2 := F[n1, n2] = Join[Table[0, {n2}], F[ n1 - 1, n2]] + Join[F[n1, n2 - 1], Table[0, {n1}]] n1 = n2 = 14; reps = 2*10^5; f = Table[0, {n1*n2 + 1}]; L = 1 - n1(n1 + 1)/2; myRandom := ((Random[Integer, 2^30 - 1]/2^30.) + Random[Integer, 2^30 - 1])/2^30. Do[i = First[Plus @@ Position[ Last[Transpose[Sort[Join[Table[{myRandom, 0}, {n1}], Table[{myRandom, 1}, {n2}]]]]], 0]] + L; f[[i]]++, {reps}] Bobby -----Original Message----- From: Ray Koopman <koopman at sfu.ca> To: mathgroup at smc.vnet.net Subject: [mg42051] [mg42014] Warning -- another Random[] failure SUMMARY In the course of some Monte Carlo work with the Mann-Whitney U statistic, I have encountered another failure of Random[]: the empirical sampling distribution of U in the continuous null case differs systematically from its theoretical distribution whenever the sample sizes are large enough, where "large enough" depends on how many calls to Random[] each sample value needs. If each value uses only a single Random[] then the empirical distribution of U will be bad if n1 + n2 > 24; if each value needs two calls to Random[] -- e.g., if we use Random[]-Random[] to sample from a symmetric triangular distribution on (-1,1) -- then the 24 becomes 12; etc. DETAILS The following code gets the exact theoretical frequency distribution of U, normalized to sum to Multinomial[n1,n2]. It assumes n1 <= n2. F[1,n2_] := Table[1,{n2+1}]; F[n1_,n2_] := F[n1,n2] = Join[Table[0,{n2}],F[n1-1,n2]] + Join[F[Min[n1,n2-1],Max[n1,n2-1]],Table[0,{n1}]] The following code creates an empirical frequency distribution of U for "reps" pairs of samples, of sizes n1 and n2, from a Uniform(0,1) population. Note that it first gets Wilcoxon's rank-sum statistic, then converts to U+1 by adding L. f = Table[0,{n1*n2+1}]; L = 1 - n1(n1+1)/2; Do[i = First[Plus@@Position[Last[Transpose[Sort[Join[ Table[{Random[],0},{n1}], Table[{Random[],1},{n2}]]]]],0]] + L; f[[i]]++, {reps}] The expected frequencies are e = F[n1,n2]*reps/Plus@@F[n1,n2]; the Pearson residuals and chi-square (df = n1*n2) are z = (f-e)/Sqrt[e]; chisqr = N[z.z] To see the problem, ListPlot[Transpose[{Range[0,n1*n2],z}], PlotRange->All, Frame->True] The n1*n2+1 residuals in z should approximate independent standard normal deviates, and the display should look "random", with no structure. But consider the following empirical distribution, obtained from the above code with n1 = n2 = 14, reps = 2*10^6: f = {0,0,0,0,1,0,1,0,3,2,6,7,3,3,3,7,14,11,19,17,16,26,45, 47,57,85,92,127,133,142,203,207,270,338,375,416,492,568, 688,787,924,1065,1203,1391,1528,1709,2040,2199,2477,2738, 3152,3502,3825,4276,4787,5326,5830,6312,6922,7599,8168, 8752,9532,10181,11097,11537,12616,13406,14399,15207,16291, 17092,18196,19125,20027,21398,22428,23380,24112,25098, 26199,27319,28448,29375,29922,30638,31633,32225,32758, 33652,34180,34892,35479,35450,35968,35942,36389,36431, 36628,36449,36431,35824,35863,35534,35247,34390,34110, 33735,33158,32069,31358,30521,29859,28884,28133,27173, 26164,25055,24438,23256,22263,21075,20180,19540,18248, 17170, 16064,15327,14473,13442,12662,11607,11121,10230, 9441,8809,8091,7415,6848,6193,5669,5187,4660,4213,3732, 3408,3102,2830,2484,2196,1992,1763,1536,1367,1232,1029, 883,821,682,603,507,443,386,328,269,202,186,155,130,96, 99,69,69,56,34,31,26,25,15,11,15,4,5,3,5,2,2,1,1,0,0,0, 0,0,0,0,0}; This f gives chisqr = 653.677, which is far too big, even after allowing for expecteds that are less than 5. More importantly, the residual plot is strongly patterned: it looks like a "w", and this shape seems to characterize these plots whenever the sample sizes are as noted above and the number of reps is large enough to let the effect be seen.