MathGroup Archive: November 2011 [00084]

Re: Problems with DistributionFitTest

To: mathgroup at smc.vnet.net
Subject: [mg122639] Re: Problems with DistributionFitTest
From: DrMajorBob <btreat1 at austin.rr.com>
Date: Fri, 4 Nov 2011 06:00:36 -0500 (EST)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201111010502.AAA14754@smc.vnet.net>
Reply-to: drmajorbob at yahoo.com

I like this one:

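numTests = 1000;
First@Timing[
    (* sorted {pValue, rejected} pairs, with {0, 1} prepended as an anchor *)
    tests = Flatten[{{{0, 1}},
       Sort@Table[{t =
           DistributionFitTest@
            RandomVariate[NormalDistribution[], 10000],
          Boole[t <= .05]}, {numTests}]}, 1];
    (* empirical quantile function of the p-values *)
    f = Interpolation@
      Thread[{Range[0, numTests]/numTests, tests[[All, 1]]}];
    (* should stay near 0 if the p-values are uniform on [0, 1] *)
    Print@Plot[f@x - x, {x, 0, 1}, ImageSize -> 400];
    (* {mean p-value, fraction of runs with p <= 0.05} *)
    Print@N@Mean@Rest@tests
    ] "seconds"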
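
A rough companion check, in case a single number is easier to read than a
plot: collect the p-values and test them against UniformDistribution[{0, 1}],
which is what they should follow under the null hypothesis. This is only a
sketch. The seed, the smaller numTests and sampleSize, and the names pValues,
rejectionRate and uniformityP are my own choices for illustration, not
anything from the posts quoted below.

```mathematica
(* Sketch: do the p-values look uniform on [0, 1] when the null is true? *)
SeedRandom[1234];   (* arbitrary seed, only for reproducibility *)
numTests = 200;     (* assumption: fewer runs, to keep this quick *)
sampleSize = 1000;  (* assumption: smaller samples than in the thread *)

(* one p-value per simulated normal sample *)
pValues = Table[
   DistributionFitTest[
    RandomVariate[NormalDistribution[], sampleSize]],
   {numTests}];

(* fraction of runs rejected at the 5% level *)
rejectionRate = N[Count[pValues, p_ /; p <= 0.05]/numTests]

(* p-value of a fit test of the p-values themselves against uniform *)
uniformityP =
 DistributionFitTest[pValues, UniformDistribution[{0, 1}]]

(* bins of width 0.05 on [0, 1]; roughly flat if the p-values are uniform *)
Histogram[pValues, {0, 1, 0.05}]
```

If the p-values really are uniform, rejectionRate should land somewhere near
0.05 and the histogram should be roughly flat. Of course uniformityP is
itself a p-value, so it will also come out small now and then.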

Bobby

On Thu, 03 Nov 2011 03:46:19 -0500, Barrie Stokes  
<Barrie.Stokes at newcastle.edu.au> wrote:

> Hi Felipe
>
> Can I beg to make a small clarification to Andy's response?
>
> The whole idea of p values and rejection of the Null Hypothesis  
> continues to be one in which people get tangled up in logical and  
> linguistic knots.
>
> An observed p value does *not* allow one to make a *general* claim  
> like "about 3% of the time you can expect to get a test statistic like  
> the one you obtained or one even more extreme".
>
> In this context, the p value being 0.0312946, i.e., less than 0.05,  
> allows a frequentist-classical statistician to say that, *on this  
> occasion*, the observed p value enables one to reject the Null  
> Hypothesis (which is that the data are Gaussian) at the 5% significance  
> level, or some such equivalent phrase.
>
> The important thing here is that, *by construction*, p values are equal  
> to or less than 0.05 precisely 5% of the time *when the Null Hypothesis  
> holds, i.e., is in fact true*, or "under the Null Hypothesis", as it's  
> usually phrased.
>
> When one rejects the Null Hypothesis (having obtained a p value <= 0.05),  
> one is in fact betting that, in so doing, one will be wrong only  
> 1 time in 20.
>
> If anyone doesn't like this explication, please note that I am a  
> Bayesian, so for me to explain a p value is like George Bush explaining  
> the meaning of the French word 'entrepreneur'.  :-)
>
> (Apparently GB once claimed that the trouble with the French is that  
> they don't have a word for 'entrepreneur'. Actually, they do.)
>
> You may find the following code (built on your original code) helpful -  
> run it as many times as your patience allows.
>
> numTests = 1000;
> resultsList = {};
> Do[
>  (data = RandomVariate[NormalDistribution[], 10000];
>   AppendTo[ resultsList, DistributionFitTest[data] ];
>   ), {numTests}
>  ]
> resultsList // Short
> Length[ Select[ resultsList, (s \[Function] s <= 0.05) ] ]/numTests //  N
>
> Cheers
>
> Barrie
>
>
>
>>>> On 02/11/2011 at 10:23 pm, in message  
>>>> <201111021123.GAA03608 at smc.vnet.net>,
> Andy Ross <andyr at wolfram.com> wrote:
>> This is exactly what you might expect.  The p-value from a hypothesis
>> test is itself a random variable. Under the null hypothesis the p-value
>> should follow a UniformDistribution[{0,1}].
>>
>> In your case, the null hypothesis is that the data have been drawn from
>> a normal distribution. What that p-value is really saying is that about
>> 3% of the time you can expect to get a test statistic like the one you
>> obtained or one even more extreme.
>>
>> Andy Ross
>> Wolfram Research
>>
>>
>> On 11/1/2011 12:02 AM, fd wrote:
>>> Dear Group
>>>
>>> I'm not a specialist in statistics, but I spoke to one who found this
>>> behaviour dubious.
>>>
>>> Before using DistributionFitTest I was doing some tests with the
>>> normal distribution, like this
>>>
>>> data = RandomVariate[NormalDistribution[], 10000];
>>>
>>> DistributionFitTest[data]
>>>
>>> 0.0312946
>>>
>>> According to the documentation "A small p-value suggests that it is
>>> unlikely that the data came from dist", and that the test assumes the
>>> data is normally distributed
>>>
>>> I found this result for the p-value to be really low. If I re-run the
>>> code I often get what I would expect (a number greater than 0.5) but
>>> it is not at all rare to obtain p values smaller than 0.05 and even
>>> smaller. Through multiple re-runs I notice it fluctuates by orders of
>>> magnitude.
>>>
>>> The statistician I consulted with found this weird since the data was
>>> drawn from a normal distribution and the sample size is big,
>>> especially because the Pearson X2 test also fluctuates like this:
>>>
>>> H=DistributionFitTest[data, Automatic, "HypothesisTestData"];
>>>
>>> H["TestDataTable", All]
>>>
>>> Is this a real issue?
>>>
>>> Any thoughts?
>>>
>>> Best regards
>>> Felipe
>>>
>>>
>>>
>>>
>
>


-- 
DrMajorBob at yahoo.com