MathGroup Archive: March 2003 [00035]

Re: Re: stats

To: mathgroup at smc.vnet.net
Subject: [mg39690] Re: [mg39680] [mg39680] Re: stats
From: Dr Bob <drbob at bigfoot.com>
Date: Sat, 1 Mar 2003 22:05:16 -0500 (EST)
References: <200303010748.CAA09912@smc.vnet.net>
Reply-to: drbob at bigfoot.com
Sender: owner-wri-mathgroup at wolfram.com

The Anderson-Darling, K-S, and Chi-square goodness of fit tests give an 
objective measure, however flawed.  A plot can yield, at best, a gut 
feeling.

That said, I like to have a gut feeling before spending time on objective 
measures.

Bobby

On Sat, 1 Mar 2003 02:48:57 -0500 (EST), Bill Rowe <listuser at earthlink.net> 
wrote:

> On 2/27/03 at 12:27 AM, marc.noel at skynet.be (Marc Noël) wrote:
>
>
>> I would like to know if anyone has written a small app that tells
>> Mathematica to look at a set of data and output an answer like
>> "with a degree of confidence of X% this data can be considered as
>> normally or log-normally or Student, ... distributed". If not, can
>> anybody tell me how to use the tools available to achieve that kind of
>> result?
>
> This is actually a rather difficult problem. None of the tools in the 
> standard Mathematica distribution are designed to address this problem 
> directly.
>
> There are two approaches to solving this problem. First, you design a 
> test to compare your data to a specific distribution.  For example, 
> consider a normal distribution. The skewness is 0 and the kurtosis is 3. 
> If the kurtosis and skewness for your data set were significantly 
> different, that would be evidence your data is non-normal. Confidence 
> limits could be estimated from the sampling distribution of the kurtosis 
> and skewness. Using statistics like kurtosis and skewness almost 
> certainly isn't optimal, particularly for small samples. The problem is 
> that these statistics involve high-order moments and are strongly 
> affected by outliers in small data sets.
>
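
A minimal sketch of the moment-based check described above, assuming the
usual large-sample approximations under normality (variance of about 6/n
for the sample skewness and 24/n for the sample kurtosis). The name
momentCheck is a hypothetical helper, not a built-in, and only core
functions are used so that no statistics package is required.

```mathematica
(* momentCheck: hypothetical helper name, not part of Mathematica.    *)
(* Returns approximate z-scores for skewness and kurtosis, assuming   *)
(* the large-sample variances 6/n and 24/n that hold under normality. *)
momentCheck[data_List] := Module[{n = Length[data], m, c, m2, g1, b2},
  m = Apply[Plus, data]/n;               (* sample mean *)
  c = data - m;                          (* centered values *)
  m2 = Apply[Plus, c^2]/n;               (* second central moment *)
  g1 = (Apply[Plus, c^3]/n)/m2^(3/2);    (* skewness: 0 for a normal *)
  b2 = (Apply[Plus, c^4]/n)/m2^2;        (* kurtosis: 3 for a normal *)
  N[{g1/Sqrt[6/n], (b2 - 3)/Sqrt[24/n]}]
]
```

Calling momentCheck[data] returns the pair {zSkewness, zKurtosis}. Values
well beyond about 2 in magnitude would be evidence against normality, but
as noted above the approximation is unreliable for small samples.
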
> The other basic approach is to compute a statistic like the 
> Kolmogorov-Smirnov statistic. While this statistic is much more robust 
> against outliers, it is also much less efficient than a test tailored to 
> a specific distribution. It is probably a better choice than kurtosis and 
> skewness for small data sets. The distribution of the KS statistic is 
> known, allowing estimation of confidence limits.
>
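
As a rough illustration of the KS approach, the statistic itself can be
computed directly: it is the largest vertical gap between the empirical
step function and the hypothesized CDF. The name ksStatistic is a
hypothetical helper, and the sketch assumes that CDF and the distribution
object are available (as the quoted code further down already assumes for
Quantile and NormalDistribution) and that the distribution is fully
specified in advance rather than fitted to the same data.

```mathematica
(* ksStatistic: hypothetical helper name, not a built-in.             *)
(* dist must be fully specified beforehand; if its parameters are     *)
(* estimated from the same data, the standard KS limits do not apply. *)
ksStatistic[data_List, dist_] := Module[{n = Length[data], cdf},
  cdf = Map[CDF[dist, #] &, Sort[data]]; (* model CDF at ordered data *)
  (* the empirical CDF jumps from (i-1)/n to i/n at the i-th ordered  *)
  (* value, so the gap is checked on both sides of every jump         *)
  N[Max[Max[Range[n]/n - cdf], Max[cdf - (Range[n] - 1)/n]]]
]
```

For example, ksStatistic[data, NormalDistribution[0, 1]] compares the data
to the unit normal. For large n, a value above roughly 1.36/Sqrt[n]
corresponds to rejection at about the 5% level.
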
>> In principle I think one should first look at the data distribution,
>> then compare it to a standard (normal, log normal, ...)
>
> If I take "look" to mean plot the data distribution, then it is possible 
> to do both of these steps at once. In fact, plotting the data 
> distribution in an appropriate manner is probably far better than simply 
> computing a test statistic and estimating confidence limits. The general 
> idea is to construct a Q-Q plot of the data.
>
> Again using a normal distribution as an example, you could do the 
> following:
>
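> (* distinct sorted values paired with their multiplicities *)
> d={First[#],Length[#]}&/@Split[Sort[data]];
> (* cumulative counts divided by n+1, so each level lies inside (0,1) *)
> f=Rest[FoldList[Plus,0,Last/@d]]/(Length[data]+1);
> (* pair each data value with the unit normal quantile at its level *)
> ListPlot[MapThread[{First[#1],Quantile[NormalDistribution[0,1],#2]}&,{d,f}]]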
>
> What I've done here is compute the empirical cumulative distribution 
> function for your data in f and plot the data values against the 
> quantiles of the unit normal distribution at those probabilities. If the 
> data is normally distributed, the points should fall on a straight line. 
> Significant deviations would be an indication of non-normality.
>
> Basically, the idea is to plot quantiles computed from your data against 
> the expected quantiles of a given distribution. If the given distribution 
> is a good model for the data, then the resulting plot should be a 
> straight line.
>
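
The same idea can be sketched for an arbitrary reference distribution. The
name qqPoints is a hypothetical helper, and the sample values are
arbitrary numbers chosen only to make the example run. It uses the same
i/(n+1) scaling as the quoted code, but unlike that code it does not merge
tied values.

```mathematica
(* qqPoints: hypothetical helper name, not a built-in.                *)
(* Pairs each ordered data value with the quantile of dist at the     *)
(* plotting position i/(n+1).                                         *)
qqPoints[data_List, dist_] := Module[{s = Sort[data], n = Length[data]},
  Transpose[{s, Map[Quantile[dist, #] &, N[Range[n]/(n + 1)]]}]
]

(* arbitrary illustrative values, not real measurements *)
sample = {1.2, 0.4, 3.1, 0.9, 2.2, 1.7, 0.6, 5.3};

(* log-normal check: take logs, then compare with the unit normal *)
ListPlot[qqPoints[Log[sample], NormalDistribution[0, 1]]]
```

Taking logs first is one way to check for a log-normal distribution,
assuming all data values are positive: if the data is log-normal, its
logarithm is normal and the plot should again be close to a straight line.
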
> Assuming you lack any reason to select a particular distribution, 
> plotting the data in this manner is probably always the best choice 
> unless you have too many data sets for this to be manageable. Even if 
> you do have reason to select a given distribution, plotting the data in 
> this manner is a good idea as a check on the data set.
>
> One last thought. While it is possible to base the choice of distribution 
> on a test statistic such as the KS statistic and confidence limits, this 
> should always be a method of last resort. It is far, far better to choose 
> the distribution based on knowledge of the physical problem and its 
> characteristics rather than on the value of some test statistic.
>
>



-- 
majort at cox-internet.com
Bobby R. Treat
