Re: Re: stats
- To: mathgroup at smc.vnet.net
- Subject: [mg39690] Re: [mg39680] [mg39680] Re: stats
- From: Dr Bob <drbob at bigfoot.com>
- Date: Sat, 1 Mar 2003 22:05:16 -0500 (EST)
- References: <200303010748.CAA09912@smc.vnet.net>
- Reply-to: drbob at bigfoot.com
- Sender: owner-wri-mathgroup at wolfram.com
The Anderson-Darling, K-S, and Chi-square goodness of fit tests give an
objective measure, however flawed. A plot can yield, at best, a gut
feeling.
That said, I like to have a gut feeling before spending time on objective
measures.
Bobby
On Sat, 1 Mar 2003 02:48:57 -0500 (EST), Bill Rowe <listuser at earthlink.net>
wrote:
> On 2/27/03 at 12:27 AM, marc.noel at skynet.be (Marc Noël) wrote:
>
>
>> I would like to know if anyone has written a small app that lets me
>> tell Mathematica to look at a set of data and output an answer like
>> "with a degree of confidence of X% this data can be considered
>> normally, log-normally, Student, ... distributed". If not, can
>> anybody tell me how to use the available tools to achieve that kind of
>> result?
>
> This is actually a rather difficult problem. None of the tools in the
> standard Mathematica distribution are designed to address this problem
> directly.
>
> There are two approaches to solving this problem. First, you design a
> test to compare your data to a specific distribution. For example,
> consider a normal distribution. The skewness is 0 and the kurtosis is 3.
> If the kurtosis and skewness for your data set were significantly
> different, that would be evidence your data is non-normal. Confidence
> limits could be estimated from the sampling distribution of the kurtosis
> and skewness. Using statistics like kurtosis and skewness almost
> certainly isn't optimal, particularly for small samples. The problem is
> that these statistics involve high-order moments and are strongly
> affected by outliers in small data sets.
>
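For what it's worth, here is a minimal sketch of the skewness/kurtosis check described above. It assumes a later Mathematica version (8 or newer) where Skewness, Kurtosis and RandomVariate are built in, and it uses simulated data as a stand-in for a real sample. The standard errors Sqrt[6/n] and Sqrt[24/n] are the usual large-sample approximations under normality, so the z-scores are only rough guides and are least reliable for exactly the small samples discussed above.

```mathematica
(* Assumption: simulated sample; replace with your own list of numbers *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[10, 2], 200];
n = Length[data];

(* Under normality and for large n, skewness is roughly N(0, 6/n)
   and kurtosis (non-excess, so 3 for a normal) is roughly N(3, 24/n) *)
zSkew = Skewness[data]/Sqrt[6./n];
zKurt = (Kurtosis[data] - 3)/Sqrt[24./n];

(* An absolute z-score well above 2 hints at non-normality
   at roughly the 5% level *)
{zSkew, zKurt}
```

The kurtosis estimate converges slowly, so treat zKurt with more suspicion than zSkew unless n is large.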
> The other basic approach is to compute a statistic like the
> Kolmogorov-Smirnov statistic. While this statistic is much more robust
> against outliers, it is also much less efficient than a test tailored to
> a specific distribution. It is probably a better choice than kurtosis and
> skewness for small data sets. The distribution of the KS statistic is
> known, which allows confidence limits to be estimated.
>
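A hedged sketch of the KS approach, again assuming Mathematica 8 or newer, where KolmogorovSmirnovTest and AndersonDarlingTest are built in and return a p-value by default. The helper fitNormal is hypothetical (not a built-in) and the data is simulated. One caveat: the parameters here are estimated from the same data being tested, which makes the standard KS p-value too conservative, so read the numbers as indicative only.

```mathematica
(* Assumption: simulated log-normal sample; replace with your own data *)
SeedRandom[2];
data = RandomVariate[LogNormalDistribution[0, 0.5], 100];

(* Hypothetical helper: normal distribution with moments matched to x *)
fitNormal[x_List] := NormalDistribution[Mean[x], StandardDeviation[x]];

(* p-value for "the raw data is normal" *)
pRaw = KolmogorovSmirnovTest[data, fitNormal[data]];

(* p-values for "the data is log-normal", i.e. Log[data] is normal *)
pLog = KolmogorovSmirnovTest[Log[data], fitNormal[Log[data]]];
pAD = AndersonDarlingTest[Log[data], fitNormal[Log[data]]];

{pRaw, pLog, pAD}
```

A small p-value is evidence against the candidate distribution; a large one only means the test found no evidence against it.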
>> In principle I think one should first look at the data distribution,
>> then compare it to a standard (normal, log normal, ...)
>
> If I take "look" to mean plot the data distribution, then it is possible
> to do both of these steps at once. In fact, plotting the data
> distribution in an appropriate manner is probably far better than simply
> computing a test statistic and estimating confidence limits. The general
> idea is to construct a Q-Q plot of the data.
>
> Again using a normal distribution as an example, you could do the
> following:
>
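> (* before version 6, first load <<Statistics`ContinuousDistributions` *)
> d={First[#],Length[#]}&/@Split[Sort[data]];  (* distinct values and counts *)
> f=Rest[FoldList[Plus,0,Last/@d]]/(Length[data]+1);  (* ECDF, kept below 1 *)
> (* data values against unit-normal quantiles; no trailing ; so the plot shows *)
> ListPlot[MapThread[{First[#1],Quantile[NormalDistribution[0,1],#2]}&,{d,f}]]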
>
> What I've done here is compute the empirical cumulative distribution
> function for your data in f, then plot the data values against the
> quantiles of the unit normal distribution at those probabilities. If the
> data is normally distributed, this should plot as a straight line.
> Significant deviations would be an indication of non-normality.
>
> Basically, the idea is to plot quantiles computed from your data against
> the expected quantiles of a given distribution. If the given distribution
> is a good model for the data, then the resulting plot should be a
> straight line.
>
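The same idea can be wrapped up for any candidate distribution. This is only a sketch, assuming Mathematica 8 or newer: qqPoints is a hypothetical helper name (not a built-in), the data is simulated, and plotting position i/(n+1) is used for the i-th sorted value, matching the n+1 scaling in the code above.

```mathematica
(* Hypothetical helper: pairs each sorted data value with the quantile
   of dist at plotting position i/(n+1) *)
qqPoints[data_List, dist_] :=
  Module[{sorted = Sort[data], n = Length[data]},
    Transpose[{sorted, Quantile[dist, #] & /@ (Range[n]/(n + 1.))}]];

(* Assumption: simulated log-normal sample; replace with your own data *)
SeedRandom[3];
data = RandomVariate[LogNormalDistribution[1, 0.4], 150];

(* Raw data against normal quantiles: expect visible curvature here *)
ListPlot[qqPoints[data, NormalDistribution[0, 1]]]

(* Log of the data against normal quantiles: expect a nearly straight
   line if the log-normal model is reasonable *)
ListPlot[qqPoints[Log[data], NormalDistribution[0, 1]]]
```

Later versions also ship a built-in QuantilePlot that draws this kind of plot directly, which may be preferable to rolling your own.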
> Assuming you lack any reason to select a particular distribution,
> plotting the data in this manner is probably always the best choice
> unless you have too many data sets to make this manageable. Even if you
> do have reason to select a given distribution, plotting data in this
> manner is a good idea as a check on the data set.
>
> One last thought. While it is possible to base the choice of distribution
> on a test statistic such as the KS statistic and confidence limits, this
> should always be a method of last resort. It is far, far better to choose
> the distribution based on knowledge of the physical problem and its
> characteristics instead of the value of some test statistic.
>
>
--
majort at cox-internet.com
Bobby R. Treat