Re: Re: stats

- *To*: mathgroup at smc.vnet.net
- *Subject*: [mg39690] Re: [mg39680] [mg39680] Re: stats
- *From*: Dr Bob <drbob at bigfoot.com>
- *Date*: Sat, 1 Mar 2003 22:05:16 -0500 (EST)
- *References*: <200303010748.CAA09912@smc.vnet.net>
- *Reply-to*: drbob at bigfoot.com
- *Sender*: owner-wri-mathgroup at wolfram.com

The Anderson-Darling, K-S, and chi-square goodness-of-fit tests give an objective measure, however flawed. A plot can yield, at best, a gut feeling.

That said, I like to have a gut feeling before spending time on objective measures.

Bobby

On Sat, 1 Mar 2003 02:48:57 -0500 (EST), Bill Rowe <listuser at earthlink.net> wrote:

> On 2/27/03 at 12:27 AM, marc.noel at skynet.be (Marc Noël) wrote:
>
>> I would like to know if anyone has written a small app that lets one
>> tell Mathematica to look at a set of data and output an answer like
>> "with a degree of confidence of X% this data can be considered as
>> normally or log-normally or Student, ... distributed". If not, can
>> anybody tell me how to use the tools available to achieve that kind of
>> result?
>
> This is actually a rather difficult problem. None of the tools in the
> standard Mathematica distribution are designed to address this problem
> directly.
>
> There are two approaches to solving this problem. First, you design a
> test to compare your data to a specific distribution. For example,
> consider a normal distribution. The skewness is 0 and the kurtosis is 3.
> If the kurtosis and skewness for your data set were significantly
> different, that would be evidence your data is non-normal. Confidence
> limits could be estimated from the sampling distribution of the kurtosis
> and skewness. Using statistics like kurtosis and skewness almost
> certainly isn't optimal, particularly for small samples. The problem is
> that these statistics involve high-order moments and are strongly
> affected by outliers in small data sets.
>
> The other basic approach is to compute a statistic like the
> Kolmogorov-Smirnov statistic. While this statistic is much more robust
> against outliers, it is also much less efficient than a specific test
> tailored to a specific distribution. It is probably a better choice than
> kurtosis and skewness for small data sets.
> The distribution of the KS statistic is known, allowing for estimation
> of confidence limits.
>
>> In principle I think one should first look at the data distribution
>> then compare it to a standard (normal, log normal, ...)
>
> If I take "look" to mean plot the data distribution, then it is possible
> to do both of these steps at once. In fact, plotting the data
> distribution in an appropriate manner is probably far better than simply
> computing a test statistic and estimating confidence limits. The general
> idea is to construct a Q-Q plot of the data.
>
> Again using a normal distribution as an example, you could do the
> following:
>
> d={First[#],Length[#]}&/@Split[Sort[data]];
> f=Rest[FoldList[Plus,0,Last/@d]]/(Length[data]+1);
> ListPlot[MapThread[{First[#1],Quantile[NormalDistribution[0,1],#2]}&,{d,f}]];
>
> What I've done here is compute the empirical cumulative distribution
> function for your data in f and plot the data values against the
> corresponding quantiles of the unit normal distribution. If the data is
> normally distributed, this should plot as a straight line. Significant
> deviations would be an indication of non-normality.
>
> Basically, the idea is to plot quantiles computed from your data against
> the expected quantiles of a given distribution. If the given distribution
> is a good model for the data, then the resulting plot should be a
> straight line.
>
> Assuming you lack any reason to select a particular distribution,
> plotting the data in this manner is probably always the best choice
> unless you have too many data sets to make this manageable. Even if you
> do have reason to select a given distribution, plotting data in this
> manner is a good idea as a check on the data set.
>
> One last thought. While it is possible to base the choice of distribution
> on a test statistic such as the KS statistic and confidence limits, this
> should always be a method of last resort.
> It is far, far better to choose
> the distribution based on knowledge of the physical problem and its
> characteristics instead of the value of some test statistic.

--
majort at cox-internet.com
Bobby R. Treat
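
For anyone who wants to try the two test-statistic approaches described in the quoted message, here is a minimal sketch, not taken from either poster. It computes the sample skewness and kurtosis (to compare against 0 and 3) and a Kolmogorov-Smirnov statistic against a normal distribution. The `data` values and the names `mu`, `skew`, `kurt`, `dist`, `cdf`, and `ks` are made up for illustration. It assumes a Mathematica version in which `NormalDistribution` and `CDF` are available without loading a package; older versions may need the statistics packages loaded first.

```mathematica
(* Hypothetical sample, for illustration only; replace with your own data *)
data = {4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.7, 5.0, 4.9, 5.4};

n = Length[data];
m = Mean[data];

(* k-th central moment, dividing by n (an assumed convention) *)
mu[k_] := Total[(data - m)^k]/n;

skew = mu[3]/mu[2]^(3/2);  (* 0 for a normal distribution *)
kurt = mu[4]/mu[2]^2;      (* 3 for a normal distribution *)

(* KS statistic against a normal distribution whose mean and standard
   deviation are estimated from the sample (an assumption of this sketch) *)
dist = NormalDistribution[m, Sqrt[mu[2]]];
cdf = CDF[dist, #] & /@ Sort[data];

(* Largest gap between the empirical step function and the model CDF,
   checked just after and just before each sorted data point *)
ks = Max[Max[Range[n]/n - cdf], Max[cdf - (Range[n] - 1)/n]];

{skew, kurt, ks}
```

One caveat: because the normal parameters here are estimated from the same data, the standard KS critical values are too lenient, so `ks` is best read as a rough indicator. Recent Mathematica versions also include built-in functions such as `KolmogorovSmirnovTest` and `QuantilePlot`, which may be preferable to a hand-rolled computation where available.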

**References**:
- **Re: stats** *From:* Bill Rowe <listuser@earthlink.net>