Re: stats

*To*: mathgroup at smc.vnet.net
*Subject*: [mg39680] Re: stats
*From*: Bill Rowe <listuser at earthlink.net>
*Date*: Fri, 28 Feb 2003 04:50:04 -0500 (EST)
*Sender*: owner-wri-mathgroup at wolfram.com

On 2/27/03 at 12:27 AM, marc.noel at skynet.be (Marc Noël) wrote:

>I would like to know if any one has written a small app that allows to
>tell Mathematica to look at a set of data and output an answer like
>"with a degree of confidence of X% this data can be considered as
>normally or log normally or student, ... distributed." If not, can
>anybody tell me how to use the tools available to achieve that kind of
>result.

This is actually a rather difficult problem. None of the tools in the standard Mathematica distribution is designed to address it directly.

There are two approaches to solving this problem. First, you can design a test that compares your data to a specific distribution. For example, consider a normal distribution: its skewness is 0 and its kurtosis is 3. If the skewness and kurtosis of your data set were significantly different from these values, that would be evidence your data is non-normal. Confidence limits could be estimated from the sampling distributions of the skewness and kurtosis.

Using statistics like kurtosis and skewness almost certainly isn't optimal, particularly for small samples. The problem is that these statistics involve high-order moments and are strongly affected by outliers in small data sets.

The other basic approach is to compute a statistic like the Kolmogorov-Smirnov statistic. While this statistic is much more robust against outliers, it is also much less efficient than a test tailored to a specific distribution. It is probably a better choice than kurtosis and skewness for small data sets. The distribution of the KS statistic is known, which allows confidence limits to be estimated.

>In principle I think one should first look at the data distribution
>then compare it to a standard (normal, log normal, ...)

If I take "look" to mean plot the data distribution, then it is possible to do both of these steps at once.
In fact, plotting the data distribution in an appropriate manner is probably far better than simply computing a test statistic and estimating confidence limits. The general idea is to construct a Q-Q plot of the data. Again using a normal distribution as an example, you could do the following:

    (* distinct data values, each paired with its multiplicity *)
    d = {First[#], Length[#]} & /@ Split[Sort[data]];
    (* cumulative counts scaled by n+1, so every value lies strictly between 0 and 1 *)
    f = Rest[FoldList[Plus, 0, Last /@ d]]/(Length[data] + 1);
    (* each data value against the matching quantile of the unit normal *)
    ListPlot[MapThread[{First[#1], Quantile[NormalDistribution[0, 1], #2]} &, {d, f}]];

What I've done here is compute the empirical cumulative distribution function for your data in f and plot the data values against the corresponding quantiles of the unit normal distribution. If the data is normally distributed, this should plot as a straight line. Significant deviations would be an indication of non-normality.

Basically, the idea is to plot quantiles computed from your data against the expected quantiles of a given distribution. If the given distribution is a good model for the data, then the resulting plot should be a straight line.

Assuming you lack any reason to select a particular distribution, plotting the data in this manner is probably always the best choice unless you have too many data sets to make this manageable. Even if you do have reason to select a given distribution, plotting the data in this manner is a good idea as a check on the data set.

One last thought. While it is possible to base the choice of distribution on a test statistic such as the KS statistic and confidence limits, this should always be a method of last resort. It is far, far better to choose the distribution based on knowledge of the physical problem and its characteristics than on the value of some test statistic.
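
For reference, the two test-statistic approaches mentioned earlier (skewness/kurtosis and the Kolmogorov-Smirnov statistic) might be sketched roughly as follows for the normal case. This is only an illustration under stated assumptions: the sample in data is made up, the variable names are mine, the standard errors Sqrt[6/n] and Sqrt[24/n] are the usual large-sample approximations under normality, and 1.36/Sqrt[n] is the approximate 5% KS critical value for a fully specified distribution. Because the mean and standard deviation are fitted from the data here, the true critical value is smaller (the Lilliefors correction), so treat the comparison as a rough guide only.

```mathematica
(* Made-up sample, for illustration only *)
data = {4.1, 5.3, 4.8, 6.2, 5.0, 5.7, 4.4, 5.1, 5.9, 4.6};
n = Length[data];
m = Mean[data];

(* k-th central moment of the sample *)
centralMoment[k_] := Total[(data - m)^k]/n;

(* Sample skewness and kurtosis; a normal distribution has 0 and 3 *)
skew = centralMoment[3]/centralMoment[2]^(3/2);
kurt = centralMoment[4]/centralMoment[2]^2;

(* Rough z-scores using large-sample standard errors under normality *)
zSkew = skew/Sqrt[6./n];
zKurt = (kurt - 3)/Sqrt[24./n];

(* KS distance between the empirical CDF and a normal CDF with fitted parameters *)
x = Sort[data];
s = Sqrt[centralMoment[2]];
cdf = CDF[NormalDistribution[m, s], #] & /@ x;
ks = Max[Range[n]/n - cdf, cdf - (Range[n] - 1)/n];

(* Approximate 5% critical value (assumes a fully specified distribution, large n) *)
ksCrit = 1.36/Sqrt[n];

{skew, kurt, zSkew, zKurt, ks, ksCrit}
```

Values of zSkew or zKurt well beyond about 2 in magnitude, or ks larger than ksCrit, would hint at non-normality. With only ten points none of these checks has much power, which is one more reason to look at the Q-Q plot first.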