MathGroup Archive 2003

Re: Re: stats

  • To: mathgroup at smc.vnet.net
  • Subject: [mg39674] Re: Re: stats
  • From: Dr Bob <drbob at bigfoot.com>
  • Date: Sat, 1 Mar 2003 02:48:23 -0500 (EST)
  • References: <200302280950.EAA02989@smc.vnet.net>
  • Reply-to: drbob at bigfoot.com
  • Sender: owner-wri-mathgroup at wolfram.com

One might want to start at

http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm

Bobby

On Fri, 28 Feb 2003 04:50:04 -0500 (EST), Bill Rowe 
<listuser at earthlink.net> wrote:

> On 2/27/03 at 12:27 AM, marc.noel at skynet.be (Marc Noël) wrote:
>
>
>> I would like to know if anyone has written a small app that lets you
>> tell Mathematica to look at a set of data and output an answer like
>> "with a degree of confidence of X% this data can be considered as
>> normally or log-normally or Student, ... distributed". If not, can
>> anybody tell me how to use the tools available to achieve that kind of
>> result?
>
> This is actually a rather difficult problem. None of the tools in the 
> standard Mathematica distribution are designed to address this problem 
> directly.
>
> There are two approaches to solving this problem. First, you can design a
> test to compare your data to a specific distribution. For example,
> consider a normal distribution, whose skewness is 0 and whose kurtosis is 3.
> If the skewness and kurtosis of your data set were significantly
> different from these values, that would be evidence your data is non-normal.
> Confidence limits could be estimated from the sampling distributions of the
> skewness and kurtosis. Using statistics like skewness and kurtosis almost
> certainly isn't optimal, particularly for small samples. The problem is that
> these statistics involve high-order moments and are strongly affected by
> outliers in small data sets.
>
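As a rough sketch of the moment-based check described above, the sample skewness and kurtosis can be turned into approximate z-scores. This assumes a Mathematica version in which Skewness, Kurtosis and RandomVariate are built in (older versions would need the Statistics`DescriptiveStatistics` package), uses the large-sample standard errors Sqrt[6/n] and Sqrt[24/n] that hold under normality, and runs on simulated stand-in data. The names data, zSkew and zKurt are illustrative only.

```mathematica
(* Assumption: Skewness, Kurtosis and RandomVariate are built in.
   The data below is simulated purely for illustration. *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[10, 2], 200];
n = Length[data];

(* Under normality, skewness is near 0 and kurtosis near 3.
   Sqrt[6/n] and Sqrt[24/n] are large-sample standard errors. *)
zSkew = Skewness[data]/Sqrt[6./n];
zKurt = (Kurtosis[data] - 3)/Sqrt[24./n];

{zSkew, zKurt}
```

Values well outside roughly -2 to 2 would hint at non-normality. For small n these standard errors are crude, which is the weakness of the moment-based approach noted above.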
> The other basic approach is to compute a statistic like the
> Kolmogorov-Smirnov statistic. While this statistic is much more robust
> against outliers, it is also much less efficient than a test tailored to
> a specific distribution. It is probably a better choice than kurtosis and
> skewness for small data sets. The distribution of the KS statistic is
> known, which allows confidence limits to be estimated.
>
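A minimal sketch of the Kolmogorov-Smirnov approach might look like the following. It assumes a recent Mathematica version that provides KolmogorovSmirnovTest and RandomVariate, and a reference distribution whose parameters are fixed in advance rather than fitted from the same data. The data is simulated, and the names dist, cdfVals and ksStat are illustrative only.

```mathematica
(* Assumption: KolmogorovSmirnovTest and RandomVariate are available.
   Simulated stand-in data; replace with your own sample. *)
SeedRandom[2];
data = RandomVariate[NormalDistribution[0, 1], 100];
dist = NormalDistribution[0, 1];   (* fully specified beforehand *)
n = Length[data];

(* Model CDF evaluated at the sorted observations *)
cdfVals = CDF[dist, #] & /@ Sort[data];

(* KS statistic: largest gap between the empirical step function and
   the model CDF, checked just after and just before each jump *)
ksStat = Max[Max[Range[n]/n - cdfVals], Max[cdfVals - (Range[n] - 1)/n]];

(* With two arguments the built-in test returns a p-value *)
pValue = KolmogorovSmirnovTest[data, dist];

{ksStat, pValue}
```

If the parameters of dist are estimated from the same data, the standard distribution of the KS statistic no longer applies exactly, so the p-value should then be treated as approximate.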
>> In principle I think one should first look at the data distribution
>> then compare it to a standard (normal, log normal, ...).
>
> If I take "look" to mean plot the data distribution, then it is possible 
> to do both of these steps at once. In fact, plotting the data 
> distribution in an appropriate manner is probably far better than simply 
> computing a test statistic and estimating confidence limits. The general 
> idea is to construct a Q-Q plot of the data.
>
> Again using a normal distribution as an example, you could do the
> following:
>
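> (* distinct sorted values paired with their multiplicities *)
> d={First[#],Length[#]}&/@Split[Sort[data]];
> (* cumulative counts scaled by n+1, so every level lies strictly inside (0,1) *)
> f=Rest[FoldList[Plus,0,Last/@d]]/(Length[data]+1);
> (* data values vs. unit normal quantiles; in version 6 and later drop the final ; to display the plot *)
> ListPlot[MapThread[{First[#1],Quantile[NormalDistribution[0,1],#2]}&,{d,f}]];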
>
> What I've done here is compute the empirical cumulative distribution
> function of your data in f and plot the data values against the
> corresponding quantiles of the unit normal distribution. If the data is
> normally distributed, the points should fall on a straight line.
> Significant deviations would be an indication of non-normality.
>
> Basically, the idea is to plot quantiles computed from your data against 
> the expected quantiles of a given distribution. If the given distribution 
> is a good model for the data, then the resulting plot should be a 
> straight line.
>
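To illustrate the same idea for the normal versus log-normal question, here is a hedged sketch. It relies on the fact that data is log-normal exactly when its logarithm is normal, so both candidates can be checked against unit normal quantiles. It assumes positive-valued data and a Mathematica version with RandomVariate and Correlation built in. The simulated sample and the names z and qq are illustrative only.

```mathematica
(* Simulated positive-valued stand-in data; replace with your own sample. *)
SeedRandom[3];
data = RandomVariate[LogNormalDistribution[1, 0.5], 150];
n = Length[data];

(* Unit normal quantiles at the plotting positions i/(n+1) *)
z = Quantile[NormalDistribution[0, 1], #] & /@ (Range[n]/(n + 1.));

(* Pair each theoretical quantile with the matching sorted value *)
qq[values_] := Transpose[{z, Sort[values]}];

ListPlot[qq[data]]        (* normal candidate *)
ListPlot[qq[Log[data]]]   (* log-normal candidate *)

(* Correlation of each point set as a rough measure of straightness *)
{Correlation[z, Sort[data]], Correlation[z, Sort[Log[data]]]}
```

The candidate whose points lie closer to a straight line, or whose correlation is closer to 1, is the better description. The correlation is only a convenience summary and does not replace looking at the plots.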
> Assuming you lack any reason to select a particular distribution,
> plotting the data in this manner is probably always the best choice
> unless you have too many data sets for this to be manageable. Even if you
> do have reason to select a given distribution, plotting data in this
> manner is a good idea as a check on the data set.
>
> One last thought. While it is possible to base the choice of distribution
> on a test statistic such as the KS statistic and its confidence limits, this
> should always be a method of last resort. It is far, far better to choose
> the distribution based on knowledge of the physical problem and its
> characteristics than on the value of some test statistic.
>
>



-- 
majort at cox-internet.com
Bobby R. Treat


