Re: R: I: Re: Kolmogorov Smirnov in two or more dimensions is in Mathematica 8.0.4

*To*: mathgroup at smc.vnet.net*Subject*: [mg124993] Re: R: I: Re: Kolmogorov Smirnov in two or more dimensions is in Mathematica 8.0.4*From*: Andy Ross <andyr at wolfram.com>*Date*: Thu, 16 Feb 2012 03:29:31 -0500 (EST)*Delivered-to*: l-mathgroup@mail-archive0.wolfram.com

I don't believe it is fixed in 8.0.4. I discovered the issue too late for the change to make it in. I can't stress enough that all of the tests based on the empirical distribution function are marginal tests. A marginal-based test can only be used to reject the null hypothesis of goodness of fit. It does not provide adequate evidence for a good fit to the joint distribution because dependency structure is missed. This is true whether you are comparing data to a distribution or two data sets. Here is a function that will compute marginal-based test statistics and p-values more quickly than the Monte-Carlo based ones that are in 8.0.4. I suggest you use this if such tests are adequate for your purposes. marginalMVTest[data1_, data2_, test_] := Block[{t, p, dim = Length[data1[[1]]]}, {t, p} = Transpose[ Table[DistributionFitTest[data1[[All, i]], data2[[All, i]], "TestData"], {i, dim}]]; {"Test" -> test, "T" -> Mean[t], "P-Value" -> CDF[UniformSumDistribution[dim], Total@p] ] As a first example take some data drawn from two very different distributions. data1 = data2 = RandomVariate[BinormalDistribution[.9], 100]; data2 = RandomVariate[BinormalDistribution[-.9], 100]; In[47]:= marginalMVTest[data1, data2, "KolmogorovSmirnov"] Out[47]= {"Test" -> "KolmogorovSmirnov", "T" -> 0.0421055, "P-Value" -> 0.917708} Notice we would claim that we cannot reject the null hypothesis that data1 and data2 were drawn from the same distribution. However, this is obviously an error. The reason this happens is because marginally, both distributions are NormalDistribution[0,1]. However, if distributions differ marginally they also differ jointly. Thus the following does what we want. data1 = RandomVariate[ProductDistribution[NormalDistribution[], NormalDistribution[2, 1]], 100]; data2 = RandomVariate[BinormalDistribution[0], 100]; In[79]:= marginalMVTest[data1, data2, "KolmogorovSmirnov"] Out[79]= {"Test" -> "KolmogorovSmirnov", "T" -> 0.973367, "P-Value" -> 0.00918411} As I said in the previous response, the Szekely-Energy test in DistributionFitTest will be a better comparison. You will just need to wait a while for it to compute. If you have two matrices data1 and data2 this can be accomplished via DistributionFitTest[data1, data2, {"TestDataTable","SzekelyEnergy"},Method->{"MonteCarlo","MonteCarloSamples"->1000}] I've shown this with sub-option "MonteCarloSamples". You can increase this number to improve the estimate of the p-value. I warn you though, 10^3 is already pretty slow. Hopefully some of this is useful to you. Sorry for any inconvenience this might have caused you. -Andy On 2/15/2012 1:13 PM, maria giovanna dainotti wrote: > Dear Andy Ross, > I have just got information of a new release Mathematica 8.0.4. > Do you know if the bug of Kolmogorv Smirnov in two dimension is fixed > in that version? > Unfortunately, for my purpose it is better to use a statistical tools > that make comparison directly with the data not assuming any stastistic. > > I would be very grateful if you could let me know > Best regards, > Maria > > > --- *Mer 8/2/12, maria giovanna dainotti > /<mariagiovannadainotti at yahoo.it>/* ha scritto: > > > Da: maria giovanna dainotti <mariagiovannadainotti at yahoo.it> > Oggetto: I: Re: Kolmogorov Smirnov in two or more > dimensions > A: > Data: Mercoled=EC 8 febbraio 2012, 20:58 > > > > --- *Mer 8/2/12, Andy Ross /<andyr at wolfram.com>/* ha scritto: > > > Da: Andy Ross <andyr at wolfram.com> > Oggetto: Re: Kolmogorov Smirnov in two or more > dimensions > A: "maria giovanna dainotti" <mariagiovannadainotti at yahoo.it> > Cc: mathgroup at smc.vnet.net > Data: Mercoled=EC 8 febbraio 2012, 17:09 > > This is a bug in KolmogorovSmirnovTest that will be fixed in > later versions of Mathematica. For the time being you can use > CramerVonMisesTest or DistributionFitTest. > > Note that DistributionFitTest uses a little known test > referred to as Szekely-Energy. A description can be found in... > > Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal > Distributions in High Dimension, InterStat, > November (5) > > P-values for this test are computed via Monte-Carlo simulation > at the moment so it tends to be slow. > > The good thing about this test is that it doesn't just look at > things marginally. All of the other tests currently available > perform tests on the marginal data and then aggregate the > statistics. Thus any differences in covariance structure will > be missed. > > The take home message is that if things don't fit marginally > then they don't fit jointly but if they do seem to fit well > marginally you still need to dig deeper to determine whether > the joint distributions are equivalent. > > Andy Ross > Wolfram Research > > On 2/8/2012 4:32 AM, maria giovanna dainotti wrote: > > Dear Mathematica group, > > I am doing an analysis of the Kolmogorov with two data sets > of 2 dimensions each. > > When I apply the KolmogorovSmirnovTest[Data0,Data0] I should > get 1. I did just a trial and I got 0. > > I am copying the datafile example for clarity. > > {{0.97304, 14.1829}, {0.98663, 14.1295}, {0.98284, 14.3172}, > {0.81423, > > 14.3466}, {0.97303, 14.5966}, {0.87122, 14.8435}, {0.90252, > > 14.9036}, {0.81887, 15.1177}, {1.07722, 14.6849}, {0.86684, > > 15.6456}, {0.86664, 14.7034}, {0.78728, 15.0898}, {1.10336, > > 14.4085}, {1.12014, 14.7281}, {0.95923, 14.4988}, {0.89942, > > 15.3173}, {0.83841, 14.8422}, {0.99105, 14.8813}, {1.111, > > 14.5964}, {0.93255, 15.5019}, {1.03142, 14.8009}, {1.00661, > > 15.0827}, {0.93255, 15.3064}, {1.10023, 14.7189}, {0.8797, > > 15.1038}, {1.0013, 14.6755}, {0.87673, 15.0952}, {0.84131, > > 15.7345}, {1.06392, 15.3528}, {1.00138, 14.9835}, {0.77803, > > 15.4637}, {0.76795, 15.3611}, {0.98328, 15.1047}, {0.89193, > > 14.8321}, {0.72882, 15.909}, {0.77123, 15.7902}, {0.86218, > > 15.6637}, {0.84381, 15.536}, {0.99263, 15.8903}, {1.0805, > > 15.1453}, {0.85316, 15.7793}, {0.85186, 15.9119}, {1.10898, > > 15.7583}, {1.03365, 15.7393}, {0.84783, 16.1911}, {1.10979, > > 16.0031}, {1.05238, 16.015}, {0.90259, 16.4864}, {0.84963, > > 15.9818}, {1.09221, 16.0088}, {1.0443, 15.8326}, {0.8945, > > 16.1927}, {0.83015, 16.2776}, {0.7551, 16.5538}, {1.05947, > > 15.9138}, {1.06189, 15.6061}, {1.05889, 16.0743}, {0.85216, > > 16.1568}, {0.72597, 16.7657}, {0.93638, 16.3583}, {0.81968, > > 16.2711}, {0.95022, 16.3549}, {1.04536, 16.18}, {1.0786, > > 16.0967}, {0.9385, 15.8504}, {0.95024, 15.9338}, {0.76753, > > 17.1461}, {1.18224, 16.0437}, {0.96447, 16.4908}, {0.98235, > > 16.0892}, {1.06151, 16.699}, {0.79052, 16.5207}, {1.15863, > > 16.3223}, {1.00795, 16.2444}, {1.07284, 16.6536}, {1.04796, > > 16.865}, {0.84226, 16.7247}, {1.04712, 16.2673}} > > > > Or maybe should I use some other conditions? > > Or if it doesn't work is there an already built package that > I can use? > > > > Thanks a lot for your help > > > > Best regards, > > Maria >