MathGroup Archive: July 2010 [00483]


Re: Kolmogorov-Smirnov 2-sample test

To: mathgroup at smc.vnet.net
Subject: [mg111119] Re: Kolmogorov-Smirnov 2-sample test
From: Bill Rowe <readnews at sbcglobal.net>
Date: Wed, 21 Jul 2010 07:11:30 -0400 (EDT)

On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:

>Here is some code written by Andy Ross at Wolfram for the two
>sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>the test statistic, and KSBootstrapPValue provides a bootstrap
>estimate of the p-value given the two data sets, the number of
>simulations for the estimate and the test statistic.

>In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

>In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>n2 = Length[data2], T},
>e1 = empiricalCDF[sd1, #] & /@ udat;
>e2 = empiricalCDF[sd2, #] & /@ udat;
>T = Max[Abs[e1 - e2]];
>(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>]

After looking at your code above, I realized I posted a very bad
solution to this problem. However, it looks to me like there is
also a problem with this code. The returned result

(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T

seems to have an extra factor in it, specifically 1/Sqrt[n1].
(The expression simplifies to Sqrt[n2/(n1 + n2)] T, which is not
symmetric in n1 and n2.) Since n1 is the number of samples in the
first data set, including this factor means you will get a
different result by interchanging the order of the arguments to
the function whenever the two data sets have different numbers
of samples. Since the KS statistic is based on the maximum
absolute difference between the empirical CDFs, the order in
which the data sets are passed to the function should not matter.
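
To illustrate the point, here is a minimal sketch of the statistic
with the 1/Sqrt[n1] factor removed, assuming the intended scaling is
Sqrt[(n1*n2)/(n1 + n2)] T. The name ks2Symmetric and the sample lists
a and b are made up for this example and are not part of the posted
code; the final comparison checks that swapping the arguments gives
the same value.

```mathematica
(* Empirical CDF: fraction of the data points that are <= x *)
empiricalCDF[data_, x_] := Count[data, y_ /; y <= x]/Length[data]

(* Hypothetical variant of KolmogorovSmirnov2Sample without the
   1/Sqrt[n1] factor; assumes the scaling Sqrt[n1 n2/(n1 + n2)] *)
ks2Symmetric[data1_, data2_] :=
 Module[{udat = Union[Flatten[{data1, data2}]],
   n1 = Length[data1], n2 = Length[data2], e1, e2, t},
  e1 = empiricalCDF[data1, #] & /@ udat;  (* ECDF of sample 1 on pooled points *)
  e2 = empiricalCDF[data2, #] & /@ udat;  (* ECDF of sample 2 on pooled points *)
  t = Max[Abs[e1 - e2]];                  (* maximum absolute difference *)
  Sqrt[(n1*n2)/(n1 + n2)] t]

(* Illustrative data sets of unequal length *)
a = {1, 2, 3, 4, 5};
b = {3, 4, 6};

(* Argument order should not matter for the symmetric version *)
ks2Symmetric[a, b] == ks2Symmetric[b, a]
```

With unequal sample sizes like these, the version that keeps the
1/Sqrt[n1] factor would scale the same T by 1/Sqrt[5] in one ordering
and by 1/Sqrt[3] in the other, which is the asymmetry described above.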
