Re: Kolmogorov-Smirnov 2-sample test

*To*: mathgroup at smc.vnet.net
*Subject*: [mg111157] Re: Kolmogorov-Smirnov 2-sample test
*From*: Andy Ross <andyr at wolfram.com>
*Date*: Thu, 22 Jul 2010 05:42:40 -0400 (EDT)

Bill Rowe wrote:
> On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:
>
>> Here is some code written by Andy Ross at Wolfram for the two
>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>> the test statistic, and KSBootstrapPValue provides a bootstrap
>> estimate of the p-value given the two data sets, the number of
>> simulations for the estimate and the test statistic.
>
>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>> &]]/Length[data]
>
>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>> n2 = Length[data2], T},
>> e1 = empiricalCDF[sd1, #] & /@ udat;
>> e2 = empiricalCDF[sd2, #] & /@ udat;
>> T = Max[Abs[e1 - e2]];
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>> ]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have an extra factor in it. Specifically, 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.
>

You are absolutely correct. The factor should be removed. I believe it
is a remnant of an incomplete copy and paste.

-Andy
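
To illustrate the point about argument order, here is a minimal sketch of the scaled statistic with the 1/Sqrt[n1] factor removed. It is a Python translation for illustration only, not the Mathematica code posted in this thread; the names empirical_cdf, ks_two_sample and ks_with_extra_factor are assumptions introduced for this example.

```python
from math import sqrt


def empirical_cdf(data, x):
    """Fraction of observations in data that are <= x."""
    return sum(1 for v in data if v <= x) / len(data)


def max_cdf_difference(data1, data2):
    """Largest absolute gap between the two empirical CDFs,
    evaluated at every distinct observed value."""
    points = sorted(set(data1) | set(data2))
    return max(
        abs(empirical_cdf(data1, x) - empirical_cdf(data2, x))
        for x in points
    )


def ks_two_sample(data1, data2):
    """Scaled statistic Sqrt[n1*n2/(n1 + n2)] * T, without the
    extra 1/Sqrt[n1] factor. Symmetric in its arguments."""
    n1, n2 = len(data1), len(data2)
    return sqrt(n1 * n2 / (n1 + n2)) * max_cdf_difference(data1, data2)


def ks_with_extra_factor(data1, data2):
    """Same statistic with the stray 1/Sqrt[n1] factor kept, shown
    only to make the dependence on argument order visible."""
    return ks_two_sample(data1, data2) / sqrt(len(data1))


if __name__ == "__main__":
    a = [1, 2, 3]
    b = [2, 3, 4, 5, 6]
    # Without the extra factor, swapping the arguments changes nothing.
    print(ks_two_sample(a, b), ks_two_sample(b, a))
    # With the extra factor, the two orderings disagree when n1 != n2.
    print(ks_with_extra_factor(a, b), ks_with_extra_factor(b, a))
```

With samples of unequal size, the first pair of values agrees and the second pair does not, which is the asymmetry Bill describes.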