Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111119] Re: Kolmogorov-Smirnov 2-sample test
- From: Bill Rowe <readnews at sbcglobal.net>
- Date: Wed, 21 Jul 2010 07:11:30 -0400 (EDT)
On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:

>Here is some code written by Andy Ross at Wolfram for the two
>sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>the test statistic, and KSBootstrapPValue provides a bootstrap
>estimate of the p-value given the two data sets, the number of
>simulations for the estimate and the test statistic.

>In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
>&]]/Length[data]

>In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>n2 = Length[data2], T},
>e1 = empiricalCDF[sd1, #] & /@ udat;
>e2 = empiricalCDF[sd2, #] & /@ udat;
>T = Max[Abs[e1 - e2]];
>(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>]

After looking at your code above, I realized I posted a very bad
solution to this problem. But it looks to me like there is a
problem with this code. The returned result

(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T

seems to have an extra factor in it, specifically 1/Sqrt[n1].
Since n1 is the number of samples in the first data set, including
this factor means you will get a different result by interchanging
the order of the arguments to the function whenever the two data
sets have different numbers of samples. Since the KS statistic is
based on the maximum difference between the empirical CDFs, the
order in which the data sets are passed to the function should not
matter.
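To make the asymmetry concrete: with the extra factor, the multiplier on T reduces algebraically to Sqrt[n2/(n1 + n2)], which changes when n1 and n2 trade places. Here is a minimal sketch of a check, assuming the empiricalCDF definition from the quoted post. The names maxCDFGap, ksAsPosted and ksSymmetric, and the two small data sets, are made up for illustration and are not part of the posted code.

```mathematica
(* Empirical CDF, as defined in the quoted post *)
empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

(* Hypothetical helper: the unscaled statistic, i.e. the largest gap
   between the two empirical CDFs over all observed values *)
maxCDFGap[data1_, data2_] :=
 Module[{udat = Union[Flatten[{data1, data2}]]},
  Max[Abs[(empiricalCDF[data1, #] & /@ udat) -
     (empiricalCDF[data2, #] & /@ udat)]]]

(* Scaling as posted, including the 1/Sqrt[n1] factor *)
ksAsPosted[data1_, data2_] :=
 With[{n1 = Length[data1], n2 = Length[data2]},
  (1/Sqrt[n1]) Sqrt[(n1*n2)/(n1 + n2)] maxCDFGap[data1, data2]]

(* Scaling without the 1/Sqrt[n1] factor *)
ksSymmetric[data1_, data2_] :=
 With[{n1 = Length[data1], n2 = Length[data2]},
  Sqrt[(n1*n2)/(n1 + n2)] maxCDFGap[data1, data2]]

(* Illustrative data with unequal sample sizes (3 and 5) *)
a = {1, 2, 3};
b = {4, 5, 6, 7, 8};

N[{ksAsPosted[a, b], ksAsPosted[b, a]}]    (* the two entries differ *)
N[{ksSymmetric[a, b], ksSymmetric[b, a]}]  (* the two entries agree *)
```

For these data the two samples do not overlap, so the maximum CDF gap is 1 in either order. The posted scaling then gives Sqrt[5/8] (about 0.79) for (a, b) but Sqrt[3/8] (about 0.61) for (b, a), while the version without the factor gives Sqrt[15/8] both ways.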