Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111180] Re: Kolmogorov-Smirnov 2-sample test
- From: Andy Ross <andyr at wolfram.com>
- Date: Fri, 23 Jul 2010 07:09:11 -0400 (EDT)
Andy Ross wrote:
> Bill Rowe wrote:
>> On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:
>>
>>> Here is some code written by Andy Ross at Wolfram for the two
>>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>>> the test statistic, and KSBootstrapPValue provides a bootstrap
>>> estimate of the p-value given the two data sets, the number of
>>> simulations for the estimate and the test statistic.
>>>
>>> In[1]:= empiricalCDF[data_, x_] :=
>>>          Length[Select[data, # <= x &]]/Length[data]
>>>
>>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>>>          Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>>>            udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>>>            n2 = Length[data2], T},
>>>           e1 = empiricalCDF[sd1, #] & /@ udat;
>>>           e2 = empiricalCDF[sd2, #] & /@ udat;
>>>           T = Max[Abs[e1 - e2]];
>>>           (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>>           ]
>>
>> After looking at your code above I realized I posted a very bad
>> solution to this problem. But it looks to me like there is a
>> problem with this code. The returned result
>>
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>
>> seems to have an extra factor in it, specifically 1/Sqrt[n1].
>> Since n1 is the number of samples in the first data set,
>> including this factor means you will get a different result by
>> interchanging the order of the arguments to the function when
>> the number of samples in each data set is different. Since the
>> KS statistic is based on the maximum difference between the
>> empirical CDFs, the order in which the data sets are used in the
>> function should not matter.
>
> You are absolutely correct. The factor should be removed. I believe it
> is a remnant of an incomplete copy and paste.
>
> -Andy

I've corrected the error in my code from before. The p-value computation
was giving low estimates because I was using RandomChoice rather than
RandomSample. I believe this should do the job (though rather slowly).

empiricalCDF[data_, x_] :=
 Length[Select[data, # <= x &]]/Length[data]

splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}

KolmogorovSmirnov2Sample[data1_, data2_] :=
 Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
   udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
   n2 = Length[data2], T},
  e1 = empiricalCDF[sd1, #] & /@ udat;
  e2 = empiricalCDF[sd2, #] & /@ udat;
  T = Max[Abs[e1 - e2]];
  (Sqrt[(n1*n2)/(n1 + n2)]) T // N
  ]

KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
 Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
  dfts = ConstantArray[0, MCSamp];
  Do[
   dfts[[i]] =
    KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1],
   {i, MCSamp}
   ];
  Length[Select[dfts, # >= T &]]/MCSamp // N
  ]

Example:

data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
   -0.68592, -0.0461418, -0.29906, 0.209381};

data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
   -0.0812646, -1.1592, 0.138245, -0.314907};

In[41]:= KolmogorovSmirnov2Sample[data1, data2]

Out[41]= 0.67082

Using 1000 bootstrap samples...

In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]

Out[42]= 0.791

-Andy
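
For comparison, since KolmogorovSmirnov2Sample already returns the scaled
statistic Sqrt[(n1*n2)/(n1 + n2)] D, one can also read a rough p-value off
the limiting Kolmogorov distribution, whose tail is
2 Sum[(-1)^(k - 1) Exp[-2 k^2 t^2], {k, 1, Infinity}]. The helper below is
only a minimal sketch of that idea; KS2AsymptoticPValue is a name
introduced here for illustration and is not part of the code above.

(* Sketch: asymptotic p-value from the tail of the limiting Kolmogorov
   distribution, truncating the alternating series after kmax terms.
   KS2AsymptoticPValue is a hypothetical helper, not from the post. *)
KS2AsymptoticPValue[T_, kmax_: 100] :=
 N[2 Sum[(-1)^(k - 1) Exp[-2 k^2 T^2], {k, 1, kmax}]]

For the example above, KS2AsymptoticPValue[0.67082] comes out to roughly
0.76, in the same ballpark as the bootstrap estimate of 0.791; with only
10 points per sample the asymptotic approximation is crude, so the
permutation-based estimate from KS2BootStrapPValue is the more
trustworthy number.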