Re: Kolmogorov-Smirnov 2-sample test

*To*: mathgroup at smc.vnet.net*Subject*: [mg111278] Re: Kolmogorov-Smirnov 2-sample test*From*: Aaron Bramson <aaronbramson at gmail.com>*Date*: Mon, 26 Jul 2010 06:36:42 -0400 (EDT)

Hello everybody and thank you, This has been very helpful, and now the two-sided K-S test for Mathematica is online for everybody to enjoy. I have implemented the new code from Andy and from Ray on my data set and the code from Ray works out better for me...though I don't have the skill to decipher what that "ugly" code is doing, I've verified several results so I'm using those exact p-values. I'm going to build a table of the p-values from these tests (which is made into plot over time with the test being performed on the individual-trial data streams of two cohorts at each time step). I have one last question, or maybe it's a request.. In Ray's code if I put in two data sets wherein all the points are at the same value (e.g. all zero) the result is not a K-stat of 0, and a p-value of 1, but rather {-\[Infinity], 0.}. That doesn't seem like the right answer (and in any case not the answer that I expect or can use) so this input combination doesn't work with how the technique calculates the stats. So I'd like to request a small change to the code Ray provided so that if the inputs are all identical the output is {0,1} instead of {-\[Infinity], 0.}. I could do this post-facto with a replacement rule, but it would probably be better and faster to do this in the original calculation. But with THAT code I don't know where to make the appropriate changes. Again, thanks everybody for your help. Best, Aaron p.s. I may end up using the Kuiper test and I might therefore have a similar question about implementing that in Mathematica very soon. On Fri, Jul 23, 2010 at 7:09 AM, Andy Ross <andyr at wolfram.com> wrote: > Andy Ross wrote: > > Bill Rowe wrote: > >> On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote: > >> > >>> Here is some code written by Andy Ross at Wolfram for the two > >>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes > >>> the test statistic, and KSBootstrapPValue provides a bootstrap > >>> estimate of the p-value given the two data sets, the number of > >>> simulations for the estimate and the test statistic. > >>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x > >>> &]]/Length[data] > >>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] := > >>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2, > >>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1], > >>> n2 = Length[data2], T}, > >>> e1 = empiricalCDF[sd1, #] & /@ udat; > >>> e2 = empiricalCDF[sd2, #] & /@ udat; > >>> T = Max[Abs[e1 - e2]]; > >>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T > >>> ] > >> After looking at your code above I realized I posted a very bad > >> solution to this problem. But, it looks to me like there is a > >> problem with this code. The returned result > >> > >> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T > >> > >> seems to have a extra factor in it. Specifically 1/Sqrt[n1]. > >> Since n1 is the number of samples in the first data set, > >> including this factor means you will get a different result by > >> interchanging the order of the arguments to the function when > >> the number of samples in each data set is different. Since the > >> KS statistic is based on the maximum difference between the > >> empirical CDFs, the order in which the data sets are used in the > >> function should not matter. > >> > > > > You are absolutely correct. The factor should be removed. I believe it > > is a remnant of an incomplete copy and paste. > > > > -Andy > > I've corrected the error in my code from before. The p-value > computation was giving low estimates because I was using RandomChoice > rather than RandomSample. I believe this should do the job (though > rather slowly). > > empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data] > > splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]} > > KolmogorovSmirnov2Sample[data1_, data2_] := > Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2, > udat = Union[Flatten[{data1, data2}]], n1 = Length[data1], > n2 = Length[data2], T}, > e1 = empiricalCDF[sd1, #] & /@ udat; > e2 = empiricalCDF[sd2, #] & /@ udat; > T = Max[Abs[e1 - e2]]; > (Sqrt[(n1*n2)/(n1 + n2)]) T // N > ] > > KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] := > Block[{n1 = Length[data1], udat = Join[data1, data2], dfts}, > dfts = ConstantArray[0, MCSamp]; > Do[ > dfts[[i]] = > KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1] > , {i, MCSamp} > ]; > Length[Select[dfts, # >= T &]]/MCSamp // N > ] > > Example: > > data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574, > -0.68592, -0.0461418, -0.29906, 0.209381}; > > data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511, > -0.0812646, -1.1592, 0.138245, -0.314907}; > > In[41]:= KolmogorovSmirnov2Sample[data1, data2] > > Out[41]= 0.67082 > > Using 1000 bootstrap samples... > > In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000] > > Out[42]= 0.791 > > -Andy >