Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111169] Re: Kolmogorov-Smirnov 2-sample test
- From: thomas <thomas.muench at gmail.com>
- Date: Thu, 22 Jul 2010 05:44:53 -0400 (EDT)
- References: <i26kkn$95h$1@smc.vnet.net>
For what it's worth, here is the code I am using for the KS test. I got the algorithm from R, and then translated it into Mathematica. I do get different p-values than in the code by Darren, though. Maybe the aficionados out there can figure out where the mistake might be. Note: "data" should contain the two lists, for example (* create two data sets *) data1 = RandomReal[NormalDistribution[], 15]; data2 = RandomReal[NormalDistribution[], 15]; data={data1,data2}; kolmogorovTest[data_, modus_: "two-sided"] := Module[{untiedData, z, testStatistic, p}, untiedData = data; With[{badguys = Select[#, #[[2]] > 1 &] & /@ Tally /@ untiedData}, Do[(untiedData[[i, Flatten[Position[untiedData[[i]], #[[1]]]]]] = #[[1]] + ( Range[0, #[[2]] - 1] - (#[[2]] - 1)/2)/(10 #[[2]])) & /@ badguys[[i]], {i, 2}]; ];(*replaces ties (same values, e.g. {x,x,x}) with {x-1/30,x,x+1/ 30}*) z = Accumulate[(1 - #)/Length@untiedData[[1]] + -#/ Length@untiedData[[2]] &@ UnitStep[Ordering[Join @@ untiedData] - Length@untiedData[[1]]]]; p = Which[ modus == "two-sided", N@With[{u = Max[Abs[z]] Sqrt[Times @@ Length /@ data/ Plus @@ Length /@ data]}, \[Piecewise] { {1, u < .2}, {1 - Sqrt[2 Pi]/ u (E^(-((25 \[Pi]^2)/(8 u^2))) + E^(-((9 \[Pi]^2)/(8 u^2))) + E^(-(\[Pi]^2/(8 u^2)))), u < .755}, {Total[ Take[{2 E^(-2 u^2), -2 E^(-8 u^2), 2 E^(-18 u^2), -2 E^(-32 u^2)}, Max[1, Round[3/u]]]], u < 6.8116}, {0, u >= 6.8116} } ], modus == "greater", N@Exp[-2 Max[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data], modus == "less", N@Exp[-2 Min[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data], True, Return["Wrong modus (has to be \"two-sided\" (default), \"greater\ \", or \"less\")."]]; Return[p] ] On Jul 21, 1:11 pm, Bill Rowe <readn... at sbcglobal.net> wrote: > On 7/20/10 at 3:41 AM, darr... at wolfram.com (Darren Glosemeyer) wrote: > > > > > > >Here is some code written by Andy Ross at Wolfram for the two > >sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes > >the test statistic, and KSBootstrapPValue provides a bootstrap > >estimate of the p-value given the two data sets, the number of > >simulations for the estimate and the test statistic. > >In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x > >&]]/Length[data] > >In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] := > >Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2, > >udat = Union[Flatten[{data1, data2}]], n1 = Length[data1], > >n2 = Length[data2], T}, > >e1 = empiricalCDF[sd1, #] & /@ udat; > >e2 = empiricalCDF[sd2, #] & /@ udat; > >T = Max[Abs[e1 - e2]]; > >(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T > >] > > After looking at your code above I realized I posted a very bad > solution to this problem. But, it looks to me like there is a > problem with this code. The returned result > > (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T > > seems to have a extra factor in it. Specifically 1/Sqrt[n1]. > Since n1 is the number of samples in the first data set, > including this factor means you will get a different result by > interchanging the order of the arguments to the function when > the number of samples in each data set is different. Since the > KS statistic is based on the maximum difference between the > empirical CDFs, the order in which the data sets are used in the > function should not matter.