Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111085] Re: Kolmogorov-Smirnov 2-sample test
- From: Darren Glosemeyer <darreng at wolfram.com>
- Date: Tue, 20 Jul 2010 03:41:19 -0400 (EDT)
Aaron Bramson wrote: > Hello Everybody, > > I would like to perform a 2-sample k-s test. I've seen some posts on the > archive about the one-sample goodness-of-fit version of the > Kolgomorov-Smirnov test, but I'm interested in the 2-sample version. > > Here's a description of the method: > http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test > > I think all the necessary components are available; e.g. > Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs, max of > a list, etc. But the data management is a bit above my current skill > level. Also, since all other software packages seem to include this test > capability, I would be really surprised if there > wasn't a package somewhere that included it by now, but I've searched a lot > and can't find it. Can anybody help me locate this this? > > Alternatively, would anybody like to work with me to build this in case it > can't be found? > > Thanks in Advance, > Aaron > > > Hi Aaron, Here is some code written by Andy Ross at Wolfram for the two sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes the test statistic, and KSBootstrapPValue provides a bootstrap estimate of the p-value given the two data sets, the number of simulations for the estimate and the test statistic. In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data] In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] := Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2, udat = Union[Flatten[{data1, data2}]], n1 = Length[data1], n2 = Length[data2], T}, e1 = empiricalCDF[sd1, #] & /@ udat; e2 = empiricalCDF[sd2, #] & /@ udat; T = Max[Abs[e1 - e2]]; (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T ] In[3]:= KSBootstrapPValue[data1_, data2_, nSamp_, T_] := Block[{udat = Union[Flatten[{data1, data2}]], n1 = Length[data1], n2 = Length[data2], d1Samples, d2Samples, Tdist}, d1Samples = RandomChoice[udat, {nSamp, n1}]; d2Samples = RandomChoice[udat, {nSamp, n2}]; Tdist = MapThread[ KolmogorovSmirnov2Sample[#1, #2] &, {d1Samples, d2Samples}]; 1 - empiricalCDF[Tdist, T] // N ] (* create two data sets *) In[4]:= data1 = RandomReal[NormalDistribution[], 15]; In[5]:= data2 = RandomReal[NormalDistribution[], 15]; (* compute the test statistic *) In[6]:= stat = KolmogorovSmirnov2Sample[data1, data2] // N Out[6]= 0.329983 (* estimate the p-value from 10000 bootstrap samples *) In[7]:= KSBootstrapPValue[data1, data2, 10000, stat] Out[7]= 0.0213 Darren Glosemeyer Wolfram Research