MathGroup Archive: July 2010 [00453]

[Date Index] [Thread Index] [Author Index]

Re: Kolmogorov-Smirnov 2-sample test

To: mathgroup at smc.vnet.net
Subject: [mg111085] Re: Kolmogorov-Smirnov 2-sample test
From: Darren Glosemeyer <darreng at wolfram.com>
Date: Tue, 20 Jul 2010 03:41:19 -0400 (EDT)

Aaron Bramson wrote:
> Hello Everybody,
>
> I would like to perform a 2-sample k-s test.  I've seen some posts on the
> archive about the one-sample goodness-of-fit version of the
> Kolgomorov-Smirnov test, but I'm interested in the 2-sample version.
>
> Here's a description of the method:
> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>
> I think all the necessary components are available; e.g.
> Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs, max of
> a list, etc.  But the data management is a bit above my current skill
> level.  Also, since all other software packages seem to include this test
> capability, I would be really surprised if there
> wasn't a package somewhere that included it by now, but I've searched a lot
> and can't find it.  Can anybody help me locate this this?
>
> Alternatively, would anybody like to work with me to build this in case it
> can't be found?
>
> Thanks in Advance,
> Aaron
>
>
>   

Hi Aaron,

Here is some code written by Andy Ross at Wolfram  for the two sample 
Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes the test 
statistic, and KSBootstrapPValue provides a bootstrap estimate of the 
p-value given the two data sets, the number of simulations for the 
estimate and the test statistic.

In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x 
&]]/Length[data]

In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
         Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
           udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
           n2 = Length[data2], T},
          e1 = empiricalCDF[sd1, #] & /@ udat;
          e2 = empiricalCDF[sd2, #] & /@ udat;
          T = Max[Abs[e1 - e2]];
          (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
          ]

In[3]:= KSBootstrapPValue[data1_, data2_, nSamp_, T_] :=
         Block[{udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
           n2 = Length[data2], d1Samples, d2Samples, Tdist},
          d1Samples = RandomChoice[udat, {nSamp, n1}];
          d2Samples = RandomChoice[udat, {nSamp, n2}];
          Tdist =
           MapThread[
            KolmogorovSmirnov2Sample[#1, #2] &, {d1Samples, d2Samples}];
          1 - empiricalCDF[Tdist, T] // N
          ]

(* create two data sets *)
In[4]:= data1 = RandomReal[NormalDistribution[], 15];

In[5]:= data2 = RandomReal[NormalDistribution[], 15];

(* compute the test statistic *)
In[6]:= stat = KolmogorovSmirnov2Sample[data1, data2] // N

Out[6]= 0.329983

(* estimate the p-value from 10000 bootstrap samples *)
In[7]:= KSBootstrapPValue[data1, data2, 10000, stat]

Out[7]= 0.0213


Darren Glosemeyer
Wolfram Research

Prev by Date: Re: Transform differential equation by tranformation rule

Next by Date: Re: Avoid the use of certain functions

Previous by thread: Re: Kolmogorov-Smirnov 2-sample test

Next by thread: Re: Kolmogorov-Smirnov 2-sample test