Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111180] Re: Kolmogorov-Smirnov 2-sample test
- From: Andy Ross <andyr at wolfram.com>
- Date: Fri, 23 Jul 2010 07:09:11 -0400 (EDT)
Andy Ross wrote:
> Bill Rowe wrote:
>> On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:
>>
>>> Here is some code written by Andy Ross at Wolfram for the two
>>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
>>> the test statistic, and KSBootstrapPValue provides a bootstrap
>>> estimate of the p-value given the two data sets, the number of
>>> simulations for the estimate and the test statistic.
>>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]
>>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
>>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>>> n2 = Length[data2], T},
>>> e1 = empiricalCDF[sd1, #] & /@ udat;
>>> e2 = empiricalCDF[sd2, #] & /@ udat;
>>> T = Max[Abs[e1 - e2]];
>>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>> ]
>> After looking at your code above I realized I posted a very bad
>> solution to this problem. But, it looks to me like there is a
>> problem with this code. The returned result
>>
>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>>
>> seems to have an extra factor in it, specifically 1/Sqrt[n1].
>> Since n1 is the number of samples in the first data set,
>> including this factor means you will get a different result by
>> interchanging the order of the arguments to the function when
>> the number of samples in each data set is different. Since the
>> KS statistic is based on the maximum difference between the
>> empirical CDFs, the order in which the data sets are used in the
>> function should not matter.
>>
>
> You are absolutely correct. The factor should be removed. I believe it
> is a remnant of an incomplete copy and paste.
>
> -Andy
I've corrected the error in my code from before. The p-value
computation was giving low estimates because I was using RandomChoice
rather than RandomSample: RandomChoice draws from the pooled data with
replacement, while RandomSample returns a permutation of the pool,
which is what is needed before splitting it back into two groups of
the original sizes. I believe this should do the job (though rather
slowly).
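A quick illustration of the difference (the pooled list here is just a
stand-in, and the outputs shown in the comments are only examples of
what a run might produce):

pooled = Range[6];

RandomSample[pooled]
(* a random permutation of the pool, e.g. {4, 1, 6, 3, 2, 5} *)

RandomChoice[pooled, 6]
(* six draws with replacement, e.g. {2, 2, 5, 1, 6, 6} -- values can repeat or be missing *)

The corrected definitions: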
(* empirical CDF: the fraction of data points <= x *)
empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]

(* split a pooled list back into two groups of the original sizes *)
splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}

(* scaled two-sample Kolmogorov-Smirnov statistic:
   Sqrt[(n1 n2)/(n1 + n2)] * Max over x of |ECDF1(x) - ECDF2(x)| *)
KolmogorovSmirnov2Sample[data1_, data2_] :=
 Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
   udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
   n2 = Length[data2], T},
  e1 = empiricalCDF[sd1, #] & /@ udat;
  e2 = empiricalCDF[sd2, #] & /@ udat;
  T = Max[Abs[e1 - e2]];
  (Sqrt[(n1*n2)/(n1 + n2)]) T // N
  ]

(* Monte Carlo p-value: pool the data, repeatedly permute and resplit with
   splitAtN1, and report the fraction of resampled statistics that are at
   least the observed statistic T *)
KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
 Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
  dfts = ConstantArray[0, MCSamp];
  Do[
   dfts[[i]] =
    KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
   , {i, MCSamp}
   ];
  Length[Select[dfts, # >= T &]]/MCSamp // N
  ]
Example:
data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
-0.68592, -0.0461418, -0.29906, 0.209381};
data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
-0.0812646, -1.1592, 0.138245, -0.314907};
In[41]:= KolmogorovSmirnov2Sample[data1, data2]
Out[41]= 0.67082
Using 1000 bootstrap samples...
In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]
Out[42]= 0.791
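As a quick check of the point about argument order raised above, the
corrected statistic is symmetric in the two samples, so swapping the
arguments should give the same value:

KolmogorovSmirnov2Sample[data2, data1] == KolmogorovSmirnov2Sample[data1, data2]
(* should return True *)

The observed statistic can also be computed once and passed on directly
rather than retyping the number:

T = KolmogorovSmirnov2Sample[data1, data2];
KS2BootStrapPValue[data1, data2, T, 1000]
(* a Monte Carlo estimate, so the value will vary a little from run to run *)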
-Andy