MathGroup Archive: July 2010 [00641]

[Date Index] [Thread Index] [Author Index]
Re: Kolmogorov-Smirnov 2-sample test
To: mathgroup at smc.vnet.net
Subject: [mg111278] Re: Kolmogorov-Smirnov 2-sample test
From: Aaron Bramson <aaronbramson at gmail.com>
Date: Mon, 26 Jul 2010 06:36:42 -0400 (EDT)
Hello everybody and thank you,

This has been very helpful, and now the two-sided K-S test for Mathematica
is online for everybody to enjoy.

I have implemented the new code from Andy and from Ray on my data set and
the code from Ray works out better for me...though I don't have the skill to
decipher what that "ugly" code is doing, I've verified several results so
I'm using those exact p-values.  I'm going to build a table of the p-values
from these tests (which is made into plot over time with the test being
performed on the individual-trial data streams of two cohorts at each time
step).

I have one last question, or maybe it's a request..   In Ray's code if I put
in two data sets wherein all the points are at the same value (e.g. all
zero) the result is not a K-stat of 0, and a p-value of 1, but rather
{-\[Infinity], 0.}.  That doesn't seem like the right answer (and in any
case not the answer that I expect or can use) so this input combination
doesn't work with how the technique calculates the stats.  So I'd like to
request a small change to the code Ray provided so that if the inputs are
all identical the output is {0,1} instead of {-\[Infinity], 0.}.  I could do
this post-facto with a replacement rule, but it would probably be better and
faster to do this in the original calculation.  But with THAT code I don't
know where to make the appropriate changes.

Again, thanks everybody for your help.

Best,
Aaron


p.s. I may end up using the Kuiper test and I might therefore have a similar
question about implementing that in Mathematica very soon.



On Fri, Jul 23, 2010 at 7:09 AM, Andy Ross <andyr at wolfram.com> wrote:

> Andy Ross wrote:
> > Bill Rowe wrote:
> >> On 7/20/10 at 3:41 AM, darreng at wolfram.com (Darren Glosemeyer) wrote:
> >>
> >>> Here is some code written by Andy Ross at Wolfram  for the two
> >>> sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
> >>> the test statistic, and KSBootstrapPValue provides a bootstrap
> >>> estimate of the p-value given the two data sets, the number of
> >>> simulations for the estimate and the test statistic.
> >>> In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
> >>> &]]/Length[data]
> >>> In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
> >>> Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> >>> udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> >>> n2 = Length[data2], T},
> >>> e1 = empiricalCDF[sd1, #] & /@ udat;
> >>> e2 = empiricalCDF[sd2, #] & /@ udat;
> >>> T = Max[Abs[e1 - e2]];
> >>> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >>> ]
> >> After looking at your code above I realized I posted a very bad
> >> solution to this problem. But, it looks to me like there is a
> >> problem with this code. The returned result
> >>
> >> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >>
> >> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
> >> Since n1 is the number of samples in the first data set,
> >> including this factor means you will get a different result by
> >> interchanging the order of the arguments to the function when
> >> the number of samples in each data set is different. Since the
> >> KS statistic is based on the maximum difference between the
> >> empirical CDFs, the order in which the data sets are used in the
> >> function should not matter.
> >>
> >
> > You are absolutely correct.  The factor should be removed. I believe it
> > is a remnant of an incomplete copy and paste.
> >
> > -Andy
>
> I've corrected the error in my code from before.  The p-value
> computation was giving low estimates because I was using RandomChoice
> rather than RandomSample. I believe this should do the job (though
> rather slowly).
>
> empiricalCDF[data_, x_] := Length[Select[data, # <= x &]]/Length[data]
>
> splitAtN1[udat_, n1_] := {udat[[1 ;; n1]], udat[[n1 + 1 ;; -1]]}
>
> KolmogorovSmirnov2Sample[data1_, data2_] :=
>  Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
>    udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
>    n2 = Length[data2], T},
>   e1 = empiricalCDF[sd1, #] & /@ udat;
>   e2 = empiricalCDF[sd2, #] & /@ udat;
>   T = Max[Abs[e1 - e2]];
>    (Sqrt[(n1*n2)/(n1 + n2)]) T // N
>   ]
>
> KS2BootStrapPValue[data1_, data2_, T_, MCSamp_] :=
>  Block[{n1 = Length[data1], udat = Join[data1, data2], dfts},
>   dfts = ConstantArray[0, MCSamp];
>   Do[
>    dfts[[i]] =
>     KolmogorovSmirnov2Sample @@ splitAtN1[RandomSample[udat], n1]
>    , {i, MCSamp}
>    ];
>   Length[Select[dfts, # >= T &]]/MCSamp // N
>   ]
>
> Example:
>
> data1 = {0.386653, 1.10925, 0.871822, -0.266199, 2.00516, -1.48574,
> -0.68592, -0.0461418, -0.29906, 0.209381};
>
> data2 = {-0.283594, -1.08097, 0.915052, 0.448915, -0.88062, -0.140511,
> -0.0812646, -1.1592, 0.138245, -0.314907};
>
> In[41]:= KolmogorovSmirnov2Sample[data1, data2]
>
> Out[41]= 0.67082
>
> Using 1000 bootstrap samples...
>
> In[42]:= KS2BootStrapPValue[data1, data2, .67082, 1000]
>
> Out[42]= 0.791
>
> -Andy
>
Prev by Date: Re: trading
Next by Date: Pochhammer Special Function
Previous by thread: Re: Kolmogorov-Smirnov 2-sample test
Next by thread: Problems with Workbench Debugger Breakpoints