Re: Kolmogorov-Smirnov 2-sample test
- To: mathgroup at smc.vnet.net
- Subject: [mg111169] Re: Kolmogorov-Smirnov 2-sample test
- From: thomas <thomas.muench at gmail.com>
- Date: Thu, 22 Jul 2010 05:44:53 -0400 (EDT)
- References: <i26kkn$95h$1@smc.vnet.net>
For what it's worth, here is the code I am using for the KS test. I
got the algorithm from R, and then translated it into Mathematica. I
do get different p-values than in the code by Darren, though. Maybe
the aficionados out there can figure out where the mistake might be.
Note: "data" should contain the two lists, for example
(* create two data sets *)
data1 = RandomReal[NormalDistribution[], 15];
data2 = RandomReal[NormalDistribution[], 15];
data={data1,data2};
kolmogorovTest[data_, modus_: "two-sided"] :=
Module[{untiedData, z, testStatistic, p},
untiedData = data;
With[{badguys = Select[#, #[[2]] > 1 &] & /@ Tally /@ untiedData},
Do[(untiedData[[i,
Flatten[Position[untiedData[[i]], #[[1]]]]]] = #[[1]] + (
Range[0, #[[2]] - 1] - (#[[2]] - 1)/2)/(10 #[[2]])) & /@
badguys[[i]], {i, 2}];
];(*replaces ties (same values, e.g. {x,x,x}) with {x-1/30,x,x+1/
30}*)
z = Accumulate[(1 - #)/Length@untiedData[[1]] + -#/
Length@untiedData[[2]] &@
UnitStep[Ordering[Join @@ untiedData] - Length@untiedData[[1]]]];
p = Which[
modus == "two-sided",
N@With[{u =
Max[Abs[z]] Sqrt[Times @@ Length /@ data/
Plus @@ Length /@ data]},
\[Piecewise] {
{1, u < .2},
{1 -
Sqrt[2 Pi]/
u (E^(-((25 \[Pi]^2)/(8 u^2))) +
E^(-((9 \[Pi]^2)/(8 u^2))) + E^(-(\[Pi]^2/(8 u^2)))),
u < .755},
{Total[
Take[{2 E^(-2 u^2), -2 E^(-8 u^2),
2 E^(-18 u^2), -2 E^(-32 u^2)}, Max[1, Round[3/u]]]],
u < 6.8116},
{0, u >= 6.8116}
}
],
modus == "greater",
N@Exp[-2 Max[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
modus == "less",
N@Exp[-2 Min[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
True,
Return["Wrong modus (has to be \"two-sided\" (default), \"greater\
\", or \"less\")."]];
Return[p]
]
On Jul 21, 1:11 pm, Bill Rowe <readn... at sbcglobal.net> wrote:
> On 7/20/10 at 3:41 AM, darr... at wolfram.com (Darren Glosemeyer) wrote:
>
>
>
>
>
> >Here is some code written by Andy Ross at Wolfram for the two
> >sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
> >the test statistic, and KSBootstrapPValue provides a bootstrap
> >estimate of the p-value given the two data sets, the number of
> >simulations for the estimate and the test statistic.
> >In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
> >&]]/Length[data]
> >In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
> >Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> >udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> >n2 = Length[data2], T},
> >e1 = empiricalCDF[sd1, #] & /@ udat;
> >e2 = empiricalCDF[sd2, #] & /@ udat;
> >T = Max[Abs[e1 - e2]];
> >(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But, it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.