MathGroup Archive: July 2010 [00505]

[Date Index] [Thread Index] [Author Index]

Re: Kolmogorov-Smirnov 2-sample test

To: mathgroup at smc.vnet.net
Subject: [mg111169] Re: Kolmogorov-Smirnov 2-sample test
From: thomas <thomas.muench at gmail.com>
Date: Thu, 22 Jul 2010 05:44:53 -0400 (EDT)
References: <i26kkn$95h$1@smc.vnet.net>

For what it's worth, here is the code I am using for the KS test. I
got the algorithm from R, and then translated it into Mathematica. I
do get different p-values than in the code by Darren, though. Maybe
the aficionados out there can figure out where the mistake might be.
Note: "data" should contain the two lists, for example

  (* create two data sets *)
data1 = RandomReal[NormalDistribution[], 15];
data2 = RandomReal[NormalDistribution[], 15];
data={data1,data2};

kolmogorovTest[data_, modus_: "two-sided"] :=
 Module[{untiedData, z, testStatistic, p},

  untiedData = data;
  With[{badguys = Select[#, #[[2]] > 1 &] & /@ Tally /@ untiedData},
   Do[(untiedData[[i,
          Flatten[Position[untiedData[[i]], #[[1]]]]]] = #[[1]] + (
          Range[0, #[[2]] - 1] - (#[[2]] - 1)/2)/(10 #[[2]])) & /@
      badguys[[i]], {i, 2}];
   ];(*replaces ties (same values, e.g. {x,x,x}) with {x-1/30,x,x+1/
  30}*)

  z = Accumulate[(1 - #)/Length@untiedData[[1]] + -#/
       Length@untiedData[[2]] &@
     UnitStep[Ordering[Join @@ untiedData] - Length@untiedData[[1]]]];

  p = Which[
    modus == "two-sided",
    N@With[{u =
        Max[Abs[z]] Sqrt[Times @@ Length /@ data/
         Plus @@ Length /@ data]},
      \[Piecewise] {
        {1, u < .2},
        {1 -
          Sqrt[2 Pi]/
           u (E^(-((25 \[Pi]^2)/(8 u^2))) +
             E^(-((9 \[Pi]^2)/(8 u^2))) + E^(-(\[Pi]^2/(8 u^2)))),
         u < .755},
        {Total[
          Take[{2 E^(-2 u^2), -2 E^(-8 u^2),
            2 E^(-18 u^2), -2 E^(-32 u^2)}, Max[1, Round[3/u]]]],
         u < 6.8116},
        {0, u >= 6.8116}
       }
      ],
    modus == "greater",
    N@Exp[-2 Max[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
    modus == "less",
    N@Exp[-2 Min[z]^2 Times @@ Length /@ data/Plus @@ Length /@ data],
    True,
    Return["Wrong modus (has to be \"two-sided\" (default), \"greater\
\", or \"less\")."]];

  Return[p]
  ]

On Jul 21, 1:11 pm, Bill Rowe <readn... at sbcglobal.net> wrote:
> On 7/20/10 at 3:41 AM, darr... at wolfram.com (Darren Glosemeyer) wrote:
>
>
>
>
>
> >Here is some code written by Andy Ross at Wolfram  for the two
> >sample Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes
> >the test statistic, and KSBootstrapPValue provides a bootstrap
> >estimate of the p-value given the two data sets, the number of
> >simulations for the estimate and the test statistic.
> >In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
> >&]]/Length[data]
> >In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
> >Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
> >udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
> >n2 = Length[data2], T},
> >e1 = empiricalCDF[sd1, #] & /@ udat;
> >e2 = empiricalCDF[sd2, #] & /@ udat;
> >T = Max[Abs[e1 - e2]];
> >(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
> >]
>
> After looking at your code above I realized I posted a very bad
> solution to this problem. But, it looks to me like there is a
> problem with this code. The returned result
>
> (1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
>
> seems to have a extra factor in it. Specifically 1/Sqrt[n1].
> Since n1 is the number of samples in the first data set,
> including this factor means you will get a different result by
> interchanging the order of the arguments to the function when
> the number of samples in each data set is different. Since the
> KS statistic is based on the maximum difference between the
> empirical CDFs, the order in which the data sets are used in the
> function should not matter.

Prev by Date: Re: Show left hand side

Next by Date: Re: Brillouin function for a Ferromagnet

Previous by thread: Re: Kolmogorov-Smirnov 2-sample test

Next by thread: Re: Kolmogorov-Smirnov 2-sample test