RE: Empirical CDF and InterpolatingFunction

*To*: mathgroup at smc.vnet.net*Subject*: [mg36590] RE: [mg36555] Empirical CDF and InterpolatingFunction*From*: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>*Date*: Fri, 13 Sep 2002 01:14:05 -0400 (EDT)*Sender*: owner-wri-mathgroup at wolfram.com

>-----Original Message----- >From: mark at markfisher.net [mailto:mark at markfisher.net] To: mathgroup at smc.vnet.net >Subject: [mg36590] [mg36555] Empirical CDF and InterpolatingFunction > > >I'm trying to write a fast empirical cummulative distribution function >(CDF). Empirical CDFs are step functions that can be expressed in >terms of a Which statement. For example, given the list of >observations {1, 2, 3}, > >f = Which[# < 1, 0, # < 2, 1/3, # < 3, 2/3, True, 1]& > >is the empirical CDF. Note that f /@ {1, 2, 3} returns {1/3, 2/3, 1} >and f is continuous from the right. > >When the number of observations is large, the Which statement >evaluates fairly slowly (even if it has been Compiled). Since >InterpolationFunction evaluates so much faster in general, I've tried >to use Interpolation with InterpolationOrder -> 0. The problem is that >the resulting InterpolatingFunction doesn't behave the way (I think) >it ought to. For example, let > >g = Interpolation[{{1, 1/3}, {2, 2/3}, {3, 1}}, InterpolationOrder -> >0] > >Then, g /@ {1, 2, 3} returns {2/3, 2/3, 1} instead of {1/3, 2/3, 1}. >In addition, g is continuous from the left rather than from the right. > >Obviously I am not aware of the considerations that went into >determining the behavior of InterpolationFunction when >InterpolationOrder -> 0. > >So I have two questions: > >(1) Does anyone have any opinions about how InterpolatingFunction >ought to behave with InterpolationOrder -> 0? > >(2) Does anyone have a faster way to evaluate an empirical CDF than a >compiled Which function? > >By the way, here's my current version: > >CompileEmpiricalCDF[list_?(VectorQ[#, NumericQ] &)] := > Block[{x}, Compile[{{x, _Real}}, Evaluate[ > Which @@ Flatten[ > Append[ > Transpose[{ > Thread[x < Sort[list]], > Range[0, 1 - 1/#, 1/#] & @ Length[list] > }], > {True, 1}]] > ]]] > >--Mark > Mark, as to (1) I don't know, but following your observation, and regarding ff = CompileEmpiricalCDF[{1, 2, 3}] Plot[ff[x], {x, 0, 4}, PlotRange -> {All, {0, 1}}] as a correct CDF, we get something quite similar with obs = {1, 2, 3} t = With[{n = Length[obs]}, Transpose[{Append[obs, Last[obs] + 1], Range[0, 1, 1/n]}]] g = Interpolation[t, InterpolationOrder -> 0] Off[InterpolatingFunction::"dmval"] Plot[g[x], {x, -1, 5}, PlotRange -> All] See that all values are shifted by one place, zero must be defined and one point added to the right. Now, comparing ff /@ ({0, 1, 2, 3, 4}) {0., 0.333333, 0.666667, 1., 1.} g /@ ({0, 1, 2, 3, 4}) {0, 1/3, 1/3, 2/3, 1} We get something "wrong" for g, but for g /@ ({0, 1, 2, 3, 4} + 2$MachineEpsilon) {0, 1/3, 2/3, 1, 1} obviously is all right. So we ran into problems of numerical precision. Now, what is the sense of the CDF at the point of jump? None I'd say, at some point above the jump it must count for the fraction of all events below in respect to over all. And this is done within the limits of numerical precision. Conceptionally the CDF is function used to integrate as a Stieltjes Integral, so only interpretations in this contex are possible, and you have to look at the limit from above (within the confines of numerical precision), all ok, I think. To dig up a bigger sample: << Statistics`ContinuousDistributions` gdist = GammaDistribution[3, 1] cdfunction = CDF[gdist, x] pTheory = Plot[cdfunction, {x, 0, 10}, PlotStyle -> Hue[.7]] data = RandomArray[gdist, 1000]; t = With[{n = Length[data], obs = Sort[data]}, Transpose[{Append[obs, Last[obs] + 1], Range[0, 1, 1/n]}]]; g = Interpolation[t, InterpolationOrder -> 0] pExperiment = Plot[g[x], {x, 0, 10}, PlotPoints -> 200] Show[pExperiment, pTheory] Nice comparison! ff = CompileEmpiricalCDF[Sort[data]]; pExperiment2 = Plot[ff[x], {x, 0, 10}, PlotStyle -> {Hue[.4, 1, .5], Thickness[.003]}, PlotPoints -> 200] Show[pTheory, pExperiment2, pExperiment] I couldn't see any differences between pExperiment and pExperiment2 enlarged. Timings: rx = Table[Random[Real, {0, 10}], {10000}]; (s1 = g /@ rx); // Timing {1.082 Second, Null} (s2 = ff /@ rx); // Timing {28.39 Second, Null} s1 == s2 True So, I think, with deliberate use, your observation can be well exploited as (2)! -- Hartmut Wolf