MathGroup Archive: September 2002 [00277]

[Date Index] [Thread Index] [Author Index]

RE: Empirical CDF and InterpolatingFunction

To: mathgroup at smc.vnet.net
Subject: [mg36590] RE: [mg36555] Empirical CDF and InterpolatingFunction
From: "Wolf, Hartmut" <Hartmut.Wolf at t-systems.com>
Date: Fri, 13 Sep 2002 01:14:05 -0400 (EDT)
Sender: owner-wri-mathgroup at wolfram.com

>-----Original Message-----
>From: mark at markfisher.net [mailto:mark at markfisher.net]
To: mathgroup at smc.vnet.net
>Subject: [mg36590] [mg36555] Empirical CDF and InterpolatingFunction
>
>
>I'm trying to write a fast empirical cummulative distribution function
>(CDF). Empirical CDFs are step functions that can be expressed in
>terms of a Which statement. For example, given the list of
>observations {1, 2, 3},
>
>f = Which[# < 1, 0, # < 2, 1/3, # < 3, 2/3, True, 1]&
>
>is the empirical CDF. Note that f /@ {1, 2, 3} returns {1/3, 2/3, 1}
>and f is continuous from the right.
>
>When the number of observations is large, the Which statement
>evaluates fairly slowly (even if it has been Compiled). Since
>InterpolationFunction evaluates so much faster in general, I've tried
>to use Interpolation with InterpolationOrder -> 0. The problem is that
>the resulting InterpolatingFunction doesn't behave the way (I think)
>it ought to. For example, let
>
>g = Interpolation[{{1, 1/3}, {2, 2/3}, {3, 1}}, InterpolationOrder ->
>0]
>
>Then, g /@ {1, 2, 3} returns {2/3, 2/3, 1} instead of {1/3, 2/3, 1}.
>In addition, g is continuous from the left rather than from the right.
>
>Obviously I am not aware of the considerations that went into
>determining the behavior of InterpolationFunction when
>InterpolationOrder -> 0.
>
>So I have two questions: 
>
>(1) Does anyone have any opinions about how InterpolatingFunction
>ought to behave with InterpolationOrder -> 0?
>
>(2) Does anyone have a faster way to evaluate an empirical CDF than a
>compiled Which function?
>
>By the way, here's my current version:
>
>CompileEmpiricalCDF[list_?(VectorQ[#, NumericQ] &)] :=
>  Block[{x}, Compile[{{x, _Real}}, Evaluate[
>    Which @@ Flatten[
>      Append[
>          Transpose[{
>            Thread[x < Sort[list]],
>            Range[0, 1 - 1/#, 1/#] & @ Length[list]
>              }],
>        {True, 1}]]
>  ]]]
>
>--Mark
>

Mark,

as to (1) I don't know, but following your observation, and regarding 

ff = CompileEmpiricalCDF[{1, 2, 3}]

Plot[ff[x], {x, 0, 4}, PlotRange -> {All, {0, 1}}]

as a correct CDF, we get something quite similar with

obs = {1, 2, 3}

t = With[{n = Length[obs]}, 
    Transpose[{Append[obs, Last[obs] + 1], Range[0, 1, 1/n]}]]

g = Interpolation[t, InterpolationOrder -> 0]

Off[InterpolatingFunction::"dmval"]

Plot[g[x], {x, -1, 5}, PlotRange -> All]

See that all values are shifted by one place, zero must be defined and one
point added to the right.

Now, comparing

ff /@ ({0, 1, 2, 3, 4})
{0., 0.333333, 0.666667, 1., 1.}

g /@ ({0, 1, 2, 3, 4})
{0, 1/3, 1/3, 2/3, 1}

We get something "wrong" for g, but for

g /@ ({0, 1, 2, 3, 4} + 2$MachineEpsilon)
{0, 1/3, 2/3, 1, 1}

obviously is all right. So we ran into problems of numerical precision. Now,
what is the sense of the CDF at the point of jump? None I'd say, at some
point above the jump it must count for the fraction of all events below in
respect to over all. And this is done within the limits of numerical
precision. Conceptionally the CDF is function used to integrate as a
Stieltjes Integral, so only interpretations in this contex are possible, and
you have to look at the limit from above (within the confines of numerical
precision), all ok, I think.

To dig up a bigger sample:

<< Statistics`ContinuousDistributions`

gdist = GammaDistribution[3, 1]
cdfunction = CDF[gdist, x]
pTheory = Plot[cdfunction, {x, 0, 10}, PlotStyle -> Hue[.7]]

data = RandomArray[gdist, 1000];
t = With[{n = Length[data], obs = Sort[data]}, 
      Transpose[{Append[obs, Last[obs] + 1], Range[0, 1, 1/n]}]];
g = Interpolation[t, InterpolationOrder -> 0]

pExperiment = Plot[g[x], {x, 0, 10}, PlotPoints -> 200]
Show[pExperiment, pTheory]

Nice comparison!

ff = CompileEmpiricalCDF[Sort[data]];
pExperiment2 = 
  Plot[ff[x], {x, 0, 10}, PlotStyle -> {Hue[.4, 1, .5], Thickness[.003]}, 
    PlotPoints -> 200]

Show[pTheory, pExperiment2, pExperiment]

I couldn't see any differences between pExperiment and pExperiment2
enlarged.
Timings:

rx = Table[Random[Real, {0, 10}], {10000}];

(s1 = g /@ rx); // Timing
{1.082 Second, Null}

(s2 = ff /@ rx); // Timing
{28.39 Second, Null}

s1 == s2
True 

So, I think, with deliberate use, your observation can be well exploited as
(2)!

--
Hartmut Wolf

Prev by Date: Re: creating adjacency matrices

Next by Date: Help Browser freezes

Previous by thread: RE: Empirical CDF and InterpolatingFunction

Next by thread: Re: Empirical CDF and InterpolatingFunction