MathGroup Archive: May 2012 [00157]

Re: question: fitting a distribution from quantiles

To: mathgroup at smc.vnet.net
Subject: [mg126453] Re: question: fitting a distribution from quantiles
From: László Sándor <sandorl at gmail.com>
Date: Sat, 12 May 2012 04:51:24 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201205110414.AAA23695@smc.vnet.net> <4FAD3253.3030007@wolfram.com>

Thank you, Darren!

I realized soon (long before the delay caused by the moderation of the list)
that I could fit a CDF. This even works with ridiculously aggregated
data, e.g. only two (inverse) quantiles for a Pareto distribution. However,
FindFit did not work with Mathematica's representation of the CDF
(conditions?), only with a hard-coded one.

But before I paste my output (with a lengthy error message) below, let me
ask another question: What exactly are the benefits of keeping a
distribution object in the background? Am I just as well off with a
(smoothed) CDF, plugging it (or its transformations and integrals) in
everywhere?

Basically, I want to use an empirical distribution in three ways:
-- "keep it as it is" (though it must be smoothed / approximated), as I do
need PDFs even though, like any real data, it comes discrete
-- fit a parametric distribution and use that everywhere I would have
used the empirical one
-- fit a mixture of parametric distributions (actually, it might be a
special mixture: I might concatenate two different (truncated) CDFs for
different parts, since real incomes have a Pareto right tail but an
obviously non-Pareto bottom).

Is it a good idea to try to keep these as distributions, or, as most of my
calculations will need to be numeric anyway, can I give up early and use
the CDFs?

Thanks!

Now the output from yesterday:

originalecdf = {{500000,0.0182},{1000000,0.1003},{1500000,0.2487},
  {2000000,0.3871},{4000000,0.6802}}
ecdf = {{2000000,0.3871},{4000000,0.6802}}
FindFit[ecdf,CDF[ParetoDistribution[k,a]],{k,a},x]

FindFit::nrlnum: The function value
{-0.3871+Function[\[FormalX],\[Piecewise] 1. +Times[<<2>>]  \[FormalX]>=k
                                          0.                True
,Listable],
-0.6802+Function[\[FormalX],\[Piecewise] 1. +Times[<<2>>]  \[FormalX]>=k
                                          0.                True
,Listable]}
 is not a list of real numbers with dimensions {2} at {k,a} = {1.,1.}. >>

FindFit[ecdf,1-(x / k)^(-a),{k,a},x]
{k->1.18709*10^6,a->0.938482}
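
The error message shows the model arriving as Function[\[FormalX], ...],
which is what CDF[dist] returns when it is called without a second
argument, so FindFit never receives an expression in x. A minimal sketch,
assuming that the missing argument (rather than the Piecewise conditions)
is the cause; the starting values and the names model, aExact and kExact
are illustrative assumptions, not part of the original session:

```mathematica
ecdf = {{2000000, 0.3871}, {4000000, 0.6802}};

(* CDF[dist, x] is an expression in x; CDF[dist] alone is a pure function *)
model = CDF[ParetoDistribution[k, a], x];

(* assumed starting values: k should start below the smallest cutoff so
   that both data points fall in the x >= k branch of the Piecewise model *)
FindFit[ecdf, model, {{k, 1000000}, {a, 1}}, x]

(* cross-check: two points determine the two parameters exactly,
   by solving 1 - (x/k)^(-a) == p at both points *)
{{x1, p1}, {x2, p2}} = ecdf;
aExact = Log[(1 - p1)/(1 - p2)]/Log[x2/x1]
kExact = x1 (1 - p1)^(1/aExact)
```

With only two points the fit is an interpolation, so the closed-form
values should agree with the hard-coded FindFit result above up to
rounding.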



On Fri, May 11, 2012 at 11:37 AM, Darren Glosemeyer <darreng at wolfram.com> wrote:

> On 5/10/2012 11:14 PM, László Sándor wrote:
>
>> Hi all,
>>
>> I have a project (with Mathematica 8) where the first step would be to
>> get the distribution describing my "data", for which I actually only have
>> quantiles (or worse: frequencies for arbitrary bins).
>> EstimatedDistribution[] looks promising, but I don't know how to feed in
>> this kind of data. Please let me know if you know a fast way.
>>
>> Thanks!
>>
>>
>>
> There isn't enough information in your data for the types of estimation
> done by EstimatedDistribution.
>
> The type of information you have in your data would lend itself well to a
> least squares fit to the cdf of the distribution. As an example, let's take
> this data:
>
>
> In[1]:= data = BlockRandom[SeedRandom[1234];
>           RandomVariate[GammaDistribution[5, 8], 100]];
>
> We can use Min and Max to see the range of values and then bin within that
> range to construct cutoff and frequency data.
>
> In[2]:= {Min[data], Max[data]}
>
> Out[2]= {13.7834, 112.429}
>
>
> Here, xvals are the cutoffs and counts are the bin frequencies.
>
> In[3]:= {xvals, counts} = HistogramList[data, {{0, 15, 20, 50, 100, 120}}]
>
> Out[3]= {{0, 15, 20, 50, 100, 120}, {1, 6, 55, 37, 1}}
>
>
> We can get the accumulated probabilities as follows.
>
> In[4]:= probs = Accumulate[counts]/Length[data]
>
> Out[4]= {1/100, 7/100, 31/50, 99/100, 1}
>
>
> The analogue of your quantile values would be the right endpoints,
> Rest[xvals].
>
> In[5]:= quantiles = Rest[xvals]
>
> Out[5]= {15, 20, 50, 100, 120}
>
>
> Now we can use the quantiles as the x values and the cdf values as the y
> values for a least squares fitting to the CDF (parameters may need starting
> values in general, but defaults worked fine in this case):
>
> In[6]:= FindFit[Transpose[{quantiles, probs}],
>         CDF[GammaDistribution[a, b], x], {a, b}, x]
>
> Out[6]= {a -> 5.24009, b -> 8.88512}
>
>
> Given that we know that the data don't extend to the right limit of a
> gamma's support (gammas can be any positive values), we may want to adjust
> the cdf values a bit. The following will shift all the cdf values by
> 1/(2*numberOfDataPoints) in this particular case:
>
> In[7]:= FindFit[Transpose[{quantiles, probs - 1/(2 Length[data])}],
>         CDF[GammaDistribution[a, b], x], {a, b}, x]
>
> Out[7]= {a -> 5.3696, b -> 8.73319}
>
>
> Darren Glosemeyer
> Wolfram Research
>

References:
- question: fitting a distribution from quantiles
  - From: László Sándor <sandorl@gmail.com>
