MathGroup Archive: May 2012 [00158]

Re: question: fitting a distribution from quantiles

To: mathgroup at smc.vnet.net
Subject: [mg126455] Re: question: fitting a distribution from quantiles
From: Darren Glosemeyer <darreng at wolfram.com>
Date: Sat, 12 May 2012 04:52:06 -0400 (EDT)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <201205110414.AAA23695@smc.vnet.net> <4FAD3253.3030007@wolfram.com> <CAG-ehZ3=yMw1VvrU_1RMdD=hH_0nJ5ahP+G-Om1nJORRHD1BWg@mail.gmail.com>

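There's a typo in the code using CDF directly: CDF[ParetoDistribution[k, a]] 
was called without its second argument, so it returns a pure function 
rather than an expression in x. For the cdf, you need to use 
CDF[ParetoDistribution[k, a], x]; then it will work fine.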
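
For concreteness, here is a minimal sketch of the corrected call on the 
two-point data from the message below. The starting values in the second 
call are only an illustrative assumption (they are not from the original 
session); the defaults may well be sufficient.

```mathematica
(* two {x, cumulative probability} pairs from the quoted message *)
ecdf = {{2000000, 0.3871}, {4000000, 0.6802}};

(* pass x as the second argument so the model is an expression in x *)
FindFit[ecdf, CDF[ParetoDistribution[k, a], x], {k, a}, x]

(* same fit with explicit starting values (assumed, for illustration) *)
FindFit[ecdf, CDF[ParetoDistribution[k, a], x], {{k, 10^6}, {a, 1}}, x]
```

The result should be comparable to the hard-coded 1-(x/k)^(-a) fit quoted 
below, since that expression is the same Pareto cdf for x >= k.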

Darren Glosemeyer
Wolfram Research

On 5/11/2012 10:55 AM, László Sándor wrote:
> Thank you, Darren!
>
> I realized soon (long before the delay caused by the moderation of the 
> list) that I could fit a CDF. This even works with ridiculously 
> aggregated data, e.g. only two (inverse) quantiles for a Pareto 
> distribution. However, FindFit did not work with Mathematica's 
> representation of the CDF (conditions?), only a hard-coded one.
>
> But before I paste my output (with a lengthy error message) below, let 
> me ask another question: What exactly are the benefits of keeping a 
> distribution object in the background? Am I just as well off with a 
> (smoothed) CDF, plugging it, or its transformations and integrals, in 
> everywhere?
>
> Basically, I want to use an empirical distribution in three ways:
> -- "keep it as it is" (though it must be smoothed / approximated), as I 
> do need PDFs even though, like any real data, it comes discrete
> -- fit a parametric distribution and use that everywhere where I would 
> have used the empirical one
> -- fit a mixture of parametric distributions (actually, it might be a 
> special mixture: I might concatenate two different (truncated) CDFs for 
> different parts -- real incomes have a Pareto right tail but an 
> obviously non-Pareto bottom).
>
> Is it a good idea to try to keep these as distributions, or, as most 
> of my calculations will need to be numeric anyway, can I give up early 
> and use the CDFs?
>
> Thanks!
>
> Now the output for yesterday:
>
> originalecdf = 
> {{500000,0.0182},{1000000,0.1003},{1500000,0.2487},{2000000,0.3871},{4000000,0.6802}}
> ecdf = {{2000000,0.3871},{4000000,0.6802}}
> FindFit[ecdf,CDF[ParetoDistribution[k,a]],{k,a},x]
>
> FindFit::nrlnum: The function value 
> {-0.3871+Function[\[FormalX],\[Piecewise]{{1.+Times[<<2>>], \[FormalX]>=k}, {0., True}},Listable],
>  -0.6802+Function[\[FormalX],\[Piecewise]{{1.+Times[<<2>>], \[FormalX]>=k}, {0., True}},Listable]}
>  is not a list of real numbers with dimensions {2} at {k,a} = {1.,1.}. >>
>
> FindFit[ecdf,1-(x / k)^(-a),{k,a},x]
> {k->1.18709*10^6,a->0.938482}
>
>
>
> On Fri, May 11, 2012 at 11:37 AM, Darren Glosemeyer 
> <darreng at wolfram.com> wrote:
>
>     On 5/10/2012 11:14 PM, László Sándor wrote:
>
>         Hi all,
>
>         I have a project (with Mathematica 8) where the first step
>         would be to get the distribution describing my "data", for
>         which I actually only have quantiles (or worse: frequencies for
>         arbitrary bins). EstimatedDistribution[] looks promising, but
>         I don't know how to feed in this kind of data. Please let me
>         know if you know a fast way.
>
>         Thanks!
>
>
>
>     There isn't enough information in your data for the types of
>     estimation done by EstimatedDistribution.
>
>     The type of information you have in your data would lend itself
>     well to a least squares fit to the cdf of the distribution. As an
>     example, let's take this data:
>
>
>     In[1]:= data = BlockRandom[SeedRandom[1234];
>               RandomVariate[GammaDistribution[5, 8], 100]];
>
>     We can use Min and Max to see the range of values and then bin
>     within that range to construct cutoff and frequency data.
>
>     In[2]:= {Min[data], Max[data]}
>
>     Out[2]= {13.7834, 112.429}
>
>
>     Here, xvals are the cutoffs and counts are the bin frequencies.
>
>     In[3]:= {xvals, counts} = HistogramList[data, {{0, 15, 20, 50,
>     100, 120}}]
>
>     Out[3]= {{0, 15, 20, 50, 100, 120}, {1, 6, 55, 37, 1}}
>
>
>     We can get the accumulated probabilities as follows.
>
>     In[4]:= probs = Accumulate[counts]/Length[data]
>
>     Out[4]= {1/100, 7/100, 31/50, 99/100, 1}
>
>
>     The analogue of your quantile values would be the right endpoints,
>     Rest[xvals].
>
>     In[5]:= quantiles = Rest[xvals]
>
>     Out[5]= {15, 20, 50, 100, 120}
>
>
>     Now we can use the quantiles as the x values and the cdf values as
>     the y values for a least squares fitting to the CDF (parameters
>     may need starting values in general, but defaults worked fine in
>     this case):
>
>     In[6]:= FindFit[Transpose[{quantiles, probs}],
>     CDF[GammaDistribution[a, b], x], {a, b}, x]
>
>     Out[6]= {a -> 5.24009, b -> 8.88512}
>
>
>     Given that we know that the data don't extend to the right limit
>     of a gamma's support (gammas can take any positive value), we may
>     want to adjust the cdf values a bit. The following shifts all
>     the cdf values down by 1/(2*numberOfDataPoints), which is 0.005 in
>     this particular case:
>
>     In[7]:= FindFit[Transpose[{quantiles, probs - 1/(2 Length[data])}],
>             CDF[GammaDistribution[a, b], x], {a, b}, x]
>
>     Out[7]= {a -> 5.3696, b -> 8.73319}
>
>
>     Darren Glosemeyer
>     Wolfram Research
>
>

References:
- question: fitting a distribution from quantiles
  - From: László Sándor <sandorl@gmail.com>
