Re: question: fitting a distribution from quantiles
- To: mathgroup at smc.vnet.net
- Subject: [mg126466] Re: question: fitting a distribution from quantiles
- From: László Sándor <sandorl at gmail.com>
- Date: Sat, 12 May 2012 04:55:54 -0400 (EDT)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <201205110414.AAA23695@smc.vnet.net> <4FAD3253.3030007@wolfram.com>
Oh, shame, sorry about this. And thanks, of course!

Laszlo

On Fri, May 11, 2012 at 12:01 PM, Darren Glosemeyer <darreng at wolfram.com> wrote:

> There's a typo in the code using CDF directly. For the cdf, you need to
> use CDF[ParetoDistribution[k, a], x], then it will work fine.
>
> Darren Glosemeyer
> Wolfram Research
>
>
> On 5/11/2012 10:55 AM, László Sándor wrote:
>
> Thank you, Darren!
>
> I realized soon (well before the delay caused by the moderation of the
> list) that I could fit a CDF. This even works with ridiculously
> aggregated data, e.g. only two (inverse) quantiles for a Pareto
> distribution. However, FindFit did not work with Mathematica's
> representation of the CDF (conditions?), only with a hard-coded one.
>
> But before I paste my output (with a lengthy error message) below, let
> me ask another question: What exactly are the benefits of keeping a
> distribution object in the background? Am I just as well off with a
> (smoothed) CDF, plugging it or its transformations and integrals in
> everywhere?
>
> Basically, I want to use an empirical distribution in three ways:
> -- "keep it as it is" (though it must be smoothed / approximated), as I do
> need PDFs even though, like any real data, it comes discrete
> -- fit a parametric distribution and use that everywhere I would
> have used the empirical one
> -- fit a mixture of parametric distributions (actually, it might be a
> special mixture: I might concatenate two different (truncated) CDFs for
> different parts -- real incomes have a Pareto right tail but an obviously
> non-Pareto bottom).
>
> Is it a good idea to try to keep these as distributions, or, as most of
> my calculations will need to be numeric anyway, can I give up early and
> use the CDFs?
>
> Thanks!
>
> Now the output for yesterday:
>
> originalecdf =
> {{500000,0.0182},{1000000,0.1003},{1500000,0.2487},{2000000,0.3871},{4000000,0.6802}}
> ecdf = {{2000000,0.3871},{4000000,0.6802}}
> FindFit[ecdf,CDF[ParetoDistribution[k,a]],{k,a},x]
>
> FindFit::nrlnum: The function value
> {-0.3871+Function[\[FormalX],\[Piecewise] 1. +Times[<<2>>]   \[FormalX]>=k
>                                           0.                True
> ,Listable],-0.6802+Function[\[FormalX],\[Piecewise] 1. +Times[<<2>>]   \[FormalX]>=k
>                                                     0.                True
> ,Listable]}
> is not a list of real numbers with dimensions {2} at {k,a} = {1.,1.}. >>
>
> FindFit[ecdf,1-(x / k)^(-a),{k,a},x]
> {k->1.18709*10^6,a->0.938482}
>
>
> On Fri, May 11, 2012 at 11:37 AM, Darren Glosemeyer <darreng at wolfram.com> wrote:
>
>> On 5/10/2012 11:14 PM, László Sándor wrote:
>>
>>> Hi all,
>>>
>>> I have a project (with Mathematica 8) where the first step would be to
>>> get the distribution describing my "data", which actually only has
>>> quantiles (or worse: frequencies for arbitrary bins).
>>> EstimatedDistribution[] looks promising, but I don't know how to feed in
>>> this kind of data. Please let me know if you know a fast way.
>>>
>>> Thanks!
>>>
>> There isn't enough information in your data for the types of estimation
>> done by EstimatedDistribution.
>>
>> The type of information you have in your data would lend itself well to a
>> least squares fit to the cdf of the distribution. As an example, let's take
>> this data:
>>
>> In[1]:= data = BlockRandom[SeedRandom[1234];
>>    RandomVariate[GammaDistribution[5, 8], 100]];
>>
>> We can use Min and Max to see the range of values and then bin within
>> that range to construct cutoff and frequency data.
>>
>> In[2]:= {Min[data], Max[data]}
>>
>> Out[2]= {13.7834, 112.429}
>>
>> Here, xvals are the cutoffs and counts are the bin frequencies.
>>
>> In[3]:= {xvals, counts} = HistogramList[data, {{0, 15, 20, 50, 100, 120}}]
>>
>> Out[3]= {{0, 15, 20, 50, 100, 120}, {1, 6, 55, 37, 1}}
>>
>> We can get the accumulated probabilities as follows.
>>
>> In[4]:= probs = Accumulate[counts]/Length[data]
>>
>> Out[4]= {1/100, 7/100, 31/50, 99/100, 1}
>>
>> The analogue of your quantile values would be the right endpoints,
>> Rest[xvals].
>>
>> In[5]:= quantiles = Rest[xvals]
>>
>> Out[5]= {15, 20, 50, 100, 120}
>>
>> Now we can use the quantiles as the x values and the cdf values as the y
>> values for a least squares fitting to the CDF (parameters may need starting
>> values in general, but defaults worked fine in this case):
>>
>> In[6]:= FindFit[Transpose[{quantiles, probs}],
>>    CDF[GammaDistribution[a, b], x], {a, b}, x]
>>
>> Out[6]= {a -> 5.24009, b -> 8.88512}
>>
>> Given that we know that the data don't extend to the right limit of a
>> gamma's support (gammas can be any positive values), we may want to adjust
>> the cdf values a bit. The following will shift all the cdf values by
>> 1/(2*numberOfDataPoints) in this particular case:
>>
>> In[7]:= FindFit[Transpose[{quantiles, probs - 1/(2 Length[data])}],
>>    CDF[GammaDistribution[a, b], x], {a, b}, x]
>>
>> Out[7]= {a -> 5.3696, b -> 8.73319}
>>
>> Darren Glosemeyer
>> Wolfram Research
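
For reference, here is a minimal sketch of the correction described in the reply above, using the two data points quoted in the thread. The names fit, aExact, kExact, x1, p1, x2 and p2 are illustrative and do not appear in the original posts. The output of the corrected FindFit call was not posted, so none is shown; explicit starting values such as {{k, 10^6}, {a, 1}} may be needed. The closed-form check is an addition, not something either poster wrote: with two points and two parameters, the Pareto cdf 1 - (k/x)^a can be matched exactly, and the result should agree with the hard-coded fit quoted above (k about 1.187*10^6, a about 0.938).

```mathematica
(* The two aggregated points quoted in the thread *)
ecdf = {{2000000, 0.3871}, {4000000, 0.6802}};

(* Corrected call: the CDF is evaluated at x, as suggested in the reply.
   x, k and a are assumed to be symbols with no assigned values. *)
fit = FindFit[ecdf, CDF[ParetoDistribution[k, a], x], {k, a}, x]

(* Closed-form cross-check (an assumption-free identity for two points):
   (k/x)^a == 1 - p at both points, so taking logs and subtracting
   eliminates k and gives a directly. *)
{{x1, p1}, {x2, p2}} = ecdf;
aExact = Log[(1 - p1)/(1 - p2)]/Log[x2/x1]
kExact = x1 (1 - p1)^(1/aExact)

(* The fitted parameters can be wrapped back into a distribution object;
   evaluating its CDF at the second point should recover about 0.6802. *)
CDF[ParetoDistribution[kExact, aExact], 4000000]
```

Keeping the result as ParetoDistribution[kExact, aExact] rather than as a bare formula is one way to reuse it with functions such as PDF or Quantile, though whether that is worth it depends on how much of the later work is purely numeric.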
- References:
- question: fitting a distribution from quantiles
- From: László Sándor <sandorl@gmail.com>