Re: fit a BinomialDistribution to exptl data?

*To*: mathgroup at smc.vnet.net*Subject*: [mg80915] Re: fit a BinomialDistribution to exptl data?*From*: Mark Fisher <particlefilter at gmail.com>*Date*: Wed, 5 Sep 2007 02:58:09 -0400 (EDT)*References*: <11420600.1188458042685.JavaMail.root@m35>

On Sep 1, 12:43 am, Ray Koopman <koop... at sfu.ca> wrote: > Also, it is possible that > the ML estimate of n may not exist because the likelihood, maximized > w.r.t. p given n, may be monotone increasing in n, as is the case > with the following 200 samples from a Binomial[20,.2] distribution: > > {x = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, > f = {3, 21, 28, 37, 43, 28, 22, 8, 6, 2, 1, 1}}; > > which has {mean, variance} = {3.875, 4.02952}. It is not obvious > (except perhaps to a Bayesian) what should be done in cases such as > this. It seems this thread, which began as a discussion of fitting a functional forms, has evolved into a discussion of statistical inference. (The two subjects are related, but distinct.) My post is intended to contribute to the statistical inference discussion by providing a Bayesian perspective. (The code is written using Version 6.) The Bayesian approach produces probability statements about the "unknowns" given the the data. In this context, the data are those given by Ray Koopman and the unknowns are n and p, the parameters of the binomial distribution. In order to make probability statements about about the joint distribution for (n,p) given the data, the rules of probability require us to specify a prior distribution that characterizes what we knew before we saw the data. Some notation will be helpful. Let f(n,p) denote the joint prior distribution, let f(n,p|D) denote the joint posterior distribution given the data D, and let f(D|n,p) denote the sampling distribution of the data given the parameters. (In a Bayesian analysis, the sampling distribution becomes the "likelihood" of the parameters given the data.) Bayes' rule says f(n,p|D) is proportional to g(n,p) := f(n,p) * f(D|n,p). The typical procedure is to generate draws from f(n,p|D) using MCMC techniques (Markov Chain Monte Carlo). In this example, I take the following route instead. First, factor the prior into marginal and conditional: f(n,p) = f(n) * f(p|n). Second, integrate p out of both sides of Bayes' rule, so that f(n|D) is proportional to g(n) := f(D|n) * f(n). Finally, generate draws from the joint posterior by drawing successively from f(n|D) and then from f(p|n,D), the latter of which is proportional to g(p) := f(D|n,p) * f(p|n). For this example, I characterize the prior knowledge about n and p as follows: (1) n and p are independent, so that f(n,p) = f(n) * f(p), (2) all values of p are equally likely, so that f(p) = 1 for 0 <= p <= 1, and (3) the probability of n is given by f(n) = .03*(1-.03)^(n-1). (I'll comment on these choices at the end.) pprior[p_] = 1 nprior[n_] = .03*(1-.03)^(n-1) The data are those given by Ray Koopman: data0 = Transpose[{ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, {3, 21, 28, 37, 43, 28, 22, 8, 6, 2, 1, 1}}]; data = Flatten[ConstantArray @@@ data0]; The likelihood: jointlikelihood[n_, p_] = Times @@ (PDF[BinomialDistribution[n, p], #] & /@ data); An aside: Ray Koopman's comment about the absence of a maximum refers to h(n) := max_p f(D|n,p), which is increasing in n. This is illustrated by ListLinePlot[Table[ FindMaximum[Log[jointlikelihood[n, p]], {p, .05, .06}][[1]], {n, 11, 100}], DataRange -> {11, 100}] Now compute f(D|n) by *integrating* p out (rather than *maximizing* p out). (Although I specify the assumption n >= 11, the resulting expression is valid for n < 11 as well: the likelihood of n for n < 11 is zero.) Even ignoring the prior for n (which downweights large values of n), the marginal likelihood for n has a maximum (at n=53): nlikelihood[n_] = Integrate[jointlikelihood[n, p] * pprior[p], {p, 0, 1}, Assumptions -> n >= 11]; Compute f(n|D) from f(D|n) * f(n). (I have chosen the upper bound of 300 for convenience; the prior probability of n > 300 is about 10^(-4).) From f(n|D), i.e., nposteriorList, the most likely value for n is 36 (nmax) and, given n=nmax, the maximum value for p is about 0.11 (pmax): nposteriorList = #/Total[#]& @ Table[nlikelihood[n] nprior[n], {n, 1, 300}]; ListLinePlot[nposteriorList, PlotRange -> All] nmax = SortBy[Transpose[{Range[300], nposteriorList}], Last][[-1, 1]] pmax = p /. Last@FindMaximum[Log[jointlikelihood[nmax, p]], {p, .1, . 15}] Now compute f(n,p|D) by drawing from f(n|D) and f(p|n,D). By examining the f(D|n,p), one can determine that f(p|n,D) = PDF[BetaDistribution[776, 200 n - 774], p]. We see that n and p are highly negatively correlated and the the most likely value for p is noticably less than pmax: draws = Table[With[{ n = RandomChoice[nposteriorList[[11 ;;]] -> Range[11,300]]}, {n, RandomReal[BetaDistribution[776, 200 n - 774]]}], {10^5}]; bc = BinCounts[draws, {.5, 300.5, 1}, {0, .35, .005}]; ListDensityPlot[Transpose[bc], ColorFunction -> (GrayLevel[1 - #^(.5)] &), DataRange -> {{1, 300}, {0, .35}}, PlotRange -> All] ListLinePlot[Total[bc, {2}], DataRange -> {1, 300}, GridLines -> {{nmax}, {}}, PlotRange -> All] ListLinePlot[Total[bc], DataRange -> {0, .35}, GridLines -> {{pmax}, {}}] Now that we have the joint posterior distribution, we can compute the posterior distribution of any function of n and p. For example the mean of the binomial distribution, n*p, which is tightly concentrated around mean of the data (3.875): Through[{Mean, StandardDeviation}[Times @@@ draws]] We can also compute the uncertainty associated with the binomial CDF (which connects us up with fitting functional forms): Needs["ErrorBarPlots`"] meansd = (Through[{Mean, StandardDeviation}[#]] & /@ Transpose[Function[{n, p}, Table[CDF[BinomialDistribution[n, p], x], {x, 0, 15}] // Evaluate] @@@ Take[draws, 1000]]); ErrorListPlot[Transpose[{ Transpose[{Range[0, 15], meansd[[All, 1]]}], ErrorBar[2 #] & /@ meansd[[All, 2]]}]] In Bayesian analysis, one uses "decision theory" to get a "point estimate". A point estimate minimizes the expected loss computed with respect to the posterior distribution, where the "loss function" characterizes the cost of choosing the point (n1,p1) when in fact (n0,p0) is true. For example, quadratic loss leads to choosing the posterior mean, while a "zero-one" loss function leads to choosing the maximum posterior value. The prior I chose above is intended to illustrate the analysis. In a real-world situation with real-world data, one should incorporate what one knows into the prior. For example, one might know from experience that the probability that p is greater than 1/2 is quite small; or one might know that n can't be larger than 35. Ultimately, if the thing you care about is sensitive to the prior, then you should think carefully about the appropriate prior. --Mark