MathGroup Archive: April 2007 [00734]

[Date Index] [Thread Index] [Author Index]

Re: Finding cardinality of set based on random selection from set

To: mathgroup at smc.vnet.net
Subject: [mg75435] Re: Finding cardinality of set based on random selection from set
From: Ray Koopman <koopman at sfu.ca>
Date: Mon, 30 Apr 2007 03:43:04 -0400 (EDT)
References: <26face530704242008sc104934o891814d2be51bcf1@mail.gmail.com>

On Apr 28, 2:55 am, "Kelly Jones" <kelly.terry.jo... at gmail.com> wrote:
> How do I use Mathematica to solve this problem?
>
> I selected with replacement 100 times from a set of N objects.
> The results:
>
>  1 item came up 4 times
>  1 item came up 3 times
> 10 items came up 2 times each
> 73 items came up 1 time each
> the other N-85 items did not come up at all
>
> Based on this histogram, what's the
> 1) most likely value of N, and
> 2) the probability distribution of N?
>
> Of course, I'm actually interested in the general solution:
> if I select with replacement M times from a set of N objects,
> and get a resulting histogram H, what's the probability
> distribution of N with respect to M and H?

Mathematica can do the scutwork for getting P[H|N], the conditional
probability of the histogram given the set size; but P[N|H], the
conditional probability of the set size given the histogram, is a
Bayesian concept that requires an a priori estimate of the
probability
distribution of the set sizes.

I'm switching all the variables to lowercase to avoid problems
implementing this in Mathematica. You'll have to make up your own
names for the several different probability functions "P" that are
distinguished here by the symbols used for their arguments.

Let s be a list in which s[[i]] = the number of times object i was
selected. Then Length@s = n = set size, Total@s = m = number of
draws,
Total@Sign@s = k = the number of different objects selected, and the
probability of s is P[s] = Multinomial@@s / n^m.

Let h = Count[s,#]&/@Range@Max@s. Then h[[j]] = the number of items
that came up j times, Total@h = k, h.Range@Length@h = m, and the
probability of h is P[h|n] = P[s]*R[s], where R[s] = the number of
rearrangements of s that give the same h.

Both P[s] and R[s] can be expressed in terms of h,k,m,n,
without reference to s itself, which is not observed.

Multinomial@@s = Multinomial@@t,
where t = Flatten@Table[j,{j,Length@h},{h[[j]]}],
and R[s] = Binomial[n,k] Multinomial@@h,
so we end up with

P[h|n] = Multinomial@@t Binomial[n,k] Multinomial@@h / n^m.

For h = {73,10,1,1}, the value of n that maximizes p[h|n] --
i.e., the maximum likelihood estimate of N -- is 296.

However, the maximum likelihood estimate of N is not the "most likely
value of N" that you requested. That would be the value of n that
maximizes the Bayesian posterior probability

             P[h|n]*P[n]
P[n|h] = --------------------.
         sum_n' P[h|n']*P[n']

To get P[n|h] you must supply a prior distribution P[n].

Prev by Date: elimination of a real variable from a complex function

Next by Date: Re: maximum entropy method for deconvolution

Previous by thread: Finding cardinality of set based on random selection from set

Next by thread: Fourier and InverseFourier