MathGroup Archive 2006


Re: Reconciling BinCounts and RangeCounts

  • To: mathgroup at smc.vnet.net
  • Subject: [mg68044] Re: Reconciling BinCounts and RangeCounts
  • From: Bill Rowe <readnewsciv at earthlink.net>
  • Date: Fri, 21 Jul 2006 05:37:37 -0400 (EDT)
  • Sender: owner-wri-mathgroup at wolfram.com

On 7/20/06 at 6:04 AM, gregory.lypny at videotron.ca (Gregory Lypny)
wrote:

>I get a discrepancy between the results of BinCounts and RangeCounts
>and can confirm only that RangeCounts is, in fact, counting the
>number of instances where a number is at least as big as the lower
>cut-off and less than the upper cut-off.  Not so for BinCounts,
>which leads me to believe that it is buggy or, more likely, I am.

>I have a vector, x, with 7320 observations of real numbers in the
>range .06 to .14 with up to seven decimal places.  Here's what I get
>if I use bins or cut-offs of .01.

>First with BinCounts:
>BinCounts[x, {.06, .14, .01}]
>{103, 333, 802, 1266, 997, 662, 611, 2265, 281}

>Now with RangeCounts:
>RangeCounts[x, Range[.07, .14, .01]]
>{103, 333, 797, 1270, 997, 663, 611, 2265, 281}

>Notice that elements 3, 4, and 6 of the results differ.  So I tried
>to check what was going on by using Select and was able to confirm
>all of the RangeCounts elements.  For example, the third element of
>the RangeCounts results, 797, can be confirmed by using

>Length[Select[x, .08 <= # < .09 &]] returns 797

>However, the third element of the BinCounts results, 802, can be
>obtained only if I include the upper bound, .09, in the count as

>Length[Select[x, .08 <= # <= .09 &]] returns 802,

>which of course makes no sense because we need a strict inequality
>for one of them.  But it gets worse.  When I go on to check elements
>4 and 6 of BinCounts, there is no combination of weak or strict
>inequalities that will give me the results 1266 and 662.

>Can anyone shed any light on this?  In the meantime, I think it
>safest to use RangeCounts.

It isn't so much a case of buggy code as it is a consequence of machine-precision numbers and the different ways the two functions do their counting.

Looking at the code for BinCounts, the data list is converted to a list of integers using

Ceiling[(dataList - dataMin)/dx]

which gives, for each element of the data, the number of the bin it falls in. The problem is that there can be cases where the nearest machine number to (x - min)/dx is less than the integer n and other cases where it is greater than n. That is, Ceiling[x/y] doesn't always return what you expect for machine numbers x and y, so the effective bin edges won't always be exactly where you expect them.
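The effect is easy to reproduce outside Mathematica. Here is a minimal Python sketch of the same arithmetic (not Wolfram's actual code), using the poster's dataMin and bin width, showing the ceiling of a machine-precision quotient landing one bin too high for a value sitting exactly on a nominal edge:

```python
import math

# A value exactly on the nominal edge between the .07-.08 and
# .08-.09 bins, with the poster's dataMin and bin width.
x, data_min, dx = 0.08, 0.06, 0.01

# With exact arithmetic, (0.08 - 0.06)/0.01 == 2, so the ceiling
# would assign bin 2.  With machine doubles the quotient rounds to
# just above 2, and the ceiling pushes the value into bin 3.
q = (x - data_min) / dx
print(q)             # 2.0000000000000004
print(math.ceil(q))  # 3, not 2
```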

In contrast, the code for RangeCounts builds a tree structure for the cutoffs using MakeTree (see DiscreteMath`Tree`) and then maps TreeFind over the data list using this tree. This has the effect of directly comparing the cutoff values with the elements of the data list, something that does not happen with BinCounts.
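For comparison, here is a minimal Python sketch of the direct-comparison idea (using the standard library's bisect in place of the MakeTree/TreeFind tree, which serves the same purpose):

```python
import bisect

def range_counts(data, cutoffs):
    """Count how many values fall below the first cutoff, in each
    half-open range [cutoffs[i], cutoffs[i+1]), and at or above the
    last cutoff, by comparing values directly against the cutoffs."""
    counts = [0] * (len(cutoffs) + 1)
    for x in data:
        # The number of cutoffs <= x identifies the range directly;
        # no subtraction or division, hence no rounding surprises.
        counts[bisect.bisect_right(cutoffs, x)] += 1
    return counts

# The edge value 0.08 is counted in [0.08, 0.09), as expected.
print(range_counts([0.075, 0.08, 0.085, 0.095],
                   [0.07, 0.08, 0.09]))  # [0, 1, 2, 1]
```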

Arguably, RangeCounts is more accurate, since it does a direct comparison between the cutoffs and the elements of the data list. But the price paid is speed. That is

In[14]:=
data = Table[Random[], {10^4}]; 

In[15]:=
Timing[BinCounts[data, {0., 1., 0.1}]]

Out[15]=
{0.0075 Second, {1026, 972, 1043, 1049, 
   1010, 950, 912, 1002, 1045, 991}}

In[16]:=
Timing[RangeCounts[data, Range[0., 1., 0.1]]]

Out[16]=
{0.705 Second, {0, 1026, 972, 1043, 1049, 
   1010, 950, 912, 1002, 1045, 991, 0}}

And since the two will differ by only a few counts near the bin edges, for most applications the difference is insignificant.
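A quick Python cross-check of that claim (a sketch of the two counting strategies, not the Wolfram internals): on data that doesn't sit exactly on the bin edges, the ceiling-based and comparison-based counts come out identical.

```python
import bisect
import math
import random

def bin_counts(data, lo, dx, n):
    # BinCounts-style: arithmetic bin index via a ceiling.
    counts = [0] * n
    for x in data:
        i = math.ceil((x - lo) / dx)
        if 1 <= i <= n:
            counts[i - 1] += 1
    return counts

def range_counts(data, cutoffs):
    # RangeCounts-style: direct comparison against the cutoffs.
    counts = [0] * (len(cutoffs) + 1)
    for x in data:
        counts[bisect.bisect_right(cutoffs, x)] += 1
    return counts

random.seed(0)
data = [random.random() for _ in range(10_000)]
cutoffs = [k / 10 for k in range(11)]  # 0.0, 0.1, ..., 1.0

a = bin_counts(data, 0.0, 0.1, 10)
b = range_counts(data, cutoffs)
# Drop the two open-ended outer ranges and compare.
print(a == b[1:-1])
```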
--
To reply via email subtract one hundred and four

