Re: Efficient Histogram Algorithm?

• To: mathgroup at smc.vnet.net
• Subject: [mg113083] Re: Efficient Histogram Algorithm?
• From: Darren Glosemeyer <darreng at wolfram.com>
• Date: Tue, 12 Oct 2010 13:49:42 -0400 (EDT)

```  On 10/12/2010 3:26 AM, Ray Koopman wrote:
> On Oct 11, 2:17 am, Julian Francis<julian.w.fran... at gmail.com>  wrote:
>> Dear all,
>>
>> Does anyone know of a way to implement a fast histogram algorithm in
>> Mathematica?
>>
>> My first attempt (In44) was based around a table and using the count
>> function. A bit of thought indicates that this might not be very
>> efficient as it would seem to involve examining the list 255 times,
>> when only once is necessary. The second attempt involved using the
>> built in library function BinCounts, but surprisingly this was only
>> about twice as fast as my initial naive version, which makes me wonder
>> if it is as efficient as it could be. I can assume for my purposes
>> that the list will always have integer values between 0 and 255, and
>> that there are 256 bins.
>>
>> I wasn't able to think of a way to implement the histogram function in
>> a functional programming style that was also efficient, although it is
>> fairly easy to write in an imperative language. This may reflect the
>> vast majority of my programming experience has been in imperative
>> languages.
>>
>> I am using an old version of Mathematica (v4.1) (but on a modern PC);
>> but if anyone has dramatically different times for the BinCounts
>> function, then maybe it is time for me to upgrade.
>>
>> My motivation is a pattern recognition algorithm I am implementing,
>> but it requires a large number of histograms to be created, hence the
>> running time of an individual histogram is actually quite important.
>>
>> Thanks for any help,
>> Julian.
>>
>> In[44]:=
>> jhistogram[list_] := Table[Count[list, n], {n, 0, 255}]
>>
>> In[51]:=
>> \!\(Timing[\(randomList = Table[Random[Integer, 255], {10\^6}];\)]\)
>>
>> Out[51]=
>> {1.373 Second, Null}
>>
>> In[54]:=
>> Timing[hist = jhistogram[randomList];]
>>
>> Out[54]=
>> {2.34 Second, Null}
>>
>> In[56]:=
>> <<  Statistics`
>>
>> In[57]:=
>> Timing[hist = BinCounts[randomList, {0, 255}];]
>>
>> Out[57]=
>> {0.936 Second, Null}
> Here's a compiled routine that's twice as fast as BinCounts. Also,
> note that you need to give BinCounts a non-intuitive set of bins.
>
> In[1]:=<<  Statistics`
>
> In[2]:= jhistogram[list_] := Table[Count[list,n],{n,0,255}]
>
> In[3]:= freqs = Compile[{{data,_Integer,1}},
>          Module[{bins = Table[0,{256}]},
>          Scan[bins[[#]]++&,data+1]; bins]];
>
> In[4]:= Timing@Length[randomList = Table[Random[Integer,255],{10^6}]]
>          Timing@Length[hist1 = jhistogram[randomList]]
>          Timing@Length[hist2 = BinCounts[randomList,{-1,255}]]
>          Timing@Length[hist3 = freqs[randomList]]
>          SameQ[hist1,hist2,hist3]
>
> Out[4]= {1.26 Second,1000000}
> Out[5]= {2.5 Second,256}
> Out[6]= {0.9 Second,256}
> Out[7]= {0.45 Second,256}
> Out[8]= True
>

BinCounts is really a function for continuous values. Given that we know
the values of interest are integers from 0 to 255, this could be made
even faster by using Tally (added in version 6).

In[1]:= jhistogram[list_] := Table[Count[list, n], {n, 0, 255}]

In[2]:= freqs = Compile[{{data, _Integer, 1}},
Module[{bins = Table[0, {256}]}, Scan[bins[[#]]++ &, data + 1];
bins]];

In[3]:= Timing@Length[randomList = Table[Random[Integer, 255], {10^6}]]

Out[3]= {0.391, 1000000}

In[4]:= Timing@Length[hist1 = jhistogram[randomList]]

Out[4]= {0.531, 256}

In[5]:= tallybased = With[{tallies = Tally[randomList]},
Split[
Sort[Join[tallies,
Transpose[{Range[0, 255],
ConstantArray[0,
256]}]]], #[[1]] == #2[[1]] &][[All, -1, -1]]]; //
Timing

-17
Out[5]= {1.40946 10   , Null}

In[6]:= SameQ[hist1, tallybased]

Out[6]= True

Tally gives a list of {value, count} pairs for the elements in the list.
The code above works by getting the tallies for the values in list,
adding elements of the form {i, 0} (this is done so we can make sure a 0
count is present if the value is not in the list) for i from 0 to 255,
sorting so like i values are next to each other and the 0 count will
come first, splitting based on the i value, and picking out the count
from the last {i,count} pair in the group for each i so we only get the