       Re: Efficient Histogram Algorithm?

• To: mathgroup at smc.vnet.net
• Subject: [mg113083] Re: Efficient Histogram Algorithm?
• From: Darren Glosemeyer <darreng at wolfram.com>
• Date: Tue, 12 Oct 2010 13:49:42 -0400 (EDT)

```  On 10/12/2010 3:26 AM, Ray Koopman wrote:
> On Oct 11, 2:17 am, Julian Francis<julian.w.fran... at gmail.com>  wrote:
>> Dear all,
>>
>> Does anyone know of a way to implement a fast histogram algorithm in
>> Mathematica?
>>
>> My first attempt (In44) was based around a table and using the count
>> function. A bit of thought indicates that this might not be very
>> efficient as it would seem to involve examining the list 255 times,
>> when only once is necessary. The second attempt involved using the
>> built in library function BinCounts, but surprisingly this was only
>> about twice as fast as my initial naive version, which makes me wonder
>> if it is as efficient as it could be. I can assume for my purposes
>> that the list will always have integer values between 0 and 255, and
>> that there are 256 bins.
>>
>> I wasn't able to think of a way to implement the histogram function in
>> a functional programming style that was also efficient, although it is
>> fairly easy to write in an imperative language. This may reflect the
>> vast majority of my programming experience has been in imperative
>> languages.
>>
>> I am using an old version of Mathematica (v4.1) (but on a modern PC);
>> but if anyone has dramatically different times for the BinCounts
>> function, then maybe it is time for me to upgrade.
>>
>> My motivation is a pattern recognition algorithm I am implementing,
>> but it requires a large number of histograms to be created, hence the
>> running time of an individual histogram is actually quite important.
>>
>> Thanks for any help,
>> Julian.
>>
>> In:=
>> jhistogram[list_] := Table[Count[list, n], {n, 0, 255}]
>>
>> In:=
>> \!\(Timing[\(randomList = Table[Random[Integer, 255], {10\^6}];\)]\)
>>
>> Out=
>> {1.373 Second, Null}
>>
>> In:=
>> Timing[hist = jhistogram[randomList];]
>>
>> Out=
>> {2.34 Second, Null}
>>
>> In:=
>> <<  Statistics`
>>
>> In:=
>> Timing[hist = BinCounts[randomList, {0, 255}];]
>>
>> Out=
>> {0.936 Second, Null}
> Here's a compiled routine that's twice as fast as BinCounts. Also,
> note that you need to give BinCounts a non-intuitive set of bins.
>
> In:=<<  Statistics`
>
> In:= jhistogram[list_] := Table[Count[list,n],{n,0,255}]
>
> In:= freqs = Compile[{{data,_Integer,1}},
>          Module[{bins = Table[0,{256}]},
>          Scan[bins[[#]]++&,data+1]; bins]];
>
> In:= Timing@Length[randomList = Table[Random[Integer,255],{10^6}]]
>          Timing@Length[hist1 = jhistogram[randomList]]
>          Timing@Length[hist2 = BinCounts[randomList,{-1,255}]]
>          Timing@Length[hist3 = freqs[randomList]]
>          SameQ[hist1,hist2,hist3]
>
> Out= {1.26 Second,1000000}
> Out= {2.5 Second,256}
> Out= {0.9 Second,256}
> Out= {0.45 Second,256}
> Out= True
>

BinCounts is really a function for continuous values. Given that we know
the values of interest are integers from 0 to 255, this could be made
even faster by using Tally (added in version 6).

In:= jhistogram[list_] := Table[Count[list, n], {n, 0, 255}]

In:= freqs = Compile[{{data, _Integer, 1}},
Module[{bins = Table[0, {256}]}, Scan[bins[[#]]++ &, data + 1];
bins]];

In:= Timing@Length[randomList = Table[Random[Integer, 255], {10^6}]]

Out= {0.391, 1000000}

In:= Timing@Length[hist1 = jhistogram[randomList]]

Out= {0.531, 256}

In:= tallybased = With[{tallies = Tally[randomList]},
Split[
Sort[Join[tallies,
Transpose[{Range[0, 255],
ConstantArray[0,
256]}]]], #[] == #2[] &][[All, -1, -1]]]; //
Timing

-17
Out= {1.40946 10   , Null}

In:= SameQ[hist1, tallybased]

Out= True

Tally gives a list of {value, count} pairs for the elements in the list.
The code above works by getting the tallies for the values in list,
adding elements of the form {i, 0} (this is done so we can make sure a 0
count is present if the value is not in the list) for i from 0 to 255,
sorting so like i values are next to each other and the 0 count will
come first, splitting based on the i value, and picking out the count
from the last {i,count} pair in the group for each i so we only get the