Re: Efficient Histogram Algorithm?

*To*: mathgroup at smc.vnet.net*Subject*: [mg113083] Re: Efficient Histogram Algorithm?*From*: Darren Glosemeyer <darreng at wolfram.com>*Date*: Tue, 12 Oct 2010 13:49:42 -0400 (EDT)

On 10/12/2010 3:26 AM, Ray Koopman wrote: > On Oct 11, 2:17 am, Julian Francis<julian.w.fran... at gmail.com> wrote: >> Dear all, >> >> Does anyone know of a way to implement a fast histogram algorithm in >> Mathematica? >> >> My first attempt (In44) was based around a table and using the count >> function. A bit of thought indicates that this might not be very >> efficient as it would seem to involve examining the list 255 times, >> when only once is necessary. The second attempt involved using the >> built in library function BinCounts, but surprisingly this was only >> about twice as fast as my initial naive version, which makes me wonder >> if it is as efficient as it could be. I can assume for my purposes >> that the list will always have integer values between 0 and 255, and >> that there are 256 bins. >> >> I wasn't able to think of a way to implement the histogram function in >> a functional programming style that was also efficient, although it is >> fairly easy to write in an imperative language. This may reflect the >> vast majority of my programming experience has been in imperative >> languages. >> >> I am using an old version of Mathematica (v4.1) (but on a modern PC); >> but if anyone has dramatically different times for the BinCounts >> function, then maybe it is time for me to upgrade. >> >> My motivation is a pattern recognition algorithm I am implementing, >> but it requires a large number of histograms to be created, hence the >> running time of an individual histogram is actually quite important. >> >> Thanks for any help, >> Julian. >> >> In[44]:= >> jhistogram[list_] := Table[Count[list, n], {n, 0, 255}] >> >> In[51]:= >> \!\(Timing[\(randomList = Table[Random[Integer, 255], {10\^6}];\)]\) >> >> Out[51]= >> {1.373 Second, Null} >> >> In[54]:= >> Timing[hist = jhistogram[randomList];] >> >> Out[54]= >> {2.34 Second, Null} >> >> In[56]:= >> << Statistics` >> >> In[57]:= >> Timing[hist = BinCounts[randomList, {0, 255}];] >> >> Out[57]= >> {0.936 Second, Null} > Here's a compiled routine that's twice as fast as BinCounts. Also, > note that you need to give BinCounts a non-intuitive set of bins. > > In[1]:=<< Statistics` > > In[2]:= jhistogram[list_] := Table[Count[list,n],{n,0,255}] > > In[3]:= freqs = Compile[{{data,_Integer,1}}, > Module[{bins = Table[0,{256}]}, > Scan[bins[[#]]++&,data+1]; bins]]; > > In[4]:= Timing@Length[randomList = Table[Random[Integer,255],{10^6}]] > Timing@Length[hist1 = jhistogram[randomList]] > Timing@Length[hist2 = BinCounts[randomList,{-1,255}]] > Timing@Length[hist3 = freqs[randomList]] > SameQ[hist1,hist2,hist3] > > Out[4]= {1.26 Second,1000000} > Out[5]= {2.5 Second,256} > Out[6]= {0.9 Second,256} > Out[7]= {0.45 Second,256} > Out[8]= True > BinCounts is really a function for continuous values. Given that we know the values of interest are integers from 0 to 255, this could be made even faster by using Tally (added in version 6). In[1]:= jhistogram[list_] := Table[Count[list, n], {n, 0, 255}] In[2]:= freqs = Compile[{{data, _Integer, 1}}, Module[{bins = Table[0, {256}]}, Scan[bins[[#]]++ &, data + 1]; bins]]; In[3]:= Timing@Length[randomList = Table[Random[Integer, 255], {10^6}]] Out[3]= {0.391, 1000000} In[4]:= Timing@Length[hist1 = jhistogram[randomList]] Out[4]= {0.531, 256} In[5]:= tallybased = With[{tallies = Tally[randomList]}, Split[ Sort[Join[tallies, Transpose[{Range[0, 255], ConstantArray[0, 256]}]]], #[[1]] == #2[[1]] &][[All, -1, -1]]]; // Timing -17 Out[5]= {1.40946 10 , Null} In[6]:= SameQ[hist1, tallybased] Out[6]= True Tally gives a list of {value, count} pairs for the elements in the list. The code above works by getting the tallies for the values in list, adding elements of the form {i, 0} (this is done so we can make sure a 0 count is present if the value is not in the list) for i from 0 to 255, sorting so like i values are next to each other and the 0 count will come first, splitting based on the i value, and picking out the count from the last {i,count} pair in the group for each i so we only get the actual counts for the list and not any additional padding 0s. Darren Glosemeyer Wolfram Research