MathGroup Archive: February 2012 [00383]

[Date Index] [Thread Index] [Author Index]

Re: How best to implement a hash table in Mathematica

To: mathgroup at smc.vnet.net
Subject: [mg125059] Re: How best to implement a hash table in Mathematica
From: Joseph Gwinn <joegwinn at comcast.net>
Date: Mon, 20 Feb 2012 02:46:44 -0500 (EST)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <jho21j$5sb$1@smc.vnet.net> <jhqmrr$fbq$1@smc.vnet.net>

In article <jhqmrr$fbq$1 at smc.vnet.net>,
 David Bailey <dave at removedbailey.co.uk> wrote:

> On 18/02/2012 11:28, Joseph Gwinn wrote:
> > I have an application that resembles dictionary lookup, with significant
> > rates of collision.
> >
> > The current approach is to initialize the hash table, a 10^4 element
> > list of empty sublists, and then build the table by inserting document
> > object id numbers into the correct sublists as indexed by some important
> > property of the object.  The choice of property varies with application.
> > The property may be quantized geographic location, or some attribute of
> > the object itself.
> >
> > When done building the hash table, query consists of reading the sublist
> > using an index derived in the same way as for the original building of
> > the list.  Each sublist may have many entries, and the entries need not
> > be numbers.
> >
> > Anyway, I have a workable approach using
> >
> > HashList[[index]]=
> > Union[Flatten[Join[HashList[[index]],{new object ID}]]]]
> >
> > as the core update function, but I'm wondering how well this will scale
> > to say 10^6 sublists (which may be needed to keep the collisions under
> > control with large datasets), and of course if there is a better way in
> > Mathematica.
> >
> > I did look at SparseArrays, but they don't support elements that are
> > sublists.
> >
> > Although this hash table is one dimensional, I have uses for 2D and
> > maybe 3D tables, which are very common when one is hashing physical
> > location.
> >
> > Thanks in advance,
> >
> > Joe Gwinn
> >
> Before constructing a hash table (other than for pedagogical reasons), I 
> would try the built in mechanisms of Mathematica. For example, here is a 
> trivial loop which creates a 'hash table' called hashtable, with values 
> f[key]:
> 
> kk=1;
> While[kk < 100000,
> hashtable[kk]=f[kk];
> kk++;
> ];
> 
> Of course, in this example the keys to the table are just the integers, 
> but the built-in mechanism will work with keys of any structure.

The table has to be updated more or less at random as the dataset is 
digested.  What I'm doing is essentially your above method, but not all 
elements will ever receive a value.  In fact, most will not.

The reason to initialize the hashtable in advance is so a later probe to 
an empty table element yields {}, versus repeating the question back.

> When a function is defined in this way as lots of individual values, 
> Mathematica uses techniques (probably hash) to access the values. Try it 
> and see how efficient it is!

Mathematica does uses hash tables internally, to good effect.  There is 
a discussion of this with respect to internal implementation of sparse 
arrays, but I assume that hash tables are far more widely used inside 
Mathematica.

For the record, here is my core hashtable algorithm, with transient 
details omitted:

HashIndexModulus = 10000;   

HashTable=Table[{},{HashIndexModulus}];  

HashValuesList =  Map[HashingFunction, dataset];

Do[
  Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i, 
   HashValuesList[[j]]}
   ],
  {j, Length[HashValuesList]}
  ];

The dataset in the above case is a list containing thousands of strings, 
the strings being on average 100 characters long.

The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about 
1200 strings, and the computation time does not much vary with the size 
of the hashtable.  I would expect it to instead vary directly with the 
total number of characters in the dataset.  

Above 10^4 elements, creating the empty hashtable is linear in the 
length of the hashtable, costing on average 53 to 54 nanoseconds per 
element.  Below 10^4, elements, one can see evidence of internal 
gearshifting within the kernel.  In any event, it takes 55 milliseconds 
to init a 10^6 element hashtable, so this isn't a problem.

Joe Gwinn

Prev by Date: Re: Reading Single and Double reals in Little-endian and Big-endian

Next by Date: Re: How to change individual colors of an image

Previous by thread: Re: How best to implement a hash table in Mathematica

Next by thread: Re: How best to implement a hash table in Mathematica