Re: How best to implement a hash table in Mathematica

• To: mathgroup at smc.vnet.net
• Subject: [mg125119] Re: How best to implement a hash table in Mathematica
• From: Leonid Shifrin <lshifr at gmail.com>
• Date: Wed, 22 Feb 2012 05:32:33 -0500 (EST)
• Delivered-to: l-mathgroup@mail-archive0.wolfram.com
• References: <jho21j\$5sb\$1@smc.vnet.net>

```Hi Joe,

I suspect that most of the time is spent it adding to sub-lists, rather
than hashing proper. I suggest to use a DownValues-based hash-table with

Here is a linked list "API":

emptyList];
pop[ll_] := Last@ll;

Here is a simplistic hash table constructor (I exploit the fact that you
don't have to delete
key-value pairs):

ClearAll[makeHash];
makeHash[] :=
Module[{hash},
hash[_] := emptyList[];
hash /: add[hash, key_, value_] :=
hash /: get[hash, key_] := fromLinkedList@hash[key];
Return[hash]];

Let's create a new hash table:

In[10]:= hash = makeHash[]
Out[10]= hash\$777

Now, I will populate it with the words of built-in English dictionary, the
keys being first two letters of a word:

values = DictionaryLookup["*"];
keys = StringTake[#, Min[2, StringLength[#]]] & /@ values;

It takes only half a second to fill the hash table, on my machine:

Out[14]= {0.515,Null}

for 92518 words.

Here is how you make a lookup:

In[19]:= get[hash,"pr"]//Short
Out[19]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}

You can also use the hash table from System`Utilities`. The following is an
analogous constructor based on that:

ClearAll[makeHashAlt];
makeHashAlt[] :=
Module[{hash, innerHash},
innerHash =  System`Utilities`HashTable[];
hash /: add[hash, key_, value_] :=
If[System`Utilities`HashTableContainsQ[innerHash, key],
With[{res = System`Utilities`HashTableGet[innerHash, key]},
System`Utilities`HashTableRemove[innerHash, key];
res],
(* else *)
emptyList[]
],
value]];
hash /:
get[hash, key_] /; !
System`Utilities`HashTableContainsQ[innerHash, key] := {};
hash /: get[hash, key_] :=
Return[hash]];

Create a new hash:

In[48]:= newhash = makeHashAlt[]
Out[48]= hash\$1814

The performance will be slightly worse:

Out[49]= {0.671,Null}

You can use it in the same way as the first version.

In[50]:= get[newhash,"pr"]//Short
Out[50]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}

The whole point here is that this should scale really well, since the
insertion is always constant-time.
For hashes based on plain lists, this will be worse as the lists for a
given key grow larger (you will have
an average insertion time be proportional roughly to the average length of
the accumulated sublist).

Hope this helps. Consider cross-posting to
http://mathematica.stackexchange.com, to have some
more eyes on your problems of interest.

Cheers,
Leonid

On Mon, Feb 20, 2012 at 10:46 AM, Joseph Gwinn <joegwinn at comcast.net> wrote:

> In article <jhqmrr\$fbq\$1 at smc.vnet.net>,
>  David Bailey <dave at removedbailey.co.uk> wrote:
>
> > On 18/02/2012 11:28, Joseph Gwinn wrote:
> > > I have an application that resembles dictionary lookup, with
> significant
> > > rates of collision.
> > >
> > > The current approach is to initialize the hash table, a 10^4 element
> > > list of empty sublists, and then build the table by inserting document
> > > object id numbers into the correct sublists as indexed by some
> important
> > > property of the object.  The choice of property varies with
> application.
> > > The property may be quantized geographic location, or some attribute of
> > > the object itself.
> > >
> > > When done building the hash table, query consists of reading the
> sublist
> > > using an index derived in the same way as for the original building of
> > > the list.  Each sublist may have many entries, and the entries need not
> > > be numbers.
> > >
> > > Anyway, I have a workable approach using
> > >
> > > HashList[[index]]=
> > > Union[Flatten[Join[HashList[[index]],{new object ID}]]]]
> > >
> > > as the core update function, but I'm wondering how well this will scale
> > > to say 10^6 sublists (which may be needed to keep the collisions under
> > > control with large datasets), and of course if there is a better way in
> > > Mathematica.
> > >
> > > I did look at SparseArrays, but they don't support elements that are
> > > sublists.
> > >
> > > Although this hash table is one dimensional, I have uses for 2D and
> > > maybe 3D tables, which are very common when one is hashing physical
> > > location.
> > >
> > > Thanks in advance,
> > >
> > > Joe Gwinn
> > >
> > Before constructing a hash table (other than for pedagogical reasons), I
> > would try the built in mechanisms of Mathematica. For example, here is a
> > trivial loop which creates a 'hash table' called hashtable, with values
> > f[key]:
> >
> > kk=1;
> > While[kk < 100000,
> > hashtable[kk]=f[kk];
> > kk++;
> > ];
> >
> > Of course, in this example the keys to the table are just the integers,
> > but the built-in mechanism will work with keys of any structure.
>
> The table has to be updated more or less at random as the dataset is
> digested.  What I'm doing is essentially your above method, but not all
> elements will ever receive a value.  In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>
> > When a function is defined in this way as lots of individual values,
> > Mathematica uses techniques (probably hash) to access the values. Try it
> > and see how efficient it is!
>
> Mathematica does uses hash tables internally, to good effect.  There is
> a discussion of this with respect to internal implementation of sparse
> arrays, but I assume that hash tables are far more widely used inside
> Mathematica.
>
>
> For the record, here is my core hashtable algorithm, with transient
> details omitted:
>
> HashIndexModulus = 10000;
>
> HashTable=Table[{},{HashIndexModulus}];
>
> HashValuesList =  Map[HashingFunction, dataset];
>
> Do[
>  Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i,
>   HashValuesList[[j]]}
>   ],
>  {j, Length[HashValuesList]}
>  ];
>
> The dataset in the above case is a list containing thousands of strings,
> the strings being on average 100 characters long.
>
> The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about
> 1200 strings, and the computation time does not much vary with the size
> of the hashtable.  I would expect it to instead vary directly with the
> total number of characters in the dataset.
>
> Above 10^4 elements, creating the empty hashtable is linear in the
> length of the hashtable, costing on average 53 to 54 nanoseconds per
> element.  Below 10^4, elements, one can see evidence of internal
> gearshifting within the kernel.  In any event, it takes 55 milliseconds
> to init a 10^6 element hashtable, so this isn't a problem.
>
> Joe Gwinn
>
>

```

• Prev by Date: Re: Complex and Solve
• Next by Date: Re: Kolmogorov Smirnov in two or more dimensions is in Mathematica 8.0.4
• Previous by thread: Re: How best to implement a hash table in Mathematica
• Next by thread: Re: How best to implement a hash table in Mathematica