MathGroup Archive: February 2012 [00429]

[Date Index] [Thread Index] [Author Index]
Re: How best to implement a hash table in Mathematica
To: mathgroup at smc.vnet.net
Subject: [mg125119] Re: How best to implement a hash table in Mathematica
From: Leonid Shifrin <lshifr at gmail.com>
Date: Wed, 22 Feb 2012 05:32:33 -0500 (EST)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <jho21j$5sb$1@smc.vnet.net>
Hi Joe,

I suspect that most of the time is spent it adding to sub-lists, rather
than hashing proper. I suggest to use a DownValues-based hash-table with
linked lists, as follows.

Here is a linked list "API":


ClearAll[linkedList, toLinkedList, fromLinkedList, addToList, pop,
  emptyList];
SetAttributes[linkedList, HoldAllComplete];
toLinkedList[data_List] := Fold[linkedList, linkedList[], data];
fromLinkedList[ll_linkedList] :=
  List @@ Flatten[ll, Infinity, linkedList];
addToList[ll_, value_] := linkedList[ll, value];
pop[ll_] := Last@ll;
emptyList[] := linkedList[];

Here is a simplistic hash table constructor (I exploit the fact that you
don't have to delete
key-value pairs):


ClearAll[makeHash];
makeHash[] :=
  Module[{hash},
   hash[_] := emptyList[];
   hash /: add[hash, key_, value_] :=
    hash[key] = addToList[hash[key], value];
   hash /: get[hash, key_] := fromLinkedList@hash[key];
   Return[hash]];


Let's create a new hash table:

In[10]:= hash = makeHash[]
Out[10]= hash$777

Now, I will populate it with the words of built-in English dictionary, the
keys being first two letters of a word:

values = DictionaryLookup["*"];
keys = StringTake[#, Min[2, StringLength[#]]] & /@ values;

It takes only half a second to fill the hash table, on my machine:

In[14]:= (add[hash,##]&@@@Transpose[{keys,values}]);//Timing
Out[14]= {0.515,Null}

for 92518 words.

Here is how you make a lookup:

In[19]:= get[hash,"pr"]//Short
Out[19]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}

You can also use the hash table from System`Utilities`. The following is an
analogous constructor based on that:

ClearAll[makeHashAlt];
makeHashAlt[] :=
  Module[{hash, innerHash},
   innerHash =  System`Utilities`HashTable[];
   hash /: add[hash, key_, value_] :=
    System`Utilities`HashTableAdd[innerHash, key,
     addToList[
      If[System`Utilities`HashTableContainsQ[innerHash, key],
       With[{res = System`Utilities`HashTableGet[innerHash, key]},
         System`Utilities`HashTableRemove[innerHash, key];
        res],
       (* else *)
       emptyList[]
       ],
      value]];
   hash /:
    get[hash, key_] /; !
      System`Utilities`HashTableContainsQ[innerHash, key] := {};
   hash /: get[hash, key_] :=
    fromLinkedList@System`Utilities`HashTableGet[innerHash, key];
   Return[hash]];

Create a new hash:

In[48]:= newhash = makeHashAlt[]
Out[48]= hash$1814

The performance will be slightly worse:

In[49]:= (add[newhash,##]&@@@Transpose[{keys,values}]);//Timing
Out[49]= {0.671,Null}

You can use it in the same way as the first version.

In[50]:= get[newhash,"pr"]//Short
Out[50]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}


The whole point here is that this should scale really well, since the
insertion is always constant-time.
For hashes based on plain lists, this will be worse as the lists for a
given key grow larger (you will have
an average insertion time be proportional roughly to the average length of
the accumulated sublist).

Hope this helps. Consider cross-posting to
http://mathematica.stackexchange.com, to have some
more eyes on your problems of interest.


Cheers,
Leonid





On Mon, Feb 20, 2012 at 10:46 AM, Joseph Gwinn <joegwinn at comcast.net> wrote:

> In article <jhqmrr$fbq$1 at smc.vnet.net>,
>  David Bailey <dave at removedbailey.co.uk> wrote:
>
> > On 18/02/2012 11:28, Joseph Gwinn wrote:
> > > I have an application that resembles dictionary lookup, with
> significant
> > > rates of collision.
> > >
> > > The current approach is to initialize the hash table, a 10^4 element
> > > list of empty sublists, and then build the table by inserting document
> > > object id numbers into the correct sublists as indexed by some
> important
> > > property of the object.  The choice of property varies with
> application.
> > > The property may be quantized geographic location, or some attribute of
> > > the object itself.
> > >
> > > When done building the hash table, query consists of reading the
> sublist
> > > using an index derived in the same way as for the original building of
> > > the list.  Each sublist may have many entries, and the entries need not
> > > be numbers.
> > >
> > > Anyway, I have a workable approach using
> > >
> > > HashList[[index]]=
> > > Union[Flatten[Join[HashList[[index]],{new object ID}]]]]
> > >
> > > as the core update function, but I'm wondering how well this will scale
> > > to say 10^6 sublists (which may be needed to keep the collisions under
> > > control with large datasets), and of course if there is a better way in
> > > Mathematica.
> > >
> > > I did look at SparseArrays, but they don't support elements that are
> > > sublists.
> > >
> > > Although this hash table is one dimensional, I have uses for 2D and
> > > maybe 3D tables, which are very common when one is hashing physical
> > > location.
> > >
> > > Thanks in advance,
> > >
> > > Joe Gwinn
> > >
> > Before constructing a hash table (other than for pedagogical reasons), I
> > would try the built in mechanisms of Mathematica. For example, here is a
> > trivial loop which creates a 'hash table' called hashtable, with values
> > f[key]:
> >
> > kk=1;
> > While[kk < 100000,
> > hashtable[kk]=f[kk];
> > kk++;
> > ];
> >
> > Of course, in this example the keys to the table are just the integers,
> > but the built-in mechanism will work with keys of any structure.
>
> The table has to be updated more or less at random as the dataset is
> digested.  What I'm doing is essentially your above method, but not all
> elements will ever receive a value.  In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>
> > When a function is defined in this way as lots of individual values,
> > Mathematica uses techniques (probably hash) to access the values. Try it
> > and see how efficient it is!
>
> Mathematica does uses hash tables internally, to good effect.  There is
> a discussion of this with respect to internal implementation of sparse
> arrays, but I assume that hash tables are far more widely used inside
> Mathematica.
>
>
> For the record, here is my core hashtable algorithm, with transient
> details omitted:
>
> HashIndexModulus = 10000;
>
> HashTable=Table[{},{HashIndexModulus}];
>
> HashValuesList =  Map[HashingFunction, dataset];
>
> Do[
>  Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i,
>   HashValuesList[[j]]}
>   ],
>  {j, Length[HashValuesList]}
>  ];
>
> The dataset in the above case is a list containing thousands of strings,
> the strings being on average 100 characters long.
>
> The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about
> 1200 strings, and the computation time does not much vary with the size
> of the hashtable.  I would expect it to instead vary directly with the
> total number of characters in the dataset.
>
> Above 10^4 elements, creating the empty hashtable is linear in the
> length of the hashtable, costing on average 53 to 54 nanoseconds per
> element.  Below 10^4, elements, one can see evidence of internal
> gearshifting within the kernel.  In any event, it takes 55 milliseconds
> to init a 10^6 element hashtable, so this isn't a problem.
>
> Joe Gwinn
>
>
Prev by Date: Re: Complex and Solve
Next by Date: Re: Kolmogorov Smirnov in two or more dimensions is in Mathematica 8.0.4
Previous by thread: Re: How best to implement a hash table in Mathematica
Next by thread: Re: How best to implement a hash table in Mathematica