Re: How best to implement a hash table in Mathematica
- To: mathgroup at smc.vnet.net
- Subject: [mg125119] Re: How best to implement a hash table in Mathematica
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Wed, 22 Feb 2012 05:32:33 -0500 (EST)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <jho21j$5sb$1@smc.vnet.net>
Hi Joe,
I suspect that most of the time is spent it adding to sub-lists, rather
than hashing proper. I suggest to use a DownValues-based hash-table with
linked lists, as follows.
Here is a linked list "API":
ClearAll[linkedList, toLinkedList, fromLinkedList, addToList, pop,
emptyList];
SetAttributes[linkedList, HoldAllComplete];
toLinkedList[data_List] := Fold[linkedList, linkedList[], data];
fromLinkedList[ll_linkedList] :=
List @@ Flatten[ll, Infinity, linkedList];
addToList[ll_, value_] := linkedList[ll, value];
pop[ll_] := Last@ll;
emptyList[] := linkedList[];
Here is a simplistic hash table constructor (I exploit the fact that you
don't have to delete
key-value pairs):
ClearAll[makeHash];
makeHash[] :=
Module[{hash},
hash[_] := emptyList[];
hash /: add[hash, key_, value_] :=
hash[key] = addToList[hash[key], value];
hash /: get[hash, key_] := fromLinkedList@hash[key];
Return[hash]];
Let's create a new hash table:
In[10]:= hash = makeHash[]
Out[10]= hash$777
Now, I will populate it with the words of built-in English dictionary, the
keys being first two letters of a word:
values = DictionaryLookup["*"];
keys = StringTake[#, Min[2, StringLength[#]]] & /@ values;
It takes only half a second to fill the hash table, on my machine:
In[14]:= (add[hash,##]&@@@Transpose[{keys,values}]);//Timing
Out[14]= {0.515,Null}
for 92518 words.
Here is how you make a lookup:
In[19]:= get[hash,"pr"]//Short
Out[19]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}
You can also use the hash table from System`Utilities`. The following is an
analogous constructor based on that:
ClearAll[makeHashAlt];
makeHashAlt[] :=
Module[{hash, innerHash},
innerHash = System`Utilities`HashTable[];
hash /: add[hash, key_, value_] :=
System`Utilities`HashTableAdd[innerHash, key,
addToList[
If[System`Utilities`HashTableContainsQ[innerHash, key],
With[{res = System`Utilities`HashTableGet[innerHash, key]},
System`Utilities`HashTableRemove[innerHash, key];
res],
(* else *)
emptyList[]
],
value]];
hash /:
get[hash, key_] /; !
System`Utilities`HashTableContainsQ[innerHash, key] := {};
hash /: get[hash, key_] :=
fromLinkedList@System`Utilities`HashTableGet[innerHash, key];
Return[hash]];
Create a new hash:
In[48]:= newhash = makeHashAlt[]
Out[48]= hash$1814
The performance will be slightly worse:
In[49]:= (add[newhash,##]&@@@Transpose[{keys,values}]);//Timing
Out[49]= {0.671,Null}
You can use it in the same way as the first version.
In[50]:= get[newhash,"pr"]//Short
Out[50]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}
The whole point here is that this should scale really well, since the
insertion is always constant-time.
For hashes based on plain lists, this will be worse as the lists for a
given key grow larger (you will have
an average insertion time be proportional roughly to the average length of
the accumulated sublist).
Hope this helps. Consider cross-posting to
http://mathematica.stackexchange.com, to have some
more eyes on your problems of interest.
Cheers,
Leonid
On Mon, Feb 20, 2012 at 10:46 AM, Joseph Gwinn <joegwinn at comcast.net> wrote:
> In article <jhqmrr$fbq$1 at smc.vnet.net>,
> David Bailey <dave at removedbailey.co.uk> wrote:
>
> > On 18/02/2012 11:28, Joseph Gwinn wrote:
> > > I have an application that resembles dictionary lookup, with
> significant
> > > rates of collision.
> > >
> > > The current approach is to initialize the hash table, a 10^4 element
> > > list of empty sublists, and then build the table by inserting document
> > > object id numbers into the correct sublists as indexed by some
> important
> > > property of the object. The choice of property varies with
> application.
> > > The property may be quantized geographic location, or some attribute of
> > > the object itself.
> > >
> > > When done building the hash table, query consists of reading the
> sublist
> > > using an index derived in the same way as for the original building of
> > > the list. Each sublist may have many entries, and the entries need not
> > > be numbers.
> > >
> > > Anyway, I have a workable approach using
> > >
> > > HashList[[index]]=
> > > Union[Flatten[Join[HashList[[index]],{new object ID}]]]]
> > >
> > > as the core update function, but I'm wondering how well this will scale
> > > to say 10^6 sublists (which may be needed to keep the collisions under
> > > control with large datasets), and of course if there is a better way in
> > > Mathematica.
> > >
> > > I did look at SparseArrays, but they don't support elements that are
> > > sublists.
> > >
> > > Although this hash table is one dimensional, I have uses for 2D and
> > > maybe 3D tables, which are very common when one is hashing physical
> > > location.
> > >
> > > Thanks in advance,
> > >
> > > Joe Gwinn
> > >
> > Before constructing a hash table (other than for pedagogical reasons), I
> > would try the built in mechanisms of Mathematica. For example, here is a
> > trivial loop which creates a 'hash table' called hashtable, with values
> > f[key]:
> >
> > kk=1;
> > While[kk < 100000,
> > hashtable[kk]=f[kk];
> > kk++;
> > ];
> >
> > Of course, in this example the keys to the table are just the integers,
> > but the built-in mechanism will work with keys of any structure.
>
> The table has to be updated more or less at random as the dataset is
> digested. What I'm doing is essentially your above method, but not all
> elements will ever receive a value. In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>
> > When a function is defined in this way as lots of individual values,
> > Mathematica uses techniques (probably hash) to access the values. Try it
> > and see how efficient it is!
>
> Mathematica does uses hash tables internally, to good effect. There is
> a discussion of this with respect to internal implementation of sparse
> arrays, but I assume that hash tables are far more widely used inside
> Mathematica.
>
>
> For the record, here is my core hashtable algorithm, with transient
> details omitted:
>
> HashIndexModulus = 10000;
>
> HashTable=Table[{},{HashIndexModulus}];
>
> HashValuesList = Map[HashingFunction, dataset];
>
> Do[
> Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i,
> HashValuesList[[j]]}
> ],
> {j, Length[HashValuesList]}
> ];
>
> The dataset in the above case is a list containing thousands of strings,
> the strings being on average 100 characters long.
>
> The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about
> 1200 strings, and the computation time does not much vary with the size
> of the hashtable. I would expect it to instead vary directly with the
> total number of characters in the dataset.
>
> Above 10^4 elements, creating the empty hashtable is linear in the
> length of the hashtable, costing on average 53 to 54 nanoseconds per
> element. Below 10^4, elements, one can see evidence of internal
> gearshifting within the kernel. In any event, it takes 55 milliseconds
> to init a 10^6 element hashtable, so this isn't a problem.
>
> Joe Gwinn
>
>