Re: How best to implement a hash table in Mathematica
- To: mathgroup at smc.vnet.net
- Subject: [mg125119] Re: How best to implement a hash table in Mathematica
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Wed, 22 Feb 2012 05:32:33 -0500 (EST)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <jho21j$5sb$1@smc.vnet.net>
Hi Joe, I suspect that most of the time is spent it adding to sub-lists, rather than hashing proper. I suggest to use a DownValues-based hash-table with linked lists, as follows. Here is a linked list "API": ClearAll[linkedList, toLinkedList, fromLinkedList, addToList, pop, emptyList]; SetAttributes[linkedList, HoldAllComplete]; toLinkedList[data_List] := Fold[linkedList, linkedList[], data]; fromLinkedList[ll_linkedList] := List @@ Flatten[ll, Infinity, linkedList]; addToList[ll_, value_] := linkedList[ll, value]; pop[ll_] := Last@ll; emptyList[] := linkedList[]; Here is a simplistic hash table constructor (I exploit the fact that you don't have to delete key-value pairs): ClearAll[makeHash]; makeHash[] := Module[{hash}, hash[_] := emptyList[]; hash /: add[hash, key_, value_] := hash[key] = addToList[hash[key], value]; hash /: get[hash, key_] := fromLinkedList@hash[key]; Return[hash]]; Let's create a new hash table: In[10]:= hash = makeHash[] Out[10]= hash$777 Now, I will populate it with the words of built-in English dictionary, the keys being first two letters of a word: values = DictionaryLookup["*"]; keys = StringTake[#, Min[2, StringLength[#]]] & /@ values; It takes only half a second to fill the hash table, on my machine: In[14]:= (add[hash,##]&@@@Transpose[{keys,values}]);//Timing Out[14]= {0.515,Null} for 92518 words. Here is how you make a lookup: In[19]:= get[hash,"pr"]//Short Out[19]//Short= {practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying} You can also use the hash table from System`Utilities`. The following is an analogous constructor based on that: ClearAll[makeHashAlt]; makeHashAlt[] := Module[{hash, innerHash}, innerHash = System`Utilities`HashTable[]; hash /: add[hash, key_, value_] := System`Utilities`HashTableAdd[innerHash, key, addToList[ If[System`Utilities`HashTableContainsQ[innerHash, key], With[{res = System`Utilities`HashTableGet[innerHash, key]}, System`Utilities`HashTableRemove[innerHash, key]; res], (* else *) emptyList[] ], value]]; hash /: get[hash, key_] /; ! System`Utilities`HashTableContainsQ[innerHash, key] := {}; hash /: get[hash, key_] := fromLinkedList@System`Utilities`HashTableGet[innerHash, key]; Return[hash]]; Create a new hash: In[48]:= newhash = makeHashAlt[] Out[48]= hash$1814 The performance will be slightly worse: In[49]:= (add[newhash,##]&@@@Transpose[{keys,values}]);//Timing Out[49]= {0.671,Null} You can use it in the same way as the first version. In[50]:= get[newhash,"pr"]//Short Out[50]//Short= {practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying} The whole point here is that this should scale really well, since the insertion is always constant-time. For hashes based on plain lists, this will be worse as the lists for a given key grow larger (you will have an average insertion time be proportional roughly to the average length of the accumulated sublist). Hope this helps. Consider cross-posting to http://mathematica.stackexchange.com, to have some more eyes on your problems of interest. Cheers, Leonid On Mon, Feb 20, 2012 at 10:46 AM, Joseph Gwinn <joegwinn at comcast.net> wrote: > In article <jhqmrr$fbq$1 at smc.vnet.net>, > David Bailey <dave at removedbailey.co.uk> wrote: > > > On 18/02/2012 11:28, Joseph Gwinn wrote: > > > I have an application that resembles dictionary lookup, with > significant > > > rates of collision. > > > > > > The current approach is to initialize the hash table, a 10^4 element > > > list of empty sublists, and then build the table by inserting document > > > object id numbers into the correct sublists as indexed by some > important > > > property of the object. The choice of property varies with > application. > > > The property may be quantized geographic location, or some attribute of > > > the object itself. > > > > > > When done building the hash table, query consists of reading the > sublist > > > using an index derived in the same way as for the original building of > > > the list. Each sublist may have many entries, and the entries need not > > > be numbers. > > > > > > Anyway, I have a workable approach using > > > > > > HashList[[index]]= > > > Union[Flatten[Join[HashList[[index]],{new object ID}]]]] > > > > > > as the core update function, but I'm wondering how well this will scale > > > to say 10^6 sublists (which may be needed to keep the collisions under > > > control with large datasets), and of course if there is a better way in > > > Mathematica. > > > > > > I did look at SparseArrays, but they don't support elements that are > > > sublists. > > > > > > Although this hash table is one dimensional, I have uses for 2D and > > > maybe 3D tables, which are very common when one is hashing physical > > > location. > > > > > > Thanks in advance, > > > > > > Joe Gwinn > > > > > Before constructing a hash table (other than for pedagogical reasons), I > > would try the built in mechanisms of Mathematica. For example, here is a > > trivial loop which creates a 'hash table' called hashtable, with values > > f[key]: > > > > kk=1; > > While[kk < 100000, > > hashtable[kk]=f[kk]; > > kk++; > > ]; > > > > Of course, in this example the keys to the table are just the integers, > > but the built-in mechanism will work with keys of any structure. > > The table has to be updated more or less at random as the dataset is > digested. What I'm doing is essentially your above method, but not all > elements will ever receive a value. In fact, most will not. > > The reason to initialize the hashtable in advance is so a later probe to > an empty table element yields {}, versus repeating the question back. > > > > When a function is defined in this way as lots of individual values, > > Mathematica uses techniques (probably hash) to access the values. Try it > > and see how efficient it is! > > Mathematica does uses hash tables internally, to good effect. There is > a discussion of this with respect to internal implementation of sparse > arrays, but I assume that hash tables are far more widely used inside > Mathematica. > > > For the record, here is my core hashtable algorithm, with transient > details omitted: > > HashIndexModulus = 10000; > > HashTable=Table[{},{HashIndexModulus}]; > > HashValuesList = Map[HashingFunction, dataset]; > > Do[ > Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i, > HashValuesList[[j]]} > ], > {j, Length[HashValuesList]} > ]; > > The dataset in the above case is a list containing thousands of strings, > the strings being on average 100 characters long. > > The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about > 1200 strings, and the computation time does not much vary with the size > of the hashtable. I would expect it to instead vary directly with the > total number of characters in the dataset. > > Above 10^4 elements, creating the empty hashtable is linear in the > length of the hashtable, costing on average 53 to 54 nanoseconds per > element. Below 10^4, elements, one can see evidence of internal > gearshifting within the kernel. In any event, it takes 55 milliseconds > to init a 10^6 element hashtable, so this isn't a problem. > > Joe Gwinn > >