Re: How best to implement a hash table in Mathematica
- To: mathgroup at smc.vnet.net
- Subject: [mg125122] Re: How best to implement a hash table in Mathematica
- From: Joseph Gwinn <joegwinn at comcast.net>
- Date: Wed, 22 Feb 2012 05:33:36 -0500 (EST)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <jho21j$5sb$1@smc.vnet.net> <jhqmrr$fbq$1@smc.vnet.net> <jhstsv$nk0$1@smc.vnet.net> <jhvulm$88e$1@smc.vnet.net>
In article <jhvulm$88e$1 at smc.vnet.net>, A Retey <awnl at gmx-topmail.de> wrote: > Hi, > > > The table has to be updated more or less at random as the dataset is > > digested. What I'm doing is essentially your above method, but not all > > elements will ever receive a value. In fact, most will not. > > > > The reason to initialize the hashtable in advance is so a later probe to > > an empty table element yields {}, versus repeating the question back. > > Honestly, despite your additional explanations I don't feel I fully > understand what you are trying to do, but the following assumes that it > is essential that you use your own hash function instead of what > Mathematica would internally use. Two reasons. First, the nature of the hash function should not affect how the hash table is built and subsequently probed. Second, hash functions have mathematical properties, and not all hash functions are suited to any given hash function. For instance, for handling many kinds of data, like natural language text, order matters, so the permutations of "abc" must all be distinct. Well, three reasons: The hash functions like md5 are far too heavy for the current purposes. > You should note that David's suggestion to use DownValues to store the > key/value pairs will allow for returning empty lists for the keys that > have not been met yet, even without preallocating entries or a priori > knowledge of the hash index modulus. You would simply "initialize" with > a blank pattern (see second line): > > ClearAll[hashtable]; > hashtable[_] = {}; > Module[{index}, > Scan[( > index = hashingfunction[#]; > hashtable[index] = Union[Join[hashtable[index], {#}]] > ) &, > dataset > ]; > ] > > It is difficult to make a statement about runtime performance compared > to your approach, since it will depend on the hash function you are > using. But I see some advantages beside runtime performance: the > approach will work without knowledge of the hash index modulus and it > will be more memory efficient if the modulus is very large and thus > there are many empty values (in general I think that a large modulus is > what you usually are after when looking for a good hash function). > > You might want to try the above with your hash function, with > > hashingfunction = Hash; > dataset = DictionaryLookup["*t*"]; > > the above will take 0.343 Seconds for about 41000 strings on my machine, > which is presumably much slower than your machine. Using StringLength as > hashfunction slows things down considerable, most probably because of > the need to reallocate memory for the growing value arrays. I think that > this is also dominating the runtime of your approach, and because > everything is in one large array even more so. For my example with > hashfunction=StringLength building the hash table using DownValues seems > to be a lot faster... I have an adequate solution for my present purposes, but wanted to know if there were better ways. I've gotten some good suggestions. I plan to try them all and publish the results on this newsgroup. Joe Gwinn