MathGroup Archive: February 2012 [00427]

[Date Index] [Thread Index] [Author Index]

Re: How best to implement a hash table in Mathematica

To: mathgroup at smc.vnet.net
Subject: [mg125122] Re: How best to implement a hash table in Mathematica
From: Joseph Gwinn <joegwinn at comcast.net>
Date: Wed, 22 Feb 2012 05:33:36 -0500 (EST)
Delivered-to: l-mathgroup@mail-archive0.wolfram.com
References: <jho21j$5sb$1@smc.vnet.net> <jhqmrr$fbq$1@smc.vnet.net> <jhstsv$nk0$1@smc.vnet.net> <jhvulm$88e$1@smc.vnet.net>

In article <jhvulm$88e$1 at smc.vnet.net>, A Retey <awnl at gmx-topmail.de> 
wrote:

> Hi,
> 
> > The table has to be updated more or less at random as the dataset is
> > digested.  What I'm doing is essentially your above method, but not all
> > elements will ever receive a value.  In fact, most will not.
> >
> > The reason to initialize the hashtable in advance is so a later probe to
> > an empty table element yields {}, versus repeating the question back.
> 
> Honestly, despite your additional explanations I don't feel I fully 
> understand what you are trying to do, but the following assumes that it 
> is essential that you use your own hash function instead of what 
> Mathematica would internally use.

Two reasons.  First, the nature of the hash function should not affect 
how the hash table is built and subsequently probed.  Second, hash 
functions have mathematical properties, and not all hash functions are 
suited to any given hash function.  For instance, for handling many 
kinds of data, like natural language text, order matters, so the 
permutations of "abc" must all be distinct.  Well, three reasons:  The 
hash functions like md5 are far too heavy for the current purposes.

 
> You should note that David's suggestion to use DownValues to store the 
> key/value pairs will allow for returning empty lists for the keys that 
> have not been met yet, even without preallocating entries or a priori 
> knowledge of the hash index modulus. You would simply "initialize" with 
> a blank pattern (see second line):
> 
> ClearAll[hashtable];
> hashtable[_] = {};
> Module[{index},
>   Scan[(
>       index = hashingfunction[#];
>       hashtable[index] = Union[Join[hashtable[index], {#}]]
>       ) &,
>     dataset
>   ];
> ]
> 
> It is difficult to make a statement about runtime performance compared 
> to your approach, since it will depend on the hash function you are 
> using. But I see some advantages beside runtime performance: the 
> approach will work without knowledge of the hash index modulus and it 
> will be more memory efficient if the modulus is very large and thus 
> there are many empty values (in general I think that a large modulus is 
> what you usually are after when looking for a good hash function).
> 
> You might want to try the above with your hash function, with
> 
> hashingfunction = Hash;
> dataset = DictionaryLookup["*t*"];
> 
> the above will take 0.343 Seconds for about 41000 strings on my machine, 
> which is presumably much slower than your machine. Using StringLength as 
> hashfunction slows things down considerable, most probably because of 
> the need to reallocate memory for the growing value arrays. I think that 
> this is also dominating the runtime of your approach, and because 
> everything is in one large array even more so. For my example with 
> hashfunction=StringLength building the hash table using DownValues seems 
> to be a lot faster...

I have an adequate solution for my present purposes, but wanted to know 
if there were better ways.  I've gotten some good suggestions.  I plan 
to try them all and publish the results on this newsgroup.

Joe Gwinn

Prev by Date: Re: How add a menu item with a menu key using an init.m

Next by Date: Re: Complex and Solve

Previous by thread: Re: How best to implement a hash table in Mathematica

Next by thread: Re: How best to implement a hash table in Mathematica