Re: How best to implement a hash table in Mathematica
- To: mathgroup at smc.vnet.net
- Subject: [mg125102] Re: How best to implement a hash table in Mathematica
- From: A Retey <awnl at gmx-topmail.de>
- Date: Tue, 21 Feb 2012 06:16:12 -0500 (EST)
- Delivered-to: l-mathgroup@mail-archive0.wolfram.com
- References: <jho21j$5sb$1@smc.vnet.net> <jhqmrr$fbq$1@smc.vnet.net> <jhstsv$nk0$1@smc.vnet.net>
Hi,
> The table has to be updated more or less at random as the dataset is
> digested. What I'm doing is essentially your above method, but not all
> elements will ever receive a value. In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
Honestly, despite your additional explanations I don't feel I fully
understand what you are trying to do, but the following assumes that it
is essential that you use your own hash function instead of what
Mathematica would internally use.
You should note that David's suggestion to use DownValues to store the
key/value pairs will allow for returning empty lists for the keys that
have not been met yet, even without preallocating entries or a priori
knowledge of the hash index modulus. You would simply "initialize" with
a blank pattern (see second line):
ClearAll[hashtable];
hashtable[_] = {};
Module[{index},
Scan[(
index = hashingfunction[#];
hashtable[index] = Union[Join[hashtable[index], {#}]]
) &,
dataset
];
]
It is difficult to make a statement about runtime performance compared
to your approach, since it will depend on the hash function you are
using. But I see some advantages beside runtime performance: the
approach will work without knowledge of the hash index modulus and it
will be more memory efficient if the modulus is very large and thus
there are many empty values (in general I think that a large modulus is
what you usually are after when looking for a good hash function).
You might want to try the above with your hash function, with
hashingfunction = Hash;
dataset = DictionaryLookup["*t*"];
the above will take 0.343 Seconds for about 41000 strings on my machine,
which is presumably much slower than your machine. Using StringLength as
hashfunction slows things down considerable, most probably because of
the need to reallocate memory for the growing value arrays. I think that
this is also dominating the runtime of your approach, and because
everything is in one large array even more so. For my example with
hashfunction=StringLength building the hash table using DownValues seems
to be a lot faster...
hth,
albert