MathGroup Archive: June 2009 [00092]

[Date Index] [Thread Index] [Author Index]

Re: matching misspelled names

To: mathgroup at smc.vnet.net
Subject: [mg100408] Re: [mg100306] matching misspelled names
From: Leonid Shifrin <lshifr at gmail.com>
Date: Tue, 2 Jun 2009 06:50:04 -0400 (EDT)
References: <200905311037.GAA17027@smc.vnet.net>

Hi Jess,

You will really need some metric that will determine you how close any 2
names are to each other. In this metric you will have to encode the
specifics related to middle names etc. Then, for each of the names in  the
list A you can sort the names in the list B according to this metric, and
take some number of the "closest" ones. The complexity of this algorithm
will be MN Log N where M is a length of the first list and N the length of
the second.

The complexity may be improved if you do it a little differently: hash the
names in the first list, merge both lists together, sort according to the
metric as above (just once), then traverse the sorted list and pick some
neighbor elements around each one which is originally from  the first list
(this you check with the above-mentioned hash table), and then from each
found sublist of names  remove the names from the first list that were
mistakingly included in the results by this algorithm.  In this case, the
complexity will be something of the order of C1*(M+N)Log(M+N) +C2*M, I
think.

Alternatively you could construct a matrix of "name distances" and use
clustering (via FindClusters, it has a form that accepts the distance
matrix), but I don't think this will be easier, since this is a somewhat
different problem and you may have a hard time disentangling and ordering
the clustering results.

Another alternative would be to use one of the training algorithms and train
it on a large body of valid names of the type found in list A, and then
apply the "spelling corrector" to a list B. A really simple but effective
example of this approach was given by Peter Norvig: <
http://norvig.com/spell-correct.html> (a good read for your problem in any
case, I think). I have translated his code into Mathematica some while ago,
so if you are interested let me know and I will send you the code.

Regards,
Leonid

On Sun, May 31, 2009 at 3:37 AM, Jess <jesscobrien at gmail.com> wrote:

> Hi,
>
> I would like to compare 2 very large lists of names to identify a
> shortlist of possible matches where someone from the list A appears in
> the list B.
>
> However as English is not the local language, the most names have many
> spelling alternatives. Also in different contexts, the same person is
> referred to by the full name with one or more middle names and family
> names or just by a smaller combination of these. I imagine comparing
> lists with one or few typos is quite simple. But is there a way to do
> this in Mathematica which can also handle the type of variations I've
> outlined?
>
> I was thinking of arranging the names into clusters, isolating those
> clusters which include a list A person, and then generating lists of
> the closest matches for each cluster around a list A person. Is there
> a simple way to do this or a better way?
>
> Thanks,
> Jess
>
>

Prev by Date: Reverse Geocoding code

Next by Date: Re: Image[], Graphics[Raster[]]

Previous by thread: Re: matching misspelled names

Next by thread: proper way of posing a non-autonomous ode in mathematica?