MathGroup Archive: June 2009 [00047]


Re: matching misspelled names

To: mathgroup at smc.vnet.net
Subject: [mg100343] Re: [mg100306] matching misspelled names
From: Bob Hanlon <hanlonr at cox.net>
Date: Mon, 1 Jun 2009 07:10:28 -0400 (EDT)
Reply-to: hanlonr at cox.net

Look at guide/SequenceAlignmentAndComparison in the documentation.

DamerauLevenshteinDistance[u,v] gives the number of one-element deletions, insertions, substitutions and transpositions required to transform u to v.

EditDistance[u,v] gives the number of one-element deletions, insertions, and substitutions required to transform u to v.
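
As a small illustration of how these two distances differ, here is a sketch using made-up example strings (the names are hypothetical, not taken from the original lists). The only difference shown is how a swap of two adjacent letters is counted.

```mathematica
(* Levenshtein distance: insertions, deletions and substitutions *)
EditDistance["kitten", "sitting"]                         (* 3 *)

(* Two swapped adjacent letters count as two substitutions here ... *)
EditDistance["Brian", "Brain"]                            (* 2 *)

(* ... but as a single transposition here *)
DamerauLevenshteinDistance["Brian", "Brain"]              (* 1 *)

(* Both functions accept the IgnoreCase option *)
EditDistance["McDonald", "Mcdonald"]                      (* 1 *)
EditDistance["McDonald", "Mcdonald", IgnoreCase -> True]  (* 0 *)
```

For typed names, where swapped letters are a common slip, DamerauLevenshteinDistance is probably the more forgiving of the two.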

HammingDistance[u,v] gives the number of elements whose values disagree in u and v.
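
A brief sketch with hypothetical inputs: HammingDistance compares position by position, so it is meant for sequences of equal length. For names of differing lengths, EditDistance is likely the better fit.

```mathematica
(* Same length; only the o/u and e/a positions disagree *)
HammingDistance["Mohammed", "Muhammad"]        (* 2 *)

(* Works on lists as well as strings *)
HammingDistance[{1, 0, 1, 1}, {1, 1, 1, 0}]    (* 2 *)
```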

LongestCommonSequence[s1,s2] finds the longest sequence of contiguous or disjoint elements common to the strings or lists s1 and s2.

LongestCommonSubsequence[s1,s2] finds the longest contiguous subsequence of elements common to the strings or lists s1 and s2.
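
The two names are easy to confuse, so here is a short sketch on a pair of hypothetical spellings. The first call must return one unbroken run, while the second may skip over letters.

```mathematica
(* Contiguous: the longest unbroken run shared by both strings *)
LongestCommonSubsequence["Katherine", "Kathryn"]   (* "Kath" *)

(* Contiguous or disjoint: letters may be skipped, order is preserved *)
LongestCommonSequence["Katherine", "Kathryn"]      (* "Kathrn" *)
```

Comparing the length of the LongestCommonSequence result with the lengths of the inputs gives a rough similarity measure that tolerates dropped or inserted letters.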

NeedlemanWunschSimilarity[u,v] finds an optimal global alignment between the elements of u and v, and returns the number of one-element matches.

SequenceAlignment[s1,s2] finds an optimal alignment of sequences of elements in the strings or lists s1 and s2, and yields a list of successive matching and differing sequences.
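
A minimal sketch of how the alignment can be inspected, again with hypothetical names. In the result, runs shared by both strings appear as plain strings, and runs that differ appear as pairs holding the piece from each input.

```mathematica
(* Align two spellings of the same name *)
alignment = SequenceAlignment["Katherine", "Catherine"]

(* Keep only the differing pieces, which show up as two-element lists *)
Cases[alignment, {_, _}]
```

This can be handy for seeing where two candidate names disagree, rather than only how much.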

SmithWatermanSimilarity[u,v] finds an optimal local alignment between the elements of u and v, and returns the number of one-element matches.
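
To connect these functions to the shortlisting task in the quoted message, here is one possible sketch, not a complete solution. The function name shortlist, the threshold maxDist and the sample lists are all assumptions made for illustration. It does not attempt to handle missing middle names, which would need something like a comparison of the individual name parts.

```mathematica
(* Hypothetical helper: for each name in a, keep the names in b that lie
   within maxDist edits, ignoring case *)
shortlist[a_List, b_List, maxDist_Integer] :=
  Table[
    name -> Select[b, EditDistance[name, #, IgnoreCase -> True] <= maxDist &],
    {name, a}]

(* Made-up sample data *)
listA = {"Mohammed Ali", "Katherine Smith"};
listB = {"Muhammad Ali", "Catherine Smith", "John Brown"};

shortlist[listA, listB, 2]
(* {"Mohammed Ali" -> {"Muhammad Ali"},
    "Katherine Smith" -> {"Catherine Smith"}} *)
```

This compares every pair of names, so for very large lists it will be slow. Swapping in DamerauLevenshteinDistance, or a score from SmithWatermanSimilarity for partial names, only changes the test inside Select.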


Bob Hanlon

---- Jess <jesscobrien at gmail.com> wrote: 

=============
Hi,

I would like to compare 2 very large lists of names to identify a
shortlist of possible matches where someone from list A appears in
list B.

However, as English is not the local language, most names have many
spelling alternatives. Also, in different contexts, the same person is
referred to by the full name with one or more middle names and family
names, or just by a smaller combination of these. I imagine comparing
lists with one or a few typos is quite simple. But is there a way to do
this in Mathematica which can also handle the type of variations I've
outlined?

I was thinking of arranging the names into clusters, isolating those
clusters which include a list A person, and then generating lists of
the closest matches for each cluster around a list A person. Is there
a simple way to do this or a better way?

Thanks,
Jess
