Re: matching misspelled names
- To: mathgroup at smc.vnet.net
- Subject: [mg100343] Re: [mg100306] matching misspelled names
- From: Bob Hanlon <hanlonr at cox.net>
- Date: Mon, 1 Jun 2009 07:10:28 -0400 (EDT)
- Reply-to: hanlonr at cox.net
Look at guide/SequenceAlignmentAndComparison

DamerauLevenshteinDistance[u,v] gives the number of one-element deletions, insertions, substitutions and transpositions required to transform u to v.

EditDistance[u,v] gives the number of one-element deletions, insertions, and substitutions required to transform u to v.

HammingDistance[u,v] gives the number of elements whose values disagree in u and v.

LongestCommonSequence[s1,s2] finds the longest sequence of contiguous or disjoint elements common to the strings or lists s1 and s2.

LongestCommonSubsequence[s1,s2] finds the longest contiguous subsequence of elements common to the strings or lists s1 and s2.

NeedlemanWunschSimilarity[u,v] finds an optimal global alignment between the elements of u and v, and returns the number of one-element matches.

SequenceAlignment[s1,s2] finds an optimal alignment of sequences of elements in the strings or lists s1 and s2, and yields a list of successive matching and differing sequences.

SmithWatermanSimilarity[u,v] finds an optimal local alignment between the elements of u and v, and returns the number of one-element matches.

Bob Hanlon

---- Jess <jesscobrien at gmail.com> wrote:

=============
Hi,

I would like to compare two very large lists of names to identify a shortlist of possible matches where someone from list A appears in list B. However, as English is not the local language, most names have many spelling alternatives. Also, in different contexts the same person is referred to by the full name, with one or more middle names and family names, or just by a smaller combination of these. I imagine comparing lists with one or a few typos is quite simple. But is there a way to do this in Mathematica which can also handle the type of variations I've outlined?
I was thinking of arranging the names into clusters, isolating those clusters which include a list A person, and then generating lists of the closest matches for each cluster around a list A person. Is there a simple way to do this or a better way?

Thanks,
Jess
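
For illustration only, here is a minimal sketch of how one of the distance functions listed in the reply might be applied to the problem in the question. It is not from the original post. The sample names, the helper names (tokens, tokenDistance, nameDistance, shortlist) and the threshold of 1 are all hypothetical. The idea is to split each name into lower-case tokens, match every token of the shorter name to its closest token in the longer name with DamerauLevenshteinDistance, and average those distances, so that a missing middle name is not penalised.

```mathematica
(* Hypothetical sample data; the real lists would be far larger *)
listA = {"Mohammed Al Rashid", "Katarzyna Nowak"};
listB = {"Muhammad Rashid", "Katarzina Novak", "John Smith",
   "Mohamed Al-Rashid"};

(* Normalise a name: lower-case it, then split on whitespace or hyphens *)
tokens[name_String] :=
  StringSplit[ToLowerCase[name], (WhitespaceCharacter | "-") ..]

(* Distance from one token to the closest token of another name *)
tokenDistance[t_String, others_List] :=
  Min[DamerauLevenshteinDistance[t, #] & /@ others]

(* Average best-match distance over the tokens of the shorter name,
   so extra middle or family names in the longer name are ignored.
   Assumes both names contain at least one token. *)
nameDistance[a_String, b_String] :=
  Module[{ta = tokens[a], tb = tokens[b], short, long},
   {short, long} = If[Length[ta] <= Length[tb], {ta, tb}, {tb, ta}];
   N[Mean[tokenDistance[#, long] & /@ short]]]

(* Candidates in bs whose distance to a is at most maxDist;
   maxDist is an assumed tuning parameter, not a recommended value *)
shortlist[a_String, bs_List, maxDist_] :=
  Select[bs, nameDistance[a, #] <= maxDist &]

(* One shortlist of list B candidates per list A name *)
shortlist[#, listB, 1] & /@ listA
```

This compares every pair of names, which may be too slow for very large lists. A first pass that groups names by some cheap key, for example a shared token, would reduce the number of comparisons. The threshold would need tuning against names already known to match.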