Re: matching misspelled names
- To: mathgroup at smc.vnet.net
- Subject: [mg100408] Re: [mg100306] matching misspelled names
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Tue, 2 Jun 2009 06:50:04 -0400 (EDT)
- References: <200905311037.GAA17027@smc.vnet.net>
Hi Jess, You will really need some metric that will determine you how close any 2 names are to each other. In this metric you will have to encode the specifics related to middle names etc. Then, for each of the names in the list A you can sort the names in the list B according to this metric, and take some number of the "closest" ones. The complexity of this algorithm will be MN Log N where M is a length of the first list and N the length of the second. The complexity may be improved if you do it a little differently: hash the names in the first list, merge both lists together, sort according to the metric as above (just once), then traverse the sorted list and pick some neighbor elements around each one which is originally from the first list (this you check with the above-mentioned hash table), and then from each found sublist of names remove the names from the first list that were mistakingly included in the results by this algorithm. In this case, the complexity will be something of the order of C1*(M+N)Log(M+N) +C2*M, I think. Alternatively you could construct a matrix of "name distances" and use clustering (via FindClusters, it has a form that accepts the distance matrix), but I don't think this will be easier, since this is a somewhat different problem and you may have a hard time disentangling and ordering the clustering results. Another alternative would be to use one of the training algorithms and train it on a large body of valid names of the type found in list A, and then apply the "spelling corrector" to a list B. A really simple but effective example of this approach was given by Peter Norvig: < http://norvig.com/spell-correct.html> (a good read for your problem in any case, I think). I have translated his code into Mathematica some while ago, so if you are interested let me know and I will send you the code. Regards, Leonid On Sun, May 31, 2009 at 3:37 AM, Jess <jesscobrien at gmail.com> wrote: > Hi, > > I would like to compare 2 very large lists of names to identify a > shortlist of possible matches where someone from the list A appears in > the list B. > > However as English is not the local language, the most names have many > spelling alternatives. Also in different contexts, the same person is > referred to by the full name with one or more middle names and family > names or just by a smaller combination of these. I imagine comparing > lists with one or few typos is quite simple. But is there a way to do > this in Mathematica which can also handle the type of variations I've > outlined? > > I was thinking of arranging the names into clusters, isolating those > clusters which include a list A person, and then generating lists of > the closest matches for each cluster around a list A person. Is there > a simple way to do this or a better way? > > Thanks, > Jess > >