Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106128] Re: Database Challenge
- From: dh <dh at metrohm.com>
- Date: Sat, 2 Jan 2010 05:05:34 -0500 (EST)
- References: <hhkj95$56r$1@smc.vnet.net>
Hi Nicholas,

you may achieve this using Union with SameTest:

    dat = {
      {"025-60-4044" , "joe average"},
      {"004-16-4077 ", "jane doe"},
      {"014-27-9076" , " mike smith"},
      {"098-43-2098 " , "rodolfo pilas"},
      {"073-15-6005 " , "gustavo boksar"},
      {"004-16-4077 " , "jane a.doe"},
      {"147-79-9074 " , "bea busaniche"},
      {"165-63-0189 " , "pablo medrano"},
      {"124-96-7092 " , "jeff aaron"},
      {"004-16-4077 " , "jane anne doe"},
      {"172-30-6069" , " michael peters"},
      {"059-85-1062" , " leroy baker"}
    };

    Union[dat, SameTest -> (#1[[1]] == #2[[1]] &)]

Daniel

On 1 Jan., 11:37, Nicholas Kormanik <nkorma... at gmail.com> wrote:
> There are 12 records in this mini database.  Two columns.  First
> column are social security numbers.  Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge:  Remove the duplicates, where social security is the same,
> and keep any one of the names.  Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker
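
A possible variation for the 6.5 million record case: Union with an explicit SameTest compares records pairwise and also sorts the result, so it can become slow on large inputs. The sketch below groups records by their key with GatherBy instead and keeps the first record of each group. The helper name dedupeBySSN and the small sample list are made up for illustration, and the StringTrim call assumes that stray leading or trailing spaces in the first column should not count as a difference. It has not been timed on data of the real size.

```mathematica
(* Hypothetical helper: keep the first record seen for each SSN. *)
(* StringTrim strips surrounding whitespace so that "004-16-4077 " *)
(* and "004-16-4077" are treated as the same key (an assumption). *)
dedupeBySSN[records_List] :=
  First /@ GatherBy[records, StringTrim[First[#]] &]

(* Illustrative sample, not the original data set *)
sample = {
  {"025-60-4044", "joe average"},
  {"004-16-4077 ", "jane doe"},
  {"004-16-4077", "jane a. doe"},
  {"014-27-9076", "mike smith"}
};

dedupeBySSN[sample]
(* {{"025-60-4044", "joe average"}, {"004-16-4077 ", "jane doe"},
    {"014-27-9076", "mike smith"}} *)
```

GatherBy keeps groups in order of first appearance, so unlike Union the surviving records stay in their original order, and the name kept for each number is the first one encountered.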