Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106119] Re: Database Challenge
- From: David Reiss <dbreiss at gmail.com>
- Date: Sat, 2 Jan 2010 05:03:52 -0500 (EST)
- References: <hhkj95$56r$1@smc.vnet.net>
In[28]:= data = Import["/Users/dreiss/Desktop/data.csv"]

Out[28]= {{"025-60-4044 ", " joe average "}, {"004-16-4077 ", " jane doe "},
 {"014-27-9076 ", " mike smith "}, {"098-43-2098 ", " rodolfo pilas "},
 {"073-15-6005 ", " gustavo boksar "}, {"004-16-4077 ", " jane a. doe "},
 {"147-79-9074 ", " bea busaniche "}, {"165-63-0189 ", " pablo medrano "},
 {"124-96-7092 ", " jeff aaron "}, {"004-16-4077 ", " jane anne doe "},
 {"172-30-6069 ", " michael peters "}, {"059-85-1062 ", " leroy baker "}}

In[31]:= First /@ GatherBy[data, First] // Length

Out[31]= 10

On Jan 1, 5:37 am, Nicholas Kormanik <nkorma... at gmail.com> wrote:
> There are 12 records in this mini database. Two columns. First
> column are social security numbers. Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge: Remove the duplicates, where social security is the same,
> and keep any one of the names. Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker
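
Note that In[31] ends in // Length, so it only confirms the record count of 10. Dropping // Length returns the deduplicated rows themselves. The following is a minimal sketch of that variation, not part of the original reply. It assumes the imported fields carry the stray leading and trailing spaces visible in Out[28], and it uses a hypothetical three-row sample in place of the real file. StringTrim and the intermediate names clean and deduped are additions for illustration.

```mathematica
(* Hypothetical sample: three rows copied from Out[28], padding included *)
data = {{"025-60-4044 ", " joe average "},
        {"004-16-4077 ", " jane doe "},
        {"004-16-4077 ", " jane a. doe "}};

(* Assumption: padding should be removed so that keys differing only
   in whitespace are treated as the same social security number *)
clean = Map[StringTrim, data, {2}];

(* Group rows by the first column, then keep one row from each group *)
deduped = First /@ GatherBy[clean, First]
```

Since the challenge accepts any one of the name variants, taking the first row of each group is enough. Whether this approach scales comfortably to 6.5 million records has not been tested here.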