Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106119] Re: Database Challenge
- From: David Reiss <dbreiss at gmail.com>
- Date: Sat, 2 Jan 2010 05:03:52 -0500 (EST)
- References: <hhkj95$56r$1@smc.vnet.net>
In[28]:= data = Import["/Users/dreiss/Desktop/data.csv"]
Out[28]= {{"025-60-4044 ", " joe average "},
 {"004-16-4077 ", " jane doe "},
 {"014-27-9076 ", " mike smith "},
 {"098-43-2098 ", " rodolfo pilas "},
 {"073-15-6005 ", " gustavo boksar "},
 {"004-16-4077 ", " jane a. doe "},
 {"147-79-9074 ", " bea busaniche "},
 {"165-63-0189 ", " pablo medrano "},
 {"124-96-7092 ", " jeff aaron "},
 {"004-16-4077 ", " jane anne doe "},
 {"172-30-6069 ", " michael peters "},
 {"059-85-1062 ", " leroy baker "}}
In[31]:= First /@ GatherBy[data, First] // Length
Out[31]= 10
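The expression in In[31] only reports the count. To get the de-duplicated records themselves, drop the trailing // Length. The following is a minimal sketch rather than a definitive solution: it enters the 12 records inline instead of importing the file, and it trims the stray spaces first so that SSNs differing only in whitespace would still be grouped together. The names clean and unique are illustrative and do not appear in the original post.

```mathematica
(* The 12 records from Out[28], entered inline so no CSV file is needed *)
data = {{"025-60-4044 ", " joe average "}, {"004-16-4077 ", " jane doe "},
   {"014-27-9076 ", " mike smith "}, {"098-43-2098 ", " rodolfo pilas "},
   {"073-15-6005 ", " gustavo boksar "}, {"004-16-4077 ", " jane a. doe "},
   {"147-79-9074 ", " bea busaniche "}, {"165-63-0189 ", " pablo medrano "},
   {"124-96-7092 ", " jeff aaron "}, {"004-16-4077 ", " jane anne doe "},
   {"172-30-6069 ", " michael peters "}, {"059-85-1062 ", " leroy baker "}};

(* Strip leading/trailing whitespace from every string (level 2 of the list) *)
clean = Map[StringTrim, data, {2}];

(* Group rows by SSN (first column), then keep the first row of each group *)
unique = First /@ GatherBy[clean, First];

(* Should agree with Out[31] *)
Length[unique]
```

GatherBy keeps groups in order of first appearance, so the row retained for 004-16-4077 is the first one encountered, {"004-16-4077", "jane doe"}. Any of the three names satisfies the challenge as stated; replacing First with Last would keep "jane anne doe" instead.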
On Jan 1, 5:37 am, Nicholas Kormanik <nkorma... at gmail.com> wrote:
> There are 12 records in this mini database, in two columns. The first
> column holds social security numbers; the second holds names.
> Unfortunately, Jane Doe appears three times, with three different
> versions of her name but the same social security number.
>
> Challenge: Remove the duplicates, where the social security number is
> the same, and keep any one of the names. The final result will be
> whittled down to 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker