Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106124] Re: [mg106106] Database Challenge
- From: Adriano Pascoletti <adriano.pascoletti at dimi.uniud.it>
- Date: Sat, 2 Jan 2010 05:04:46 -0500 (EST)
- References: <201001011037.FAA05389@smc.vnet.net>
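Copy and paste the records from your message into the string data, then
group the lines by their first 11 characters (the social security number)
and keep the first line of each group: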
First /@ GatherBy[ImportString[data, "Lines"], StringTake[#, 11] &]
data = "025-60-4044 joe average
004-16-4077 jane doe
014-27-9076 mike smith
098-43-2098 rodolfo pilas
073-15-6005 gustavo boksar
004-16-4077 jane a. doe
147-79-9074 bea busaniche
165-63-0189 pablo medrano
124-96-7092 jeff aaron
004-16-4077 jane anne doe
172-30-6069 michael peters
059-85-1062 leroy baker";
First /@ GatherBy[ImportString[data, "Lines"], StringTake[#1, 11] &]
{"025-60-4044 joe average", "004-16-4077 jane doe",
"014-27-9076 mike smith", "098-43-2098 rodolfo pilas",
"073-15-6005 gustavo boksar", "147-79-9074 bea busaniche",
"165-63-0189 pablo medrano", "124-96-7092 jeff aaron",
"172-30-6069 michael peters", "059-85-1062 leroy baker"}
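A possible alternative, sketched under the assumption that Mathematica 10 or later is available: DeleteDuplicatesBy keeps the first element for each distinct key, which is the same "keep any one of the names" rule as First /@ GatherBy[...]. The names records and ssn below are only illustrative, and the short list is a subset of the data above so that the snippet stands alone. As before, it assumes every record starts with an 11-character social security number.

```mathematica
(* a few of the records from above, including the three "jane doe" variants *)
records = {"025-60-4044 joe average", "004-16-4077 jane doe",
   "014-27-9076 mike smith", "004-16-4077 jane a. doe",
   "004-16-4077 jane anne doe"};

(* key function: the first 11 characters, i.e. the social security number *)
ssn = StringTake[#, 11] &;

(* keep the first record seen for each key, preserving the input order *)
DeleteDuplicatesBy[records, ssn]
(* {"025-60-4044 joe average", "004-16-4077 jane doe", "014-27-9076 mike smith"} *)
```

On the full 12-record data this should give the same 10 lines as the GatherBy expression. How it behaves on 6.5 million records has not been measured here.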
Adriano Pascoletti
2010/1/1 Nicholas Kormanik <nkormanik at gmail.com>
>
> There are 12 records in this mini database, in two columns. The first
> column holds social security numbers; the second holds names.
> Unfortunately, Jane Doe appears three times, with three different
> versions of her name but the same social security number.
>
> Challenge: Remove the duplicates, where the social security number is
> the same, and keep any one of the names. The final result will be
> whittled down to 10 records.
>
> (The real-life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker
>
>