Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106127] Re: Database Challenge
- From: Bill Rowe <readnews at sbcglobal.net>
- Date: Sat, 2 Jan 2010 05:05:22 -0500 (EST)
On 1/1/10 at 5:37 AM, nkormanik at gmail.com (Nicholas Kormanik) wrote:
>There are 12 records in this mini database. Two columns. First
>column are social security numbers. Second column are names.
>Unfortunately Jane Doe appears three times, with three different
>versions of her name, but having the same social security number.
>Challenge: Remove the duplicates, where social security is the
>same, and keep any one of the names. Final result will be whittled
>down to 10 records.
>(Real life problem has 6.5 million records, and lots of duplicates,
>with various versions of names.)
>025-60-4044 joe average
>004-16-4077 jane doe
>014-27-9076 mike smith
>098-43-2098 rodolfo pilas
>073-15-6005 gustavo boksar
>004-16-4077 jane a. doe
>147-79-9074 bea busaniche
>165-63-0189 pablo medrano
>124-96-7092 jeff aaron
>004-16-4077 jane anne doe
>172-30-6069 michael peters
>059-85-1062 leroy baker
=46irst, transform the database into a Mathematica list. This can
probably be easily accomplished by reading in the data using
Import. I say probably since you have provided no details of the
file format. In any case once the database is put into the form
of a Mathematica list such as:
data=
{
{"025-60-4044","joe average"},
{"004-16-4077","jane doe"},
{"014-27-9076","mike smith"},
{"098-43-2098","rodolfo pilas"},
{"073-15-6005","gustavo boksar"},
{"004-16-4077","jane a. doe"},
{"147-79-9074","bea busaniche"},
{"165-63-0189","pablo medrano"},
{"124-96-7092","jeff aaron"},
{"004-16-4077","jane anne doe"},
{"172-30-6069","michael peters"},
{"059-85-1062","leroy baker"}};
The desired result can be achieved in one line:
In[2]:= First /@ GatherBy[data, First]
Out[2]= {{"025-60-4044", "joe average"}, {"004-16-4077",
"jane doe"}, {"014-27-9076", "mike smith"}, {"098-43-2098",
"rodolfo pilas"}, {"073-15-6005", "gustavo boksar"}, {"147-79-9074",
"bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"124-96-7092",
"jeff aaron"}, {"172-30-6069", "michael peters"}, {"059-85-1062",
"leroy baker"}}
Here, I've assumed you are using version 7 of Mathematica. If
not, you will need to sort the data base by the first item,
break them into sets using Split then take the first. Something like:
In[3]:= First /@
Split[data[[Ordering[First /@ data]]],
StringMatchQ[#1[[1]], #2[[1]]] &]
Out[3]= {{"004-16-4077", "jane doe"}, {"014-27-9076",
"mike smith"}, {"025-60-4044", "joe average"}, {"059-85-1062",
"leroy baker"}, {"073-15-6005", "gustavo boksar"}, {"098-43-2098",
"rodolfo pilas"}, {"124-96-7092", "jeff aaron"}, {"147-79-9074",
"bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"172-30-6069",
"michael peters"}}