Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106127] Re: Database Challenge
- From: Bill Rowe <readnews at sbcglobal.net>
- Date: Sat, 2 Jan 2010 05:05:22 -0500 (EST)
On 1/1/10 at 5:37 AM, nkormanik at gmail.com (Nicholas Kormanik) wrote: >There are 12 records in this mini database. Two columns. First >column are social security numbers. Second column are names. >Unfortunately Jane Doe appears three times, with three different >versions of her name, but having the same social security number. >Challenge: Remove the duplicates, where social security is the >same, and keep any one of the names. Final result will be whittled >down to 10 records. >(Real life problem has 6.5 million records, and lots of duplicates, >with various versions of names.) >025-60-4044 joe average >004-16-4077 jane doe >014-27-9076 mike smith >098-43-2098 rodolfo pilas >073-15-6005 gustavo boksar >004-16-4077 jane a. doe >147-79-9074 bea busaniche >165-63-0189 pablo medrano >124-96-7092 jeff aaron >004-16-4077 jane anne doe >172-30-6069 michael peters >059-85-1062 leroy baker =46irst, transform the database into a Mathematica list. This can probably be easily accomplished by reading in the data using Import. I say probably since you have provided no details of the file format. In any case once the database is put into the form of a Mathematica list such as: data= { {"025-60-4044","joe average"}, {"004-16-4077","jane doe"}, {"014-27-9076","mike smith"}, {"098-43-2098","rodolfo pilas"}, {"073-15-6005","gustavo boksar"}, {"004-16-4077","jane a. doe"}, {"147-79-9074","bea busaniche"}, {"165-63-0189","pablo medrano"}, {"124-96-7092","jeff aaron"}, {"004-16-4077","jane anne doe"}, {"172-30-6069","michael peters"}, {"059-85-1062","leroy baker"}}; The desired result can be achieved in one line: In[2]:= First /@ GatherBy[data, First] Out[2]= {{"025-60-4044", "joe average"}, {"004-16-4077", "jane doe"}, {"014-27-9076", "mike smith"}, {"098-43-2098", "rodolfo pilas"}, {"073-15-6005", "gustavo boksar"}, {"147-79-9074", "bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"124-96-7092", "jeff aaron"}, {"172-30-6069", "michael peters"}, {"059-85-1062", "leroy baker"}} Here, I've assumed you are using version 7 of Mathematica. If not, you will need to sort the data base by the first item, break them into sets using Split then take the first. Something like: In[3]:= First /@ Split[data[[Ordering[First /@ data]]], StringMatchQ[#1[[1]], #2[[1]]] &] Out[3]= {{"004-16-4077", "jane doe"}, {"014-27-9076", "mike smith"}, {"025-60-4044", "joe average"}, {"059-85-1062", "leroy baker"}, {"073-15-6005", "gustavo boksar"}, {"098-43-2098", "rodolfo pilas"}, {"124-96-7092", "jeff aaron"}, {"147-79-9074", "bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"172-30-6069", "michael peters"}}