Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106136] Re: [mg106106] Database Challenge
- From: DrMajorBob <btreat1 at austin.rr.com>
- Date: Sat, 2 Jan 2010 05:07:07 -0500 (EST)
- References: <201001011037.FAA05389@smc.vnet.net>
- Reply-to: drmajorbob at yahoo.com
That's a trivial version of the REAL problem.

Suppose the same social occurs for names that aren't similar. Further, suppose that when the social is the same and names ARE similar, each data record includes useful information the others do not. One has a telephone number, another has an address, etc. Suppose the (approximately) same name and address sometimes appears with different socials and phone numbers. Suppose all these problems occur frequently.

Good luck with that!

For the simpler problem you've stated, this does the trick:

data = {{"025-60-4044", "joe average"}, {"004-16-4077", "jane doe"},
   {"014-27-9076", "mike smith"}, {"098-43-2098", "rodolfo pilas"},
   {"073-15-6005", "gustavo boksar"}, {"004-16-4077", "jane a.doe"},
   {"147-79-9074", "bea busaniche"}, {"165-63-0189", "pablo medrano"},
   {"124-96-7092", "jeff aaron"}, {"004-16-4077", "jane anne doe"},
   {"172-30-6069", "michael peters"}, {"059-85-1062", "leroy baker"}};

SplitBy[Sort@data, First][[All, 1]]

{{"004-16-4077", "jane a.doe"}, {"014-27-9076", "mike smith"},
 {"025-60-4044", "joe average"}, {"059-85-1062", "leroy baker"},
 {"073-15-6005", "gustavo boksar"}, {"098-43-2098", "rodolfo pilas"},
 {"124-96-7092", "jeff aaron"}, {"147-79-9074", "bea busaniche"},
 {"165-63-0189", "pablo medrano"}, {"172-30-6069", "michael peters"}}

Bobby

On Fri, 01 Jan 2010 04:37:54 -0600, Nicholas Kormanik <nkormanik at gmail.com> wrote:

> There are 12 records in this mini database. Two columns. First
> column are social security numbers. Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge: Remove the duplicates, where social security is the same,
> and keep any one of the names. Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker

--
DrMajorBob at yahoo.com
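One piece of the harder problem described above, records that share a social but each carry a field the others lack, can at least be sketched. What follows is only an illustration under stated assumptions, not a method from the thread: the four-column layout {ssn, name, phone, address}, the sample values, and the helper names firstKnown and mergeGroup are all hypothetical, and Missing[] is assumed to mark an absent field. It does nothing for the cases where the socials themselves disagree.

```mathematica
(* Hypothetical records: {ssn, name, phone, address}. *)
(* Missing[] marks a field the record does not have (an assumption). *)
records = {
   {"004-16-4077", "jane doe", "555-0100", Missing[]},
   {"004-16-4077", "jane a. doe", Missing[], "12 Elm St"},
   {"014-27-9076", "mike smith", Missing[], Missing[]}
   };

(* First non-missing value in a list, or Missing[] if every value is missing. *)
firstKnown[values_List] :=
  With[{known = DeleteCases[values, _Missing]},
   If[known === {}, Missing[], First[known]]]

(* Collapse a group of records with the same ssn into one record, *)
(* taking the first known value column by column. *)
mergeGroup[group_List] := firstKnown /@ Transpose[group]

(* GatherBy groups by ssn and keeps groups in order of first appearance. *)
merged = mergeGroup /@ GatherBy[records, First]
```

Here the two Jane Doe rows collapse into a single record that carries both the phone number and the address, while the name is simply the first one seen. Unlike the Sort and SplitBy approach, GatherBy does not reorder the data, which may or may not matter for the real file.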
- References:
- Database Challenge
- From: Nicholas Kormanik <nkormanik@gmail.com>