Re: Database Challenge
- To: mathgroup at smc.vnet.net
- Subject: [mg106130] Re: [mg106106] Database Challenge
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Sat, 2 Jan 2010 05:05:57 -0500 (EST)
- References: <201001011037.FAA05389@smc.vnet.net>
Hi Nicholas,

Given your table as a nested list of two columns:

In[1]:= ssnNames = {{"025-60-4044", "004-16-4077", "014-27-9076",
    "098-43-2098", "073-15-6005", "004-16-4077", "147-79-9074",
    "165-63-0189", "124-96-7092", "004-16-4077", "172-30-6069",
    "059-85-1062"},
   {"joe average", "jane doe", "mike smith", "rodolfo pilas",
    "gustavo boksar", "jane a.doe", "bea busaniche", "pablo medrano",
    "jeff aaron", "jane anne doe", "michael peters", "leroy baker"}};

The simplest is probably to use GatherBy to group the rows by SSN and then
pick the first element of each generated sublist (since you don't care
which name is kept for a given SSN):

In[2]:= Transpose[GatherBy[Transpose@ssnNames, First][[All, 1]]]

Out[2]= {{"025-60-4044", "004-16-4077", "014-27-9076", "098-43-2098",
   "073-15-6005", "147-79-9074", "165-63-0189", "124-96-7092",
   "172-30-6069", "059-85-1062"},
  {"joe average", "jane doe", "mike smith", "rodolfo pilas",
   "gustavo boksar", "bea busaniche", "pablo medrano", "jeff aaron",
   "michael peters", "leroy baker"}}

Regards,
Leonid

On Fri, Jan 1, 2010 at 2:37 AM, Nicholas Kormanik <nkormanik at gmail.com> wrote:

>
> There are 12 records in this mini database.  Two columns.  First
> column are social security numbers.  Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge: Remove the duplicates, where social security is the same,
> and keep any one of the names.  Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
>
> 025-60-4044 joe average
> 004-16-4077 jane doe
> 014-27-9076 mike smith
> 098-43-2098 rodolfo pilas
> 073-15-6005 gustavo boksar
> 004-16-4077 jane a. doe
> 147-79-9074 bea busaniche
> 165-63-0189 pablo medrano
> 124-96-7092 jeff aaron
> 004-16-4077 jane anne doe
> 172-30-6069 michael peters
> 059-85-1062 leroy baker
>
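For readers following the GatherBy one-liner in the reply above, here is a minimal sketch that unpacks it step by step. The names sample, rows, groups and unique are illustrative only and do not appear in the original post; sample is a three-record stand-in with the same two-column layout as ssnNames.

```mathematica
(* Hypothetical three-record sample: first list = SSNs, second list = names *)
sample = {{"004-16-4077", "014-27-9076", "004-16-4077"},
          {"jane doe", "mike smith", "jane a. doe"}};

rows = Transpose[sample];        (* turn the two columns into {ssn, name} rows *)
groups = GatherBy[rows, First];  (* one sublist per distinct SSN, in order of first appearance *)
unique = groups[[All, 1]];       (* keep the first row of each group *)

Transpose[unique]
(* {{"004-16-4077", "014-27-9076"}, {"jane doe", "mike smith"}} *)
```

Comparing Length[rows] with Length[unique] is a quick sanity check on how many duplicate records were dropped (3 versus 2 here, and 12 versus 10 for the table in the post).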