MathGroup Archive 2010

Re: Database Challenge

  • To: mathgroup at smc.vnet.net
  • Subject: [mg106128] Re: Database Challenge
  • From: dh <dh at metrohm.com>
  • Date: Sat, 2 Jan 2010 05:05:34 -0500 (EST)
  • References: <hhkj95$56r$1@smc.vnet.net>

Hi Nicholas,
You can do this with Union and a SameTest that compares only the first element of each record (the social security number):
dat = {
   {"025-60-4044", "joe average"},
   {"004-16-4077", "jane doe"},
   {"014-27-9076", "mike smith"},
   {"098-43-2098", "rodolfo pilas"},
   {"073-15-6005", "gustavo boksar"},
   {"004-16-4077", "jane a. doe"},
   {"147-79-9074", "bea busaniche"},
   {"165-63-0189", "pablo medrano"},
   {"124-96-7092", "jeff aaron"},
   {"004-16-4077", "jane anne doe"},
   {"172-30-6069", "michael peters"},
   {"059-85-1062", "leroy baker"}
   };
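(* Two records count as duplicates when their first elements match; Union also sorts the result *)
Union[dat, SameTest -> (#1[[1]] === #2[[1]] &)]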
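
For the full 6.5 million records, Union with a custom SameTest may be slow, because it can need a large number of pairwise comparisons. A minimal sketch of an alternative, assuming Mathematica 7 or later (where GatherBy is available): group the records by their first element and keep the first record of each group. The names sample, byKey and dedup are only illustrative and are not part of the original problem.

```mathematica
(* Small sample with one repeated key; same shape as dat *)
sample = {
   {"025-60-4044", "joe average"},
   {"004-16-4077", "jane doe"},
   {"004-16-4077", "jane a. doe"},
   {"014-27-9076", "mike smith"}
   };

(* Group records that share a first element (the social security number) *)
byKey = GatherBy[sample, First];

(* Keep the first record of each group; groups appear in order of first occurrence *)
dedup = byKey[[All, 1]]
```

Unlike Union, this leaves the surviving records in their original order and keeps the first name seen for each number.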

Daniel


On 1 Jan., 11:37, Nicholas Kormanik <nkorma... at gmail.com> wrote:
> There are 12 records in this mini database.  Two columns.  First
> column are social security numbers.  Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge:  Remove the duplicates, where social security is the same,
> and keep any one of the names.  Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
> 025-60-4044       joe average
> 004-16-4077       jane doe
> 014-27-9076       mike smith
> 098-43-2098       rodolfo pilas
> 073-15-6005       gustavo boksar
> 004-16-4077       jane a. doe
> 147-79-9074       bea busaniche
> 165-63-0189       pablo medrano
> 124-96-7092       jeff aaron
> 004-16-4077       jane anne doe
> 172-30-6069       michael peters
> 059-85-1062       leroy baker


