MathGroup Archive 2010

Re: Database Challenge

  • To: mathgroup at smc.vnet.net
  • Subject: [mg106130] Re: [mg106106] Database Challenge
  • From: Leonid Shifrin <lshifr at gmail.com>
  • Date: Sat, 2 Jan 2010 05:05:57 -0500 (EST)
  • References: <201001011037.FAA05389@smc.vnet.net>

Hi Nicholas,

Given your table as a nested list of two columns:

In[1]:=
ssnNames =
  {{"025-60-4044", "004-16-4077", "014-27-9076", "098-43-2098",
    "073-15-6005", "004-16-4077", "147-79-9074", "165-63-0189",
    "124-96-7092", "004-16-4077", "172-30-6069",
    "059-85-1062"}, {"joe average", "jane doe", "mike smith",
    "rodolfo pilas", "gustavo boksar", "jane a. doe", "bea busaniche",
    "pablo medrano", "jeff aaron", "jane anne doe", "michael peters",
    "leroy baker"}};

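If the records start out as lines of text rather than as this nested list, something along the following lines could build the two-column form. This is only a sketch: the name rawLines and the assumption that each line holds the ssn, a run of whitespace, and then the name are mine, not part of the original question.

```mathematica
(* Hypothetical raw input: one record per line, ssn first, then the name *)
rawLines = {"025-60-4044       joe average",
   "004-16-4077       jane doe",
   "004-16-4077       jane a. doe"};

(* Split each line at the first run of whitespace only, so that
   spaces inside the name are preserved *)
records = StringSplit[#, Whitespace, 2] & /@ rawLines;

(* Transpose from a list of {ssn, name} records to a list of two columns *)
sample = Transpose[records]
```

Limiting StringSplit to two pieces is what keeps multi-word names in one string.
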
The simplest approach is probably to use GatherBy to group the records by
ssn, and then pick the first element of each group (since you don't care
which name is kept for a given ssn):

In[2]:= Transpose[GatherBy[Transpose@ssnNames, First][[All, 1]]]

Out[2]= {{"025-60-4044", "004-16-4077", "014-27-9076", "098-43-2098",
   "073-15-6005", "147-79-9074", "165-63-0189", "124-96-7092",
  "172-30-6069", "059-85-1062"}, {"joe average", "jane doe",
  "mike smith", "rodolfo pilas", "gustavo boksar", "bea busaniche",
  "pablo medrano", "jeff aaron", "michael peters", "leroy baker"}}

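To see what each piece of that one-liner does, here is the same computation split into steps on a tiny made-up sample. The sample data and the intermediate names (records, groups, unique) are assumptions chosen purely for illustration.

```mathematica
(* Small hypothetical sample in the same two-column layout as ssnNames *)
sample = {{"111", "222", "111"}, {"ann", "bob", "ann b."}};

(* 1. Turn the two columns into a list of {ssn, name} records *)
records = Transpose[sample];

(* 2. Group records that share an ssn; groups come out in order
      of first occurrence *)
groups = GatherBy[records, First];

(* 3. Keep only the first record of each group *)
unique = groups[[All, 1]];

(* 4. Transpose back to the two-column layout *)
result = Transpose[unique]
```

For this sample, result should contain the two ssn-s "111" and "222" paired with "ann" and "bob"; the second spelling "ann b." is dropped because it shares an ssn with an earlier record.
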

Regards,
Leonid


On Fri, Jan 1, 2010 at 2:37 AM, Nicholas Kormanik <nkormanik at gmail.com> wrote:

>
> There are 12 records in this mini database.  Two columns.  First
> column are social security numbers.  Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge:  Remove the duplicates, where social security is the same,
> and keep any one of the names.  Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
>
> 025-60-4044       joe average
> 004-16-4077       jane doe
> 014-27-9076       mike smith
> 098-43-2098       rodolfo pilas
> 073-15-6005       gustavo boksar
> 004-16-4077       jane a. doe
> 147-79-9074       bea busaniche
> 165-63-0189       pablo medrano
> 124-96-7092       jeff aaron
> 004-16-4077       jane anne doe
> 172-30-6069       michael peters
> 059-85-1062       leroy baker
>

