MathGroup Archive 2010

Re: Database Challenge

  • To: mathgroup at smc.vnet.net
  • Subject: [mg106136] Re: [mg106106] Database Challenge
  • From: DrMajorBob <btreat1 at austin.rr.com>
  • Date: Sat, 2 Jan 2010 05:07:07 -0500 (EST)
  • References: <201001011037.FAA05389@smc.vnet.net>
  • Reply-to: drmajorbob at yahoo.com

That's a trivial version of the REAL problem. Suppose the same social
occurs with names that aren't similar. Further, suppose that when the
social is the same and the names ARE similar, each record includes useful
information the others do not: one has a telephone number, another has an
address, etc. Suppose (approximately) the same name and address sometimes
appear with different socials and phone numbers. Suppose all these
problems occur frequently.

Good luck with that!
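
One piece of that harder problem, merging records that share a social but each carry different fields, might be sketched as follows. The four-column layout, the sample values, and the helper names firstKnown and mergeGroup are all hypothetical, made up for illustration. Nothing here addresses similar names appearing under different socials.

```mathematica
(* Hypothetical records: {social, name, phone, address}; Missing[] marks an absent field *)
records = {
   {"004-16-4077", "jane doe", "555-0100", Missing[]},
   {"004-16-4077", "jane a. doe", Missing[], "12 Elm St"},
   {"014-27-9076", "mike smith", Missing[], Missing[]}};

(* Hypothetical helper: first value in a column that is not Missing[], or Missing[] if there is none *)
firstKnown[column_] :=
  Replace[DeleteCases[column, _Missing], {{} -> Missing[], {x_, ___} :> x}]

(* Hypothetical helper: collapse one group of records into a single record, column by column *)
mergeGroup[group_] := firstKnown /@ Transpose[group]

(* Group by social (first column), then merge each group *)
merged = mergeGroup /@ GatherBy[records, First]

(* {{"004-16-4077", "jane doe", "555-0100", "12 Elm St"},
    {"014-27-9076", "mike smith", Missing[], Missing[]}} *)
```

Taking the first known value per column is an arbitrary choice; when two records disagree on a field, something smarter would be needed.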

For the simpler problem you've stated, this does the trick:

data = {{"025-60-4044", "joe average"},
    {"004-16-4077", "jane doe"},
    {"014-27-9076", "mike smith"},
    {"098-43-2098", "rodolfo pilas"},
    {"073-15-6005", "gustavo boksar"},
    {"004-16-4077", "jane a.doe"},
    {"147-79-9074", "bea busaniche"},
    {"165-63-0189", "pablo medrano"},
    {"124-96-7092", "jeff aaron"},
    {"004-16-4077", "jane anne doe"},
    {"172-30-6069", "michael peters"},
    {"059-85-1062", "leroy baker"}};
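(* Sort so equal socials are adjacent, split into runs by social, keep the first record of each run *)
SplitBy[Sort@data, First][[All, 1]]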

{{"004-16-4077", "jane a.doe"},
 {"014-27-9076", "mike smith"},
 {"025-60-4044", "joe average"},
 {"059-85-1062", "leroy baker"},
 {"073-15-6005", "gustavo boksar"},
 {"098-43-2098", "rodolfo pilas"},
 {"124-96-7092", "jeff aaron"},
 {"147-79-9074", "bea busaniche"},
 {"165-63-0189", "pablo medrano"},
 {"172-30-6069", "michael peters"}}
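
If the original record order matters, a variant without the sort can be sketched with GatherBy, which returns groups in order of first appearance. This is an alternative to the SplitBy line, not a correction of it: it keeps the first name seen for each social rather than the one that sorts first. The name sample is a hypothetical subset of the data, used only to keep the example short.

```mathematica
(* Hypothetical subset of the data above *)
sample = {{"025-60-4044", "joe average"},
    {"004-16-4077", "jane doe"},
    {"014-27-9076", "mike smith"},
    {"004-16-4077", "jane a.doe"},
    {"004-16-4077", "jane anne doe"}};

(* Group records by social, keeping input order, then take the first record of each group *)
GatherBy[sample, First][[All, 1]]

(* {{"025-60-4044", "joe average"}, {"004-16-4077", "jane doe"},
    {"014-27-9076", "mike smith"}} *)
```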

Bobby

On Fri, 01 Jan 2010 04:37:54 -0600, Nicholas Kormanik  
<nkormanik at gmail.com> wrote:

>
> There are 12 records in this mini database.  Two columns.  First
> column are social security numbers.  Second column are names.
> Unfortunately Jane Doe appears three times, with three different
> versions of her name, but having the same social security number.
>
> Challenge:  Remove the duplicates, where social security is the same,
> and keep any one of the names.  Final result will be whittled down to
> 10 records.
>
> (Real life problem has 6.5 million records, and lots of duplicates,
> with various versions of names.)
>
>
> 025-60-4044       joe average
> 004-16-4077       jane doe
> 014-27-9076       mike smith
> 098-43-2098       rodolfo pilas
> 073-15-6005       gustavo boksar
> 004-16-4077       jane a. doe
> 147-79-9074       bea busaniche
> 165-63-0189       pablo medrano
> 124-96-7092       jeff aaron
> 004-16-4077       jane anne doe
> 172-30-6069       michael peters
> 059-85-1062       leroy baker
>
>
>
>


-- 
DrMajorBob at yahoo.com

