MathGroup Archive 2010

Re: Database Challenge

  • To: mathgroup at
  • Subject: [mg106145] Re: [mg106106] Database Challenge
  • From: Bob Hanlon <hanlonr at>
  • Date: Sat, 2 Jan 2010 05:08:53 -0500 (EST)
  • Reply-to: hanlonr at

Your data must be read in as strings (unquoted, an entry such as 025-60-4044 would be evaluated as arithmetic):

data = {{"025-60-4044", "joe average"},
   {"004-16-4077", "jane doe"},
   {"014-27-9076", "mike smith"},
   {"098-43-2098", "rodolfo pilas"},
   {"073-15-6005", "gustavo boksar"},
   {"004-16-4077", "jane a. doe"},
   {"147-79-9074", "bea busaniche"},
   {"165-63-0189", "pablo medrano"},
   {"124-96-7092", "jeff aaron"},
   {"004-16-4077", "jane anne doe"},
   {"172-30-6069", "michael peters"},
   {"059-85-1062", "leroy baker"}};
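
If the records start out as plain text lines like those in the original question, one possible way to get them into this string form is sketched below. The names lines and parsed are hypothetical, and the sketch assumes each line holds an SSN, then a run of whitespace, then the name.

```mathematica
(* hypothetical sample: two raw text lines in the layout of the original post *)
lines = {"025-60-4044       joe average",
   "004-16-4077       jane doe"};

(* split each line at the first run of whitespace only, into at most 2 pieces,
   so that multi-word names stay together as one string *)
parsed = StringSplit[#, Whitespace, 2] & /@ lines
```

Both fields stay strings, so the SSN is never interpreted as a number or an expression.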

Here are some of the ways to remove the duplicates:

Union[data, SameTest -> (#1[[1]] == #2[[1]] &)]

DeleteDuplicates[data, #1[[1]] == #2[[1]] &]

First /@ GatherBy[data, First]

First /@ SplitBy[SortBy[data, First], First]
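
As a rough check, the four expressions above can be run side by side on a small list. This is only a sketch: sample and results are hypothetical names, and sample is a shortened stand-in for data with one duplicated SSN.

```mathematica
(* hypothetical 4-row sample with one duplicated SSN *)
sample = {{"025-60-4044", "joe average"},
   {"004-16-4077", "jane doe"},
   {"014-27-9076", "mike smith"},
   {"004-16-4077", "jane a. doe"}};

results = {
   Union[sample, SameTest -> (#1[[1]] == #2[[1]] &)],
   DeleteDuplicates[sample, #1[[1]] == #2[[1]] &],
   First /@ GatherBy[sample, First],
   First /@ SplitBy[SortBy[sample, First], First]};

(* each result should contain 3 records *)
Length /@ results

(* every approach should keep the same set of SSNs *)
SameQ @@ (Sort[#[[All, 1]]] & /@ results)
```

The results need not be identical: the Union and SortBy versions return records sorted by SSN, while DeleteDuplicates and GatherBy keep the order of first appearance, and which of the duplicate names survives may differ between approaches. For a very large list, the two versions that apply a pairwise test are likely to scale poorly compared with GatherBy. In Mathematica 10 or later (an assumption, since that release postdates this thread), DeleteDuplicatesBy[sample, First] is another option.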

Bob Hanlon

---- Nicholas Kormanik <nkormanik at> wrote:

There are 12 records in this mini database, in two columns.  The first
column holds Social Security numbers.  The second column holds names.
Unfortunately Jane Doe appears three times, with three different
versions of her name, but with the same Social Security number.

Challenge:  Remove the duplicates, where the Social Security number is
the same, and keep any one of the names.  The final result will be
whittled down to 10 records.

(Real life problem has 6.5 million records, and lots of duplicates,
with various versions of names.)

025-60-4044       joe average
004-16-4077       jane doe
014-27-9076       mike smith
098-43-2098       rodolfo pilas
073-15-6005       gustavo boksar
004-16-4077       jane a. doe
147-79-9074       bea busaniche
165-63-0189       pablo medrano
124-96-7092       jeff aaron
004-16-4077       jane anne doe
172-30-6069       michael peters
059-85-1062       leroy baker
