MathGroup Archive 2010

[Date Index] [Thread Index] [Author Index]

Search the Archive

Re: Database Challenge

  • To: mathgroup at
  • Subject: [mg106127] Re: Database Challenge
  • From: Bill Rowe <readnews at>
  • Date: Sat, 2 Jan 2010 05:05:22 -0500 (EST)

On 1/1/10 at 5:37 AM, nkormanik at (Nicholas Kormanik) wrote:

>There are 12 records in this mini database.  Two columns.  First
>column are social security numbers.  Second column are names.
>Unfortunately Jane Doe appears three times, with three different
>versions of her name, but having the same social security number.

>Challenge:  Remove the duplicates, where social security is the
>same, and keep any one of the names.  Final result will be whittled
>down to 10 records.

>(Real life problem has 6.5 million records, and lots of duplicates,
>with various versions of names.)

>025-60-4044       joe average
>004-16-4077       jane doe
>014-27-9076       mike smith
>098-43-2098       rodolfo pilas
>073-15-6005       gustavo boksar
>004-16-4077       jane a. doe
>147-79-9074       bea busaniche
>165-63-0189       pablo medrano
>124-96-7092       jeff aaron
>004-16-4077       jane anne doe
>172-30-6069       michael peters
>059-85-1062       leroy baker

=46irst, transform the database into a Mathematica list. This can
probably be easily accomplished by reading in the data using
Import. I say probably since you have provided no details of the
file format. In any case once the database is put into the form
of a Mathematica list such as:

{"025-60-4044","joe average"},
{"004-16-4077","jane doe"},
{"014-27-9076","mike smith"},
{"098-43-2098","rodolfo pilas"},
{"073-15-6005","gustavo boksar"},
{"004-16-4077","jane a. doe"},
{"147-79-9074","bea busaniche"},
{"165-63-0189","pablo medrano"},
{"124-96-7092","jeff aaron"},
{"004-16-4077","jane anne doe"},
{"172-30-6069","michael peters"},
{"059-85-1062","leroy baker"}};

The desired result can be achieved in one line:

In[2]:= First /@ GatherBy[data, First]

Out[2]= {{"025-60-4044", "joe average"}, {"004-16-4077",
   "jane doe"}, {"014-27-9076", "mike smith"}, {"098-43-2098",
   "rodolfo pilas"}, {"073-15-6005", "gustavo boksar"}, {"147-79-9074",
    "bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"124-96-7092",
    "jeff aaron"}, {"172-30-6069", "michael peters"}, {"059-85-1062",
   "leroy baker"}}

Here, I've assumed you are using version 7 of Mathematica. If
not, you will need to sort the data base by the first item,
break them into sets using Split then take the first. Something like:

In[3]:= First /@
  Split[data[[Ordering[First /@ data]]],
   StringMatchQ[#1[[1]], #2[[1]]] &]

Out[3]= {{"004-16-4077", "jane doe"}, {"014-27-9076",
   "mike smith"}, {"025-60-4044", "joe average"}, {"059-85-1062",
   "leroy baker"}, {"073-15-6005", "gustavo boksar"}, {"098-43-2098",
   "rodolfo pilas"}, {"124-96-7092", "jeff aaron"}, {"147-79-9074",
   "bea busaniche"}, {"165-63-0189", "pablo medrano"}, {"172-30-6069",
   "michael peters"}}

  • Prev by Date: Re: Re: Re: Re: algebraic numbers
  • Next by Date: Re: Re: algebraic numbers
  • Previous by thread: Re: Database Challenge
  • Next by thread: Re: Replace and ReplaceAll -- simple application