Re: Removing duplicates

From: gnews84(at)hotmail(dot)com (OV)
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Removing duplicates
Date: 2002-02-27 04:52:49
Message-ID: a03f6b0d.0202262052.794fa621@posting.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Besides the obvious problems that you mention such as punctuation
spelling etc., there are also a few other things that you should keep
in mind:
- A lot of your customers may have moved in the past 8 years. In fact
I was reading somewhere that about 10% of US residents move on an
average of once a year. Especially apartment addresses in your
database are very unreliable.
- Phone numbers are even more unreliable because not only that they
change when a person moves but they also change when an area code
split occurs. Also in many cases people may give you a work phone.
-Women change Last Names when they marry, divorce etc.

Based on the budget and the number of records in you database you can
do different things to improve your data quality:
If you have the money to spend and the database is large, you can buy
some commercially available products such as First Logic’s ACE
http://www.firstlogic.com/home.asp
or Group One's "Code-1 Plus"
http://www.g1.com/g1intro.html
These products have a database of all addresses in the US standardized
by USPS standards. They can do "soft matches" on addresses for a given
zip code and then standardize the address. Or they can add zip and zip
+ 4 based on you street address and city. Or they can match state with
zip, county with zip, town with zip etc.
On top of that, you may want to NCOA your database. NCOA stands for
National change of address and it is a huge database that USPS keeps
with every address change that people made in the past (I am not sure
how far back it goes). You can find many 3rd party vendors that will
NCOA your database. Just go to goggle and do a search under "NCOA".
Be aware that the software can cost at least $20,000 and the NCOA may
cost 1 cent per address.
A cheaper option for a one time job would be to standardize and NCOA
you database through 3rd parties. I am not sure what it will cost but
if I find more info I will post it here.

The third option for improving your data quality is to use some
inexpensive third party databases to match your zip code with area
code and state. For example look at:
http://www.zipinfo.com/

In order to match customers you have to have an idea of how many
records you have on each zip code. If the number is small (less then
200-300) you can match on zip code, primary range and last name.
Primary range is the number before the address. (For example in
"1234 W. Main Street", primary range is 1234. In terms of SQL you need
to match on first 4 characters of Address1 field. If customers are
concentrated in one or a few zip codes then you probably need to match
on full address1.

Hope this helps.

Tony

PS. Did I tell you that I do consulting on data cleansing and
Merge/Purge :-)
gnews84(at)hotmail(dot)com

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Christopher Kings-Lynne 2002-02-27 05:23:56 Re: Left Outer Join Question
Previous Message Edward Murray 2002-02-27 04:18:29 Left Outer Join Question