Re: Removing duplicates

From: "Josh Berkus" <josh(at)agliodbs(dot)com>
To: Matthew Hagerty <matthew(at)brwholesale(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Removing duplicates
Date: 2002-02-26 16:39:26
Message-ID: web-810507@davinci.ethosmedia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Matt,

> I have a customer database (name, address1, address2, city, state,
> zip) and I need a query (or two) that will give me a mailing list
> with the least amount of duplicates possible. I know that precise
> matching is not possible, i.e. "P.O. Box 123" will never match "PO
> Box 123" without some data massaging, but if I can isolate even 50%
> of any duplicates, that would help greatly.

From the sound of things, you are trying to get out a mailing with the
least number of duplicates you can in a limited time, rather than
trying to clean up the list for permanent storage. Chances are, you
bought or traded this list from an outside source, yes?

In that case, here's some quick de-duplication tricks from my
fundraising days:

1. Compare text columns with the punctuation stripped out. It's
amazing how many typographical differences come down to punctuation.
FYI, in Roberto Mello's function catalog (accessable from
http://techdocs.postgresql.org/ ) I believe that I posted some
punctuation-stripping PL/pgSQL procedures. For further
de-duplication, compare only the left 15 characters of a text field
(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this can
cause you to weed out some non-duplicates, such as multiple residents
of large apartment buildings.

2. For box office lists, you can use phonetic formulas to compare
personal names (NOT addresses). I believe that Soundex and Metaphone
are included in /contrib these days. This does not work well on
Chinese or Southeast Asian names.

3. If you got phone numbers along with the addresses, these are an
excellend guage of uniqueness.

> Also, any suggestions on which parameters to check the duplicates
> for? My first thoughts were to make sure there were no two
> addresses the same in the same zip code. Any insight (or examples)
> would be greatly appreciated.

Just don't forget that some of the zip codes will probably be
erroneous.

-Josh

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Jean-Luc Lachance 2002-02-26 16:41:32 Re: Sort problem
Previous Message Jeff Self 2002-02-26 16:19:36 Re: Removing duplicates