Re: Removing duplicates

From: Jeff Self <jself(at)nngov(dot)com>
To: Matthew Hagerty <matthew(at)brwholesale(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Removing duplicates
Date: 2002-02-26 16:19:36
Message-ID: 1014740376.12600.20.camel@personnel_test
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Tue, 2002-02-26 at 10:10, Matthew Hagerty wrote:
> Greetings,
>
> I have a customer database (name, address1, address2, city, state, zip) and
> I need a query (or two) that will give me a mailing list with the least
> amount of duplicates possible. I know that precise matching is not
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some
> data massaging, but if I can isolate even 50% of any duplicates, that would
> help greatly.
It depends on which fields you are wanting to check for. If you are
referring to just addresses then, you can do a query using 'Distinct' on
address1. This will give you only one record where address1 is P.O. Box
123. Of course this doesn't help with those that are PO Box 123. You
could always dump your data and write some Perl scripts to alter the
addresses so that they become alike. You could also use the 'Like'
keyword in your sql statement.
"Select * from customer where address1 like '%Box 123%';

After you run this query, you might want to modify the addresses so that
they are the same.

> Also, any suggestions on which parameters to check the duplicates for? My
> first thoughts were to make sure there were no two addresses the same in
> the same zip code. Any insight (or examples) would be greatly appreciated.

These constraints should be in the table definitions themselves, which
would prevent these duplications from taking place.

--
Jeff Self
Information Technology Analyst
Department of Personnel
City of Newport News
2400 Washington Ave.
Newport News, VA 23607
757-926-6930

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Josh Berkus 2002-02-26 16:39:26 Re: Removing duplicates
Previous Message Tom Lane 2002-02-26 15:55:41 Re: Timestamp output