Re: Duplicate data within tables

From: mike(at)if-then-else(dot)pl
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Duplicate data within tables
Date: 2010-12-06 14:43:06
Message-ID: 20101206154306.1al40zrleogssk88@horde.iq.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Quoting Machiel Richards <machielr(at)rdc(dot)co(dot)za>:

> I am trying to find out how you can identify duplicate data
> within a postgresql database.

Machiel,

The problem with duplicates can cause dozens of problems with the integrity
and sanity of the data. There is no simple trick and you need to be very
careful. The more careful the more complex data you work with. Let me
show you a simple example, so you get the basic idea. Let's say we have a
table "users" with columns: "id", "name" and "email" and some duplicated
emails in that table, which we want to get rid of:

SELECT * FROM users;

id | name | email
----+---------+---------------------
1 | john | john(at)example(dot)com
2 | joseph | joseph(at)example(dot)com
3 | johnny | john(at)example(dot)com
4 | mike | michael(at)example(dot)com
5 | admin | admin(at)example(dot)com
6 | michael | michael(at)example(dot)com
7 | foo | john(at)example(dot)com

To identify duplicates (email with count > 1):

SELECT email, count(1) FROM users GROUP BY email ORDER BY count DESC;

email | count
---------------------+-------
john(at)example(dot)com | 3
michael(at)example(dot)com | 2
joseph(at)example(dot)com | 1
admin(at)example(dot)com | 1

To create a table without duplicates:

CREATE TABLE distinct_users (
id serial PRIMARY KEY,
name text,
email text UNIQUE
);

INSERT INTO distinct_users (name, email) SELECT min(name), email FROM users
GROUP BY email;

SELECT * FROM distinct_users;

id | name | email
----+---------+---------------------
1 | joseph | joseph(at)example(dot)com
2 | michael | michael(at)example(dot)com
3 | foo | john(at)example(dot)com
4 | admin | admin(at)example(dot)com

To select only the duplicates:

SELECT * FROM users WHERE name NOT IN (SELECT name FROM distinct_users);

id | name | email
----+--------+---------------------
1 | john | john(at)example(dot)com
3 | johnny | john(at)example(dot)com
4 | mike | michael(at)example(dot)com

As I said before, there is no simple trick that would resolve the duplicates
problem in 974-table database. It is not only a design problem. Since the
database is full of date, it became a content problem as well (if not
mainly). You do not want to loose part of your data, do you? So you need
to recreate the logic in more strict way and then put all your existing
data into it. Esentially you have to take the duplicated data, think of it
and pick the most important part of it and then define as many one-to-many
relations, as you need. Let's get back to our "users" table and let's say
the most significant information is the email. We already created the
"distinct_users" table with UNIQUE emails. We picked one "name", now we
need to work on the rest of the information a little bit. That would be:
former id's and additional names of the user used in the past.

ALTER TABLE users ADD COLUMN new_id integer;
UPDATE users SET new_id = (SELECT distinct_users.id FROM distinct_users
WHERE distinct_users.email = users.email);

SELECT * FROM users ORDER BY new_id;

id | name | email | new_id
----+---------+---------------------+--------
2 | joseph | joseph(at)example(dot)com | 1
4 | mike | michael(at)example(dot)com | 2
6 | michael | michael(at)example(dot)com | 2
1 | john | john(at)example(dot)com | 3
7 | foo | john(at)example(dot)com | 3
3 | johnny | john(at)example(dot)com | 3
5 | admin | admin(at)example(dot)com | 4

ALTER TABLE users ADD COLUMN default_name_now boolean DEFAULT FALSE;
UPDATE users SET default_name_now = (SELECT users.name = distinct_users.name
FROM distinct_users WHERE users.new_id = distinct_users.id);

SELECT * FROM users ORDER BY new_id;

id | name | email | new_id | default_name_now
----+---------+---------------------+--------+------------------
2 | joseph | joseph(at)example(dot)com | 1 | t
4 | mike | michael(at)example(dot)com | 2 | f
6 | michael | michael(at)example(dot)com | 2 | t
1 | john | john(at)example(dot)com | 3 | f
7 | foo | john(at)example(dot)com | 3 | t
3 | johnny | john(at)example(dot)com | 3 | f
5 | admin | admin(at)example(dot)com | 4 | t

Here is how you can regain some control over the mess and improve your
ability to manage the data. I hope, these examples will help you solve the
problem.

Good luck!

-Mike

--
Michał Roszka
mike(at)if-then-else(dot)pl

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Tom Lane 2010-12-06 15:33:36 Re: C Function returning a tuple with a float4 array as column
Previous Message Tim 2010-12-06 13:44:36 Re: C Function returning a tuple with a float4 array as column