identifying duplicates in table with redundancies

From: Tarlika Elisabeth Schmitz <postgresql(at)numerixtechnology(dot)de>
To: pgsql-sql(at)postgresql(dot)org
Subject: identifying duplicates in table with redundancies
Date: 2010-09-23 21:39:40
Message-ID: 20100923223940.5019475e@dick.coachhouse
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

I loaded data from a spread into a interim table so I can analyze the
quality of the data.

The table contains an entry for every student (250K records) and his
trainer. Eventually, I want to extract a unique list of trainers from
it. But first of all I want to check for duplicates:

1) multiples trainer names for same trainer id
2) multiple trainer ids for same trainer name

I cobbled together the SQL and it does the job but it seems rather
convoluted. I would like to know how I can improve it.

CREATE TABLE student (
id INTEGER NOT NULL,
name VARCHAR(256) NOT NULL,
trainer_id INTEGER,
trainer_name VARCHAR(256),
);

====
EXAMPLE DATA

22 John 1 Macdonald
23 Jane 1 MacDonald
24 Paul 1 MacDonald
25 Dick 2 Smith
26 Bill 3 Smith
27 Kate 3 Smith
====

-- outputs trainer ids which appear under different names
select trainer_id, trainer_name from
(
-- different id/name combinations
select distinct on (trainer_name) trainer_id, trainer_name
from student
where trainer_id in
(
-- trainer ids with appearing with different names
select distinct on (id) id
from
(
-- distinct trainer id-name
select distinct on (trainer_id,trainer_name)
trainer_id as id,
trainer_name as name from student
) as trainer
group by trainer.id
having count (trainer.name) > 1
)
) as y
order by trainer_id

--

Best Regards,
Tarlika Elisabeth Schmitz

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Nicholas I 2010-09-24 09:15:54 find and replace the string within a column
Previous Message Rob Sargent 2010-09-23 18:46:55 Re: pg_config -less