Quick Links

A DISTINCT problem removing duplicates

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	PostgreSQL SQL List <pgsql-sql(at)postgresql(dot)org>
Subject:	A DISTINCT problem removing duplicates
Date:	2008-12-09 09:46:07
Message-ID:	493E3E5F.6090304@archonet.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-sql

The scenario is - a list of documents, some of which may be (near)
duplicates of others, one document being in many duplicate-sets and a
duplicate-set containing many documents.

We want to see a list with only one document (any one) from each
duplicate set. There's an example script attached.

So:

documents (docid SERIAL, title text, PRIMARY KEY (docid));
duplicates (docid int REFERENCES documents, dup_set SERIAL,
PRIMARY KEY (docid, dup_set));

This allows one document to be part of multiple duplicate sets, but
that's fine - this is a "fuzzy" match. If two documents match and one of
them is already in the duplicates table then I add the second with the
same dup_set value. If neither are present, generate a new set number.
Match documents in a well-defined order and it's all nice and simple.

A self-join on the duplicates table gives me a count of how many
duplicates each document has. A left-join from the documents table can
list documents and if/how many duplicates they have. The problem comes
when I don't want to see duplicates:

SELECT DISTINCT ON (dup_set)
ds.dup_set, d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
FROM
documents d
LEFT JOIN
(
SELECT dup1.docid, dup1.dup_set, count(*) - 1 AS num_dups
FROM duplicates dup1
JOIN duplicates dup2 USING (dup_set)
GROUP BY dup1.docid, dup1.dup_set
) ds
USING (docid)
ORDER BY dup_set, docid
;

Documents without duplicates have a NULL dup_set. The DISTINCT ON
considers two nulls to be equal, which means we only ever see one
unduplicated document.

I've got two work-arounds. The first is to create a separate sequence
that doesn't overlap with dup_set's values and use that:

CREATE SEQUENCE not_duplicate_seq MINVALUE -999999 MAXVALUE -1 CYCLE;

SELECT DISTINCT ON (dup_set)
COALESCE(dup_set, nextval('not_duplicate_seq')) AS dup_set,
d.docid, d.title, COALESCE(ds.num_dups, 0) AS num_dups
...

That works, but is arguably a bit too "clever" if you know what I mean.

The other alternative is to separate duplicated and non-duplicated
documents and UNION them. That's simple enough to see what's happening
but does seem ugly.

Anyone got anything more elegant? I'm happy to alter the duplicates
table so long as it doesn't make it complicated to update.

--
Richard Huxton
Archonet Ltd

Attachment	Content-Type	Size
distinct_dup_prob.sql	text/x-sql	2.7 KB

Responses

Re: A DISTINCT problem removing duplicates at 2008-12-09 13:50:47 from Tom Lane

Browse pgsql-sql by date

	From	Date	Subject
Next Message	ivan marchesini	2008-12-09 10:11:21	store pdf files
Previous Message	Scott Marlowe	2008-12-08 21:34:54	Re: Best way to restrict detail rows?