Re: A DISTINCT problem removing duplicates

From: Richard Huxton <dev(at)archonet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL SQL List <pgsql-sql(at)postgresql(dot)org>
Subject: Re: A DISTINCT problem removing duplicates
Date: 2008-12-09 16:12:47
Message-ID: 493E98FF.6090201@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Tom Lane wrote:
> Richard Huxton <dev(at)archonet(dot)com> writes:
>> Tom Lane wrote:
>>> Richard Huxton <dev(at)archonet(dot)com> writes:
>>>> Anyone got anything more elegant?
>>> Seems to me that no document should have an empty dup_set. If it's not
>>> a match to any existing document, then immediately assign a new dup_set
>>> number to it.
>
>> That was my initial thought too, but it means when I actually find a
>> duplicate I have to decide which "direction" to renumber them in.
>
> Hmm, so you mean you might decide that two docs are duplicates sometime
> after initially putting them both in the database?

Yep - checking for duplicates can be a slow process - it's O(n^2) over
the number of documents and document-comparisons are probably O(n^2)
over length (or number of similarly-sized word-runs anyway). I'm failing
comparisons as early as I can, but there's a trade-off between speed
and false negatives.

> Seems like you have
> issues with that anyway. If you already know A,B are dups and
> separately that C,D are dups, and you later decide B and C are dups,
> what do you do?

Not necessarily a problem. I'm using "duplicate" very loosely here -
it's more like "very similar to" so it's entirely possible to have sets
(a,b) (b,c) (c,d) and everything be valid just by adding sentences to
the end of each document. Similarity scoring should allow for
insertion/deletion of single words or whole (quite extensive) blocks of
text.

Of course at the moment, as I tweak what I mean by "duplicate" I have to
re-run the check over at least a sizeable chunk of the documents to see
if I prefer it.

Oh - the comparison is outside the DB at the moment, but it's based on
the stemmed tsvector of each document anyway, so it's crying out to be
pushed into the DB once I'm happy it works.

--
Richard Huxton
Archonet Ltd

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Stefano Buliani 2008-12-09 17:28:21 inconsistent automatic casting between psql and function
Previous Message Tom Lane 2008-12-09 15:39:29 Re: A DISTINCT problem removing duplicates