Quick Links

Re: SQL for removing duplicates?

From:	<kynn(at)panix(dot)com>
To:	pgsql-novice(at)postgresql(dot)org
Subject:	Re: SQL for removing duplicates?
Date:	2006-06-13 22:22:37
Message-ID:	200606132222.k5DMMbv21040@panix3.panix.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-novice

From: Brad Nicholson <bnichols(at)ca(dot)afilias(dot)info>
Date: Jun 13, 2006 2:26 PM
Subject: Re: [NOVICE] SQL for removing duplicates?
To: kynn(at)panix(dot)com

kynn(at)panix(dot)com wrote:
> Hi. I'm stumped. I have a large table (about 8.5M records), let's
> call it t, whose columns include x and y. I want to remove records
> from this table so that any pair of values for these two fields appear
> only once. (This will get rid of about 15% of the records in t.)
>
> One simple solution would be something like
>
> CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
> DROP TABLE t;
> ALTER TABLE tmp RENAME TO t;
>
> This works, but it uses a lot of space. I would prefer to simply cull
> the unwanted records from t, but I just can't figure out the SQL for
> it. Any help with it would be *much* appreciated.

If your table is created with OIDs, this should work. If not add a
unique column to the table and use that in place of oid.

DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT
max(t3.oid) from t t3 group by t3.x, t3.y);

Also note, the query plan for this is going to be very ugly, it might
very well be cheaper to use the solution that you initially mentioned.

Thanks! That's just what I was looking for.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.

------=_Part_9560_6324061.1150236298505
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

---------- Forwarded message ---------- From: Brad Nicholson <<a href="mailto:bnichols(at)ca(dot)afilias(dot)info">bnichols(at)ca(dot)afilias(dot)info</a>> Date: Jun 13, 2006 2:26 PM
 Subject: Re: [NOVICE] SQL for removing duplicates? To: <a href="mailto:kynn(at)panix(dot)com">kynn(at)panix(dot)com</a> Cc: <a href="mailto:pgsql-novice(at)postgresql(dot)org">pgsql-novice(at)postgresql(dot)org</a> <a href="mailto:kynn(at)panix(dot)com">
kynn(at)panix(dot)com</a> wrote: > Hi.  I'm stumped.  I have a large table (about 8.5M records), let's > call it t, whose columns include x and y.  I want to remove records > from this table so that any pair of values for these two fields appear
 > only once.  (This will get rid of about 15% of the records in t.) > > One simple solution would be something like > >   CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t; >   DROP TABLE t;
 >   ALTER TABLE tmp RENAME TO t; > > This works, but it uses a lot of space.  I would prefer to simply cull > the unwanted records from t, but I just can't figure out the SQL for > it.  Any help with it would be *much* appreciated.
 If your table is created with OIDs, this should work.  If not add a unique column to the table and use that in place of oid. DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT max(t3.oid
) from t t3 group by t3.x, t3.y); Also note, the query plan for this is going to be very ugly, it might very well be cheaper to use the solution that you initially mentioned. -- Brad Nicholson  416-673-4106
 Database Administrator, Afilias Canada Corp.

------=_Part_9560_6324061.1150236298505--

In response to

Re: SQL for removing duplicates? at 2006-06-13 18:26:04 from Brad Nicholson

Browse pgsql-novice by date

	From	Date	Subject
Next Message	Tom Allison	2006-06-14 00:19:22	ruby
Previous Message	Brad Nicholson	2006-06-13 18:26:04	Re: SQL for removing duplicates?