Re: SQL for removing duplicates?

From: <kynn(at)panix(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: SQL for removing duplicates?
Date: 2006-06-13 22:22:37
Message-ID: 200606132222.k5DMMbv21040@panix3.panix.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

From: Brad Nicholson <bnichols(at)ca(dot)afilias(dot)info>
Date: Jun 13, 2006 2:26 PM
Subject: Re: [NOVICE] SQL for removing duplicates?
To: kynn(at)panix(dot)com

kynn(at)panix(dot)com wrote:
> Hi. I'm stumped. I have a large table (about 8.5M records), let's
> call it t, whose columns include x and y. I want to remove records
> from this table so that any pair of values for these two fields appear
> only once. (This will get rid of about 15% of the records in t.)
>
> One simple solution would be something like
>
> CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;
> DROP TABLE t;
> ALTER TABLE tmp RENAME TO t;
>
> This works, but it uses a lot of space. I would prefer to simply cull
> the unwanted records from t, but I just can't figure out the SQL for
> it. Any help with it would be *much* appreciated.

If your table is created with OIDs, this should work. If not add a
unique column to the table and use that in place of oid.

DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT
max(t3.oid) from t t3 group by t3.x, t3.y);

Also note, the query plan for this is going to be very ugly, it might
very well be cheaper to use the solution that you initially mentioned.

Thanks! That's just what I was looking for.

kj

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.

------=_Part_9560_6324061.1150236298505
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

<br><br>---------- Forwarded message ----------<br><span class="gmail_quote">From: <b class="gmail_sendername">Brad Nicholson</b> &lt;<a href="mailto:bnichols(at)ca(dot)afilias(dot)info">bnichols(at)ca(dot)afilias(dot)info</a>&gt;<br>Date: Jun 13, 2006 2:26 PM
<br>Subject: Re: [NOVICE] SQL for removing duplicates?<br>To: <a href="mailto:kynn(at)panix(dot)com">kynn(at)panix(dot)com</a><br>Cc: <a href="mailto:pgsql-novice(at)postgresql(dot)org">pgsql-novice(at)postgresql(dot)org</a><br><br></span><a href="mailto:kynn(at)panix(dot)com">
kynn(at)panix(dot)com</a> wrote:<br>&gt; Hi.&nbsp;&nbsp;I'm stumped.&nbsp;&nbsp;I have a large table (about 8.5M records), let's<br>&gt; call it t, whose columns include x and y.&nbsp;&nbsp;I want to remove records<br>&gt; from this table so that any pair of values for these two fields appear
<br>&gt; only once.&nbsp;&nbsp;(This will get rid of about 15% of the records in t.)<br>&gt;<br>&gt; One simple solution would be something like<br>&gt;<br>&gt;&nbsp;&nbsp; CREATE TABLE tmp AS SELECT DISTINCT ON ( x, y ) * FROM t;<br>&gt;&nbsp;&nbsp; DROP TABLE t;
<br>&gt;&nbsp;&nbsp; ALTER TABLE tmp RENAME TO t;<br>&gt;<br>&gt; This works, but it uses a lot of space.&nbsp;&nbsp;I would prefer to simply cull<br>&gt; the unwanted records from t, but I just can't figure out the SQL for<br>&gt; it.&nbsp;&nbsp;Any help with it would be *much* appreciated.
<br><br><br>If your table is created with OIDs, this should work.&nbsp;&nbsp;If not add a<br>unique column to the table and use that in place of oid.<br><br>DELETE FROM t where oid IN (select t2.oid from t t2 EXCEPT SELECT<br>max(t3.oid
) from t t3 group by t3.x, t3.y);<br><br>Also note, the query plan for this is going to be very ugly, it might<br>very well be cheaper to use the solution that you initially mentioned.<br><br><br><br><br><br><br>--<br>Brad Nicholson&nbsp;&nbsp;416-673-4106
<br>Database Administrator, Afilias Canada Corp.<br><br>

------=_Part_9560_6324061.1150236298505--

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Tom Allison 2006-06-14 00:19:22 ruby
Previous Message Brad Nicholson 2006-06-13 18:26:04 Re: SQL for removing duplicates?