
Using ctid column changes plan drastically

From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Using ctid column changes plan drastically
Date: 2012-07-24 10:13:09
Message-ID: julsfr$otg$1@dough.gmane.org
Lists: pgsql-performance
Hi,

I was testing a query to delete duplicates to see how well using ctid works if the table doesn't have a unique identifier available.

The table definition is:

create table dupes
(
    id integer primary key,
    first_name text,
    last_name text
);

My test table has 100,000 rows, with only ~13,000 distinct (first_name, last_name) combinations.
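The original post does not show how the test data was generated; one hypothetical way to build a table of that shape (100,000 rows collapsing to ~13,000 distinct name pairs) would be:

```sql
-- Assumed data-generation step, not from the original post:
-- every id maps onto one of ~13,000 name pairs, so each pair
-- appears roughly 7-8 times.
INSERT INTO dupes (id, first_name, last_name)
SELECT i,
       'first_' || (i % 13000),
       'last_'  || (i % 13000)
FROM generate_series(1, 100000) AS s(i);
```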

The following statement:

DELETE FROM dupes
WHERE id NOT IN (SELECT min(b.id)
                  FROM   dupes b
                  GROUP BY first_name, last_name
                  HAVING count(*) > 1);

produces quite a nice execution plan:

Delete on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=299.809..299.809 rows=0 loops=1)
   Buffers: shared hit=88100
   ->  Seq Scan on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=150.113..211.340 rows=86860 loops=1)
         Output: dupes.ctid
         Filter: (NOT (hashed SubPlan 1))
         Buffers: shared hit=1240
         SubPlan 1
           ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=18) (actual time=115.739..143.004 rows=13140 loops=1)
                 Output: min(b.id), b.first_name, b.last_name
                 Filter: (count(*) > 1)
                 Buffers: shared hit=620
                 ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=18) (actual time=0.006..15.563 rows=100000 loops=1)
                       Output: b.id, b.first_name, b.last_name
                       Buffers: shared hit=620
Total runtime: 301.241 ms

Now assume the table does not have a unique column available; in that case I would fall back to using the ctid to identify individual rows:

DELETE FROM dupes
WHERE ctid NOT IN (SELECT min(b.ctid)
                    FROM   dupes b
                    GROUP BY first_name, last_name
                    HAVING count(*) > 1);

Which has a completely different execution plan:

Delete on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=269966.623..269966.623 rows=0 loops=1)
   Buffers: shared hit=88720
   ->  Seq Scan on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=176.107..269582.651 rows=86860 loops=1)
         Output: dupes.ctid
         Filter: (NOT (SubPlan 1))
         Buffers: shared hit=1240
         SubPlan 1
           ->  Materialize  (cost=2620.00..2795.00 rows=10000 width=20) (actual time=0.002..0.799 rows=12277 loops=100000)
                 Output: (min(b.ctid)), b.first_name, b.last_name
                 Buffers: shared hit=620
                 ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=20) (actual time=131.162..164.941 rows=13140 loops=1)
                       Output: min(b.ctid), b.first_name, b.last_name
                       Filter: (count(*) > 1)
                       Buffers: shared hit=620
                       ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=20) (actual time=0.005..29.531 rows=100000 loops=1)
                             Output: b.ctid, b.first_name, b.last_name
                             Buffers: shared hit=620
Total runtime: 269968.515 ms

This is Postgres 9.1.4, 64-bit, on Windows 7.

Why does using the ctid column change the plan so drastically?
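(Editorial note on the plans above: the `id` version shows `Filter: (NOT (hashed SubPlan 1))`, i.e. the subquery result is loaded into an in-memory hash table and probed once per outer row. The ctid version instead shows a plain `SubPlan 1` with a Materialize node executed `loops=100000` times, i.e. the ~12,000-row subquery result is rescanned for every single outer row. This is presumably because the `tid` type lacks the hash support the planner needs to build a hashed subplan in this release. A commonly suggested rewrite that avoids `NOT IN` on ctid entirely is a self-join delete; this is a sketch, not from the original post:)

```sql
-- For each (first_name, last_name) group, keep only the row with the
-- smallest ctid and delete every other physical row. Note that rows
-- with NULL names never match the join condition and are kept.
DELETE FROM dupes a
USING dupes b
WHERE a.first_name = b.first_name
  AND a.last_name  = b.last_name
  AND a.ctid > b.ctid;
```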

Regards
Thomas


