Re: Using ctid column changes plan drastically

From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Using ctid column changes plan drastically
Date: 2012-07-25 08:10:13
Message-ID: juo9lb$4hm$1@dough.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Tom Lane, 24.07.2012 19:12:
> Well, it would only help if you're running a PG version that's new
> enough to recognize the NOT EXISTS as an anti-join; and even then,
> it's possible that joining on a tid column forecloses enough plan
> types that you don't get any real benefit. But I'm just guessing.
> Can you show exactly what you tried and what EXPLAIN ANALYZE results
> you got?
>

I am using 9.1.4 (as I said in my initial post).

I finally found a solution that runs fine:

DELETE FROM dupes a
WHERE EXISTS (SELECT 1
FROM dupes b
WHERE b.first_name = a.first_name
AND b.last_name = a.last_name
AND b.ctid > a.ctid);

The execution plan for this is:

Delete on public.dupes a (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2419.334..2419.334 rows=0 loops=1)
Buffers: shared hit=18029
-> Merge Semi Join (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2043.674..2392.707 rows=17097 loops=1)
Output: a.ctid, b.ctid
Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = b.last_name))
Join Filter: (b.ctid > a.ctid)
Buffers: shared hit=930
-> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1024.195..1030.051 rows=75000 loops=1)
Output: a.ctid, a.first_name, a.last_name
Sort Key: a.first_name, a.last_name
Sort Method: quicksort Memory: 8870kB
Buffers: shared hit=465
-> Seq Scan on public.dupes a (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.025..23.234 rows=75000 loops=1)
Output: a.ctid, a.first_name, a.last_name
Buffers: shared hit=465
-> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1019.148..1028.483 rows=105841 loops=1)
Output: b.ctid, b.first_name, b.last_name
Sort Key: b.first_name, b.last_name
Sort Method: quicksort Memory: 8870kB
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.017..19.133 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 2420.953 ms

Which is a lot better than the plan using "WHERE ctid NOT IN (.....)":

Delete on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=582515.094..582515.094 rows=0 loops=1)
Buffers: shared hit=18027
-> Seq Scan on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=1038.164..582332.927 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (SubPlan 1))
Buffers: shared hit=930
SubPlan 1
-> Materialize (cost=1777.50..1890.00 rows=7500 width=20) (actual time=0.001..2.283 rows=35552 loops=75000)
Output: (min(b.ctid)), b.first_name, b.last_name
Buffers: shared hit=465
-> HashAggregate (cost=1777.50..1852.50 rows=7500 width=20) (actual time=90.964..120.228 rows=57903 loops=1)
Output: min(b.ctid), b.first_name, b.last_name
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 582517.711 ms

Using "WHERE id NOT IN (...)" is the fastest way:

Delete on public.dupes (cost=1871.25..3273.75 rows=37500 width=6) (actual time=187.949..187.949 rows=0 loops=1)
Buffers: shared hit=18490
-> Seq Scan on public.dupes (cost=1871.25..3273.75 rows=37500 width=6) (actual time=125.351..171.108 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (hashed SubPlan 1))
Buffers: shared hit=930
SubPlan 1
-> HashAggregate (cost=1777.50..1852.50 rows=7500 width=18) (actual time=73.131..93.421 rows=57903 loops=1)
Output: min(b.id), b.first_name, b.last_name
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000 width=18) (actual time=0.004..8.515 rows=75000 loops=1)
Output: b.id, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 189.222 ms

Regards
Thomas

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message AI Rumman 2012-07-25 08:40:33 Re: Why do I need more time with partition table?
Previous Message Aleksei Arefjev 2012-07-25 07:37:54 Re: transactions start time