Another unexpected planner choice in simple JOIN

From: Mark Kirkwood <mark(dot)kirkwood(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Another unexpected planner choice in simple JOIN
Date: 2026-01-09 00:12:45
Message-ID: 5874da50-c9c3-4e66-907a-a5b564eabd85@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have encountered another one of these. However managed to work out
what the issue was myself this time! However I figured it might be
interesting/useful for folk generally.

Same schema as before, with a few parameter changes to 16devel:

$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
parallel_setup_cost = '30000'
shared_buffers = '2GB'
max_wal_size = '4GB'

I don't think any of these actually effect this new query:

test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val
                FROM tab0 AS t0
                JOIN tab1 AS t1 ON (t0.id0 = t1.id0)
                WHERE t0.id0 < 1000;
    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=30041.99..2379226.30 rows=990000 width=98) (actual
time=0.512..6261.801 rows=1000000.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=18285 read=1705870
   ->  Hash Join  (cost=41.99..2250226.30 rows=412500 width=98) (actual
time=158.814..4363.611 rows=333333.33 loops=3)
         Hash Cond: (t1.id0 = t0.id0)
         Buffers: shared hit=18285 read=1705870
         ->  Parallel Seq Scan on tab1 t1 (cost=0.00..2140804.67
rows=41666667 width=98) (actual time=0.055..1830.481 rows=33333333.33
loops=3)
               Buffers: shared hit=18268 read=1705870
         ->  Hash  (cost=29.62..29.62 rows=990 width=4) (actual
time=0.234..0.235 rows=1000.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 44kB
               Buffers: shared hit=17
               ->  Index Only Scan using tab0_pkey on tab0 t0 
(cost=0.29..29.62 rows=990 width=4) (actual time=0.029..0.119
rows=1000.00 loops=3)
                     Index Cond: (id0 < 1000)
                     Heap Fetches: 0
                     Index Searches: 3
                     Buffers: shared hit=17
 Planning:
   Buffers: shared hit=222
 Planning Time: 1.031 ms
 Execution Time: 6293.675 ms
(21 rows)

This seems like a pretty horrible plan....I'm wondering why it is not
eliminating the vast majority of rows from tab0 1st in some way. After
some head scratching I wondered if *in this case* I really needed to
tell the planner I was on NVMe:

test0=# SET random_page_cost=1;
SET

test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val
                FROM tab0 AS t0
                JOIN tab1 AS t1 ON (t0.id0 = t1.id0)
                WHERE t0.id0 < 1000;
    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=30000.00..1668544.54 rows=990000 width=98) (actual
time=0.198..163.617 rows=1000000.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=32141
   ->  Nested Loop  (cost=0.00..1539544.54 rows=582353 width=98)
(actual time=0.027..110.626 rows=500000.00 loops=2)
         Buffers: shared hit=32141
         ->  Parallel Seq Scan on tab0 t0 (cost=0.00..2375.29 rows=582
width=4) (actual time=0.005..2.689 rows=500.00 loops=2)
               Filter: (id0 < 1000)
               Rows Removed by Filter: 49500
               Buffers: shared hit=1640
         ->  Index Scan using tab1_id0_hash on tab1 t1
(cost=0.00..2608.85 rows=3233 width=98) (actual time=0.003..0.153
rows=1000.00 loops=1000)
               Index Cond: (id0 = t0.id0)
               Index Searches: 1000
               Buffers: shared hit=30501
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.189 ms
 Execution Time: 191.870 ms
(18 rows)

That is more like it! So ahem, note to self: remember to tell the
planner you are using storage where random access is just (or nearly) as
fast as sequential.

Now I can get the planner to use the index on tab0 by making parallel
operation more expensive:

test0=# SET parallel_setup_cost=1000000;
SET

test0=# EXPLAIN ANALYZE SELECT t0.id0, t1.val
                FROM tab0 AS t0
                JOIN tab1 AS t1 ON (t0.id0 = t1.id0)
                WHERE t0.id0 < 1000;
 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..2614793.04 rows=990000 width=98) (actual
time=0.027..205.270 rows=1000000.00 loops=1)
   Buffers: shared hit=30505
   ->  Index Only Scan using tab0_pkey on tab0 t0 (cost=0.29..20.62
rows=990 width=4) (actual time=0.014..0.147 rows=1000.00 loops=1)
         Index Cond: (id0 < 1000)
         Heap Fetches: 0
         Index Searches: 1
         Buffers: shared hit=5
   ->  Index Scan using tab1_id0_hash on tab1 t1 (cost=0.00..2608.85
rows=3233 width=98) (actual time=0.003..0.143 rows=1000.00 loops=1000)
         Index Cond: (id0 = t0.id0)
         Index Searches: 1000
         Buffers: shared hit=30500
 Planning:
   Buffers: shared hit=9
 Planning Time: 0.182 ms
 Execution Time: 232.360 ms
(15 rows)

But the parallel plan is better! I guess one question to consider is: is
the slightly faster parallel plan way more expensive in terms of
resources? It seems likely that it is, so if you had a system that
needed to run many instances of this type of query would you be better
off discouraging parallel execution of it? Anyway, I found this an
interesting exercise, hopefully others might too.

regards

Mark

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Mark Kirkwood 2026-01-09 01:15:10 Re: Another unexpected planner choice in simple JOIN
Previous Message Mark Kirkwood 2026-01-08 04:55:06 Re: Unexpected planner choice in simple JOIN