Re: BUG #18921: The larger the value of max_parallel_workers_per_gather, the longer the SQL execution time.

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: xiaohongjun(at)stu(dot)xidian(dot)edu(dot)cn, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18921: The larger the value of max_parallel_workers_per_gather, the longer the SQL execution time.
Date: 2025-05-11 08:32:39
Message-ID: CAFiTN-sGZjPkFtregsKu65ab1L-_jaADfOmRav9w3Wtx0D7Lrg@mail.gmail.com
Lists: pgsql-bugs

On Sat, May 10, 2025 at 6:18 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 18921
> Logged by: hongjun xiao
> Email address: xiaohongjun(at)stu(dot)xidian(dot)edu(dot)cn
> PostgreSQL version: 17.4
> Operating system: Linux-Ubuntu 20.04.6 LTS
> Description:
>
> -- performance bug for query: SELECT ALL t0.c0, t1.c0 FROM ONLY t0 RIGHT

Thanks for reporting this. I tried to analyze the issue, and based on my
analysis it doesn't appear to be a parallelism problem. Instead, the root
cause looks like a poor plan choice at the bottom of the plan, due to
inaccurate statistics for one of the tables, which then throws the rest of
the plan off. I have given a more detailed analysis below.

I could see that the parallel plan is slower than the non-parallel plan, so
I investigated further. The root cause of the issue is not parallelism but
an estimation problem at the very bottom of the plan: as shown below [1],
the "Seq Scan on public.t1" is estimated to return 1 row, whereas it
actually returns 75 rows. That scan is the outer side of a nested loop
join, and because of this underestimation no materialization is chosen for
the inner side, which makes the whole plan execution worse; see the
complete plan [2].
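
Just to show what I would check first for that estimate -- this is only a
sketch, assuming the table/column names visible in the plans (t0, t1,
column c0), and it may well not change anything, because the filter
compares the column with itself and the planner may simply be falling back
on a default selectivity:

-- refresh statistics for the involved tables (hypothetical first step)
ANALYZE t0;
ANALYZE t1;

-- inspect what the planner knows about these tables and about t1.c0
SELECT relname, reltuples, relpages FROM pg_class WHERE relname IN ('t0', 't1');
SELECT tablename, attname, null_frac, n_distinct
FROM pg_stats WHERE tablename = 't1' AND attname = 'c0';

If the estimate for "Seq Scan on public.t1" stays at 1 row after ANALYZE,
that would point at the selectivity of (t1.c0 && t1.c0) rather than at
stale statistics.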

I then tried disabling the nested loop join to see what changes, and the
execution time dropped from 13.764 ms to 0.621 ms [3]. The plan still
contains nested loop joins (disabling them only adds a cost penalty, it
does not forbid them), but turning them off makes the planner choose a
different combination of nested loops in which "Seq Scan on public.t1"
becomes the inner node instead of the outer one. With that arrangement the
problem caused by the underestimation is gone, because we are no longer
looping over the inner table 75 times. This plan turns out to be better in
several other ways as well, although I did not analyze every aspect of the
plan change.
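
For completeness, this is roughly the experiment behind [3] (a sketch only;
<query> stands for the query from the report, which I'm not repeating
here):

SET enable_nestloop = off;  -- adds a large cost penalty to nested loops, it does not forbid them
EXPLAIN (ANALYZE, VERBOSE) <query>;
RESET enable_nestloop;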

For reference, I'm also including the parallel plan [4], with the nested
loop still enabled. While it's slightly slower than the non-parallel plan
[2], the real issue, as mentioned earlier, is that it falls victim to a bad
plan caused by inaccurate statistics.
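
Something along these lines reproduces the comparison between the
non-parallel and parallel runs (again only a sketch, not necessarily the
exact commands; <query> is the reported query):

-- non-parallel baseline (similar to [2])
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE) <query>;

-- allow the parallel plan (similar to [4])
SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE) <query>;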

[1]
-> Nested Loop (cost=0.00..4.12 rows=1 width=26) (actual time=13.486..13.486 rows=0 loops=1)
      Output: t0.c0, t1.c0
      Join Filter: ((t0.c0 <= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) >= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) <= ((t0.c0 + (t0.c0 + '[-1506176850,461487478)'::int4range)) + (t0.c0 + ('[74602774,1501850467)'::int4range + t0.c0)))))
      -> Seq Scan on public.t1 (cost=0.00..2.00 rows=1 width=13) (actual time=0.028..0.114 rows=75 loops=1)
            Output: t1.c0
            Filter: (t1.c0 && t1.c0)
            Rows Removed by Filter: 5
      -> Seq Scan on public.t0 (cost=0.00..1.90 rows=4 width=13) (actual time=0.176..0.176 rows=0 loops=75)
            Output: t0.c0
            Filter: ((t0.c0 >= t0.c0) AND (t0.c0 > ((t0.c0 * ('[-2011769718,907690705)'::int4range * t0.c0)) + t0.c0)))
            Rows Removed by Filter: 40

[2]
HashAggregate (cost=3908.21..3908.25 rows=4 width=26) (actual time=13.595..13.598 rows=0 loops=1)
  Output: t0.c0, t1.c0
  Group Key: t0.c0, t1.c0
  Batches: 1 Memory Usage: 24kB
  -> Nested Loop (cost=0.00..2804.21 rows=220800 width=26) (actual time=13.591..13.593 rows=0 loops=1)
        Output: t0.c0, t1.c0
        -> Nested Loop (cost=0.00..42.21 rows=2760 width=26) (actual time=13.590..13.592 rows=0 loops=1)
              Output: t0.c0, t1.c0
              -> Seq Scan on public.t2 sub0_t2 (cost=0.00..1.69 rows=69 width=0) (actual time=0.067..0.077 rows=69 loops=1)
                    Output: sub0_t2.c0
              -> Materialize (cost=0.00..6.12 rows=40 width=26) (actual time=0.196..0.196 rows=0 loops=69)
                    Output: t0.c0, t1.c0
                    -> Nested Loop (cost=0.00..5.92 rows=40 width=26) (actual time=13.486..13.487 rows=0 loops=1)
                          Output: t0.c0, t1.c0
                          -> Nested Loop (cost=0.00..4.12 rows=1 width=26) (actual time=13.486..13.486 rows=0 loops=1)
                                Output: t0.c0, t1.c0
                                Join Filter: ((t0.c0 <= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) >= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) <= ((t0.c0 + (t0.c0 + '[-1506176850,461487478)'::int4range)) + (t0.c0 + ('[74602774,1501850467)'::int4range + t0.c0)))))
                                -> Seq Scan on public.t1 (cost=0.00..2.00 rows=1 width=13) (actual time=0.028..0.114 rows=75 loops=1)
                                      Output: t1.c0
                                      Filter: (t1.c0 && t1.c0)
                                      Rows Removed by Filter: 5
                                -> Seq Scan on public.t0 (cost=0.00..1.90 rows=4 width=13) (actual time=0.176..0.176 rows=0 loops=75)
                                      Output: t0.c0
                                      Filter: ((t0.c0 >= t0.c0) AND (t0.c0 > ((t0.c0 * ('[-2011769718,907690705)'::int4range * t0.c0)) + t0.c0)))
                                      Rows Removed by Filter: 40
                          -> Seq Scan on public.t0 sub0_t0 (cost=0.00..1.40 rows=40 width=0) (never executed)
                                Output: sub0_t0.c0
        -> Materialize (cost=0.00..2.20 rows=80 width=0) (never executed)
              -> Seq Scan on public.t1 sub0_t1 (cost=0.00..1.80 rows=80 width=0) (never executed)
Planning Time: 0.885 ms
Execution Time: 13.764 ms
(31 rows)

[3]
Group (cost=40000027688.14..40000029344.14 rows=4 width=26) (actual time=0.364..0.368 rows=0 loops=1)
  Output: t0.c0, t1.c0
  Group Key: t0.c0, t1.c0
  -> Sort (cost=40000027688.14..40000028240.14 rows=220800 width=26) (actual time=0.363..0.366 rows=0 loops=1)
        Output: t0.c0, t1.c0
        Sort Key: t0.c0, t1.c0
        Sort Method: quicksort Memory: 25kB
        -> Nested Loop (cost=40000000000.00..40000002804.51 rows=220800 width=26) (actual time=0.324..0.327 rows=0 loops=1)
              Output: t0.c0, t1.c0
              -> Nested Loop (cost=30000000000.00..30000000042.51 rows=2760 width=26) (actual time=0.324..0.326 rows=0 loops=1)
                    Output: t0.c0, t1.c0
                    -> Nested Loop (cost=20000000000.00..20000000006.51 rows=69 width=26) (actual time=0.323..0.325 rows=0 loops=1)
                          Output: t0.c0, t1.c0
                          -> Nested Loop (cost=10000000000.00..10000000004.13 rows=1 width=26) (actual time=0.323..0.324 rows=0 loops=1)
                                Output: t0.c0, t1.c0
                                Join Filter: ((t0.c0 <= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) >= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) <= ((t0.c0 + (t0.c0 + '[-1506176850,461487478)'::int4range)) + (t0.c0 + ('[74602774,1501850467)'::int4range + t0.c0)))))
                                -> Seq Scan on public.t0 (cost=0.00..1.90 rows=4 width=13) (actual time=0.322..0.322 rows=0 loops=1)
                                      Output: t0.c0
                                      Filter: ((t0.c0 >= t0.c0) AND (t0.c0 > ((t0.c0 * ('[-2011769718,907690705)'::int4range * t0.c0)) + t0.c0)))
                                      Rows Removed by Filter: 40
                                -> Materialize (cost=0.00..2.00 rows=1 width=13) (never executed)
                                      Output: t1.c0
                                      -> Seq Scan on public.t1 (cost=0.00..2.00 rows=1 width=13) (never executed)
                                            Output: t1.c0
                                            Filter: (t1.c0 && t1.c0)
                          -> Seq Scan on public.t2 sub0_t2 (cost=0.00..1.69 rows=69 width=0) (never executed)
                                Output: sub0_t2.c0
                    -> Materialize (cost=0.00..1.60 rows=40 width=0) (never executed)
                          -> Seq Scan on public.t0 sub0_t0 (cost=0.00..1.40 rows=40 width=0) (never executed)
              -> Materialize (cost=0.00..2.20 rows=80 width=0) (never executed)
                    -> Seq Scan on public.t1 sub0_t1 (cost=0.00..1.80 rows=80 width=0) (never executed)
Planning Time: 1.048 ms
Execution Time: 0.621 ms
(33 rows)

[4]
Group (cost=3716.70..3717.68 rows=4 width=26) (actual time=16.097..16.132 rows=0 loops=1)
  Group Key: t0.c0, t1.c0
  -> Gather Merge (cost=3716.70..3717.64 rows=8 width=26) (actual time=16.096..16.130 rows=0 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -> Sort (cost=2716.68..2716.69 rows=4 width=26) (actual time=4.968..4.970 rows=0 loops=3)
              Sort Key: t0.c0, t1.c0
              Sort Method: quicksort Memory: 25kB
              Worker 0: Sort Method: quicksort Memory: 25kB
              Worker 1: Sort Method: quicksort Memory: 25kB
              -> Partial HashAggregate (cost=2716.60..2716.64 rows=4 width=26) (actual time=4.918..4.920 rows=0 loops=3)
                    Group Key: t0.c0, t1.c0
                    Batches: 1 Memory Usage: 24kB
                    Worker 0: Batches: 1 Memory Usage: 24kB
                    Worker 1: Batches: 1 Memory Usage: 24kB
                    -> Nested Loop (cost=0.00..2256.60 rows=92000 width=26) (actual time=4.916..4.917 rows=0 loops=3)
                          -> Parallel Seq Scan on t0 sub0_t0 (cost=0.00..1.17 rows=17 width=0) (actual time=0.016..0.021 rows=13 loops=3)
                          -> Nested Loop (cost=0.00..77.47 rows=5520 width=26) (actual time=0.366..0.366 rows=0 loops=40)
                                -> Seq Scan on t1 sub0_t1 (cost=0.00..1.80 rows=80 width=0) (actual time=0.005..0.016 rows=80 loops=40)
                                -> Materialize (cost=0.00..6.85 rows=69 width=26) (actual time=0.004..0.004 rows=0 loops=3200)
                                      -> Nested Loop (cost=0.00..6.50 rows=69 width=26) (actual time=13.206..13.206 rows=0 loops=1)
                                            -> Nested Loop (cost=0.00..4.12 rows=1 width=26) (actual time=13.205..13.206 rows=0 loops=1)
                                                  Join Filter: ((t0.c0 <= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) >= t1.c0) AND (((t0.c0 + (t0.c0 + t1.c0)) * ((t0.c0 + t1.c0) + t1.c0)) <= ((t0.c0 + (t0.c0 + '[-1506176850,461487478)'::int4range)) + (t0.c0 + ('[74602774,1501850467)'::int4range + t0.c0)))))
                                                  -> Seq Scan on t1 (cost=0.00..2.00 rows=1 width=13) (actual time=0.026..0.109 rows=75 loops=1)
                                                        Filter: (c0 && c0)
                                                        Rows Removed by Filter: 5
                                                  -> Seq Scan on t0 (cost=0.00..1.90 rows=4 width=13) (actual time=0.173..0.173 rows=0 loops=75)
                                                        Filter: ((c0 >= c0) AND (c0 > ((c0 * ('[-2011769718,907690705)'::int4range * c0)) + c0)))
                                                        Rows Removed by Filter: 40
                                            -> Seq Scan on t2 sub0_t2 (cost=0.00..1.69 rows=69 width=0) (never executed)
Planning Time: 1.112 ms
Execution Time: 16.338 ms
(32 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
