Re: apply_scanjoin_target_to_paths and partitionwise join

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, arne(dot)roland(at)malkut(dot)net, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com>
Subject: Re: apply_scanjoin_target_to_paths and partitionwise join
Date: 2025-10-27 21:12:26
Message-ID: CA+TgmoZvBD+5vyQruXBVXW74FMgWxE=O4K4rCrCtEELWNj-MLA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 2, 2025 at 3:58 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I am wondering if the problem is not that the plan is slower, it's
> that for some reason the planner took a lot longer to create it.
> It's very plausible that partitionwise planning takes longer, and
> maybe we have some corner cases where the time is O(N^2) or worse.
>
> However, this is pure speculation without a test case, and any
> proposed fix would be even more speculative. I concur with your
> bottom line: we should insist on a public test case before deciding
> what to do about it.

I had the humbling experience of rediscovering this problem today and
then, shortly thereafter, realizing that Ashutosh's originally
explanation of the problem was correct and that I simply failed to
understand it at the time. Consider this query from the regression
tests, which Ashutosh's patch adjusts:

SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a
AND t1.a = t2.b ORDER BY t1.a, t2.b;

Right now, the regression tests get this plan:

Sort (cost=11.52..11.53 rows=3 width=18)
Sort Key: t1.a
-> Append (cost=2.20..11.50 rows=3 width=18)
-> Merge Join (cost=2.20..3.58 rows=1 width=18)
Merge Cond: (t1_1.a = t2_1.a)
-> Index Scan using iprt1_p1_a on prt1_p1 t1_1
(cost=0.14..14.02 rows=125 width=9)
-> Sort (cost=2.06..2.06 rows=1 width=13)
Sort Key: t2_1.b
-> Seq Scan on prt2_p1 t2_1 (cost=0.00..2.05
rows=1 width=13)
Filter: (a = b)
-> Hash Join (cost=2.05..4.78 rows=1 width=18)
Hash Cond: (t1_2.a = t2_2.a)
-> Seq Scan on prt1_p2 t1_2 (cost=0.00..2.25 rows=125 width=9)
-> Hash (cost=2.04..2.04 rows=1 width=13)
-> Seq Scan on prt2_p2 t2_2 (cost=0.00..2.04
rows=1 width=13)
Filter: (a = b)
-> Hash Join (cost=1.43..3.12 rows=1 width=18)
Hash Cond: (t1_3.a = t2_3.a)
-> Seq Scan on prt1_p3 t1_3 (cost=0.00..1.50 rows=50 width=9)
-> Hash (cost=1.41..1.41 rows=1 width=13)
-> Seq Scan on prt2_p3 t2_3 (cost=0.00..1.41
rows=1 width=13)
Filter: (a = b)

But if you disable paritionwise join for this query, you get this plan:

Merge Join (cost=5.99..7.99 rows=3 width=18)
Merge Cond: (t1.a = t2.a)
-> Merge Append (cost=0.45..44.83 rows=300 width=9)
Sort Key: t1.a
-> Index Scan using iprt1_p1_a on prt1_p1 t1_1
(cost=0.14..14.02 rows=125 width=9)
-> Index Scan using iprt1_p2_a on prt1_p2 t1_2
(cost=0.14..14.02 rows=125 width=9)
-> Index Scan using iprt1_p3_a on prt1_p3 t1_3
(cost=0.14..12.89 rows=50 width=9)
-> Sort (cost=5.54..5.55 rows=3 width=13)
Sort Key: t2.b
-> Append (cost=0.00..5.51 rows=3 width=13)
-> Seq Scan on prt2_p1 t2_1 (cost=0.00..2.05 rows=1 width=13)
Filter: (a = b)
-> Seq Scan on prt2_p2 t2_2 (cost=0.00..2.04 rows=1 width=13)
Filter: (a = b)
-> Seq Scan on prt2_p3 t2_3 (cost=0.00..1.41 rows=1 width=13)
Filter: (a = b)

This demonstrates that, in this case, setting
enable_partitionwise_join = on causes the planner to switch to a plan
with a higher total cost, which obviously shouldn't happen. It's still
unclear to me how this can ever cause a slowdown of the magnitude that
is reported to have happened in an EDB customer scenario, but I think
this example is sufficient to prove that the current behavior is
fundamentally incorrect. The issue is that when
apply_scanjoin_target_to_paths() sets the pathlist to NIL, it does so
on the assumption that all the same paths will be re-added afterwards,
except with the correct scan/join target. But for partitionwise join
relations, this is not true: any Append and MergeAppend paths will be
re-added afterwards, but any non-partitionwise joins are only added
before the pathlist is zapped, and therefore the planner loses the
ability to consider them when the pathlist is reset. Said differently,
the existing comment would be correct if *all* of the paths for the
rel had to be Append/MergeAppend paths, which is true for a
partitioned baserel (I think) but untrue for a partitioned joinrel
(for sure).

I haven't had a chance just yet to think through all the details of
the proposed patch, but I now believe we should commit something along
those lines. I still suspect that back-patching is unwise; even though
I now agree with Ashutosh's claim that this is a bug, because previous
experience with destabilizing plans in back-branches has not been
good. Hence, I'm inclined to fix only master. I do think the comments
in the patch need some work, and I plan to tackle that tomorrow.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2025-10-27 21:15:23 Re: another autovacuum scheduling thread
Previous Message Nathan Bossart 2025-10-27 21:00:01 Re: display hot standby state in psql prompt