| From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
|---|---|
| To: | Arne Roland <arne(dot)roland(at)malkut(dot)net> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com> |
| Subject: | Re: apply_scanjoin_target_to_paths and partitionwise join |
| Date: | 2025-10-29 16:47:23 |
| Message-ID: | CA+TgmoY9egbm1qehv=aSp+cwckOdbbdRePaXpBRSdL6PWDjvuQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, Oct 29, 2025 at 8:47 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think the best shot at coming up with a
> reproducer here is to study the cost differences in the queries where
> the plan changes with the fix, particularly Q1 from my prior email.
So, the really interesting thing about Q1 is that it contains a join
which inflates the row count by a factor of about 84. We first join
t1, which has 1001 rows, to t3, which has 1001 rows, and get 1001
rows. Then we join to t2, which also has 1001 rows, and we get 83503
rows. It is estimated to be mildly more efficient to perform the t1-t3
join partitionwise: it costs 98.69 cost units to do it partitionwise
and 104.56 cost units to do it non-partitionwise. However, the planner
believes that doing the subsequent join to t2 in partitionwise fashion
is a bad idea. The query has an ORDER BY clause, which means that
after we finish doing the partitionwise part of the operation, we need
to perform a Merge Append to restore the sort order. If we do the
Merge Append before joining to t2, we only have to Merge Append 1001
rows, but if we do the Merge Append after joining to t2, we have to
Merge Append 83530 rows. The estimated cost of Merge Append is
proportional to the number of input rows, so doing that MergeAppend
after the join to t2 is estimated to be 84 times as expensive. In some
situations, we might make up for that loss by being able to do the
join more efficiently, but here the planner does not believe that to
be the case: we can do the join to t2 with a Merge Join over an
Append over one index scan per partition, and there's basically no
overhead vs. a partitionwise join. Hence, from the planner's point of
view, doing the join to t2 in partitionwise fashion is a significant
loss.
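To make the shape of the problem concrete, here's roughly what such a
query looks like (an illustrative sketch, not the literal Q1; assume
t1, t2, and t3 are identically partitioned on column a, with t2
containing many duplicate values of a, so the second join inflates
the row count):

select * from t1
  join t3 on t1.a = t3.a
  join t2 on t1.a = t2.a
order by t1.a;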
I had difficulty reproducing this theoretical performance regression
in practice. I found that doing the whole thing partitionwise, doing
the whole thing non-partitionwise, and doing only the t1-t3 join
partitionwise weren't that different in runtime, and the partitionwise
approach actually seemed to be a little faster. But I constructed the
following example, similar to but simpler than the one in the
regression tests, which does show a regression for me in practice:
drop table if exists dupfest;
create table dupfest (a text) partition by range(a);
create table dupfest1 partition of dupfest for values from ('0') to ('3');
create table dupfest2 partition of dupfest for values from ('3') to ('6');
create table dupfest3 partition of dupfest for values from ('6') to ('A');
insert into dupfest
select '0' from generate_series(0, 10000) i
union all
select '3' from generate_series(0, 10000) i
union all
select '6' from generate_series(0, 10000) i;
create index on dupfest(a);
analyze dupfest;
set max_parallel_workers_per_gather = 0;
My test query was:
select count(*) from (select * from dupfest t1, dupfest t2
  where t1.a = t2.a order by t1.a offset 0);
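For anyone who wants to reproduce this, partitionwise join is off by
default, so the two plans can be compared by toggling
enable_partitionwise_join:

set enable_partitionwise_join = off;
explain analyze
select count(*) from (select * from dupfest t1, dupfest t2
  where t1.a = t2.a order by t1.a offset 0);
set enable_partitionwise_join = on;
explain analyze
select count(*) from (select * from dupfest t1, dupfest t2
  where t1.a = t2.a order by t1.a offset 0);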
I want to start by saying that I haven't tried super-hard to do
rigorous benchmarking. This is a debug-enabled, assertion-enabled
build with patched source code. Results may not be fully
representative. But here's what I found. EXPLAIN ANALYZE of this query
without a partitionwise join took 30.8 seconds; without EXPLAIN
ANALYZE, it ran in 15.4 seconds. With partitionwise join, EXPLAIN
ANALYZE of the query ran in 89.6 seconds; without EXPLAIN ANALYZE, it
ran in 64.5 seconds. The plan without partitionwise join was a merge
join with an Append of index-only scans on both sides and a
materialize node on the inner side. With partitionwise join, it
switched to Nested Loop plans with index-only scans on the outer side
and a materialize node over a sequential scan on the inner side,
followed by a Merge Append.
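Schematically, the two shapes were roughly as follows (sketched from
the description above, not pasted EXPLAIN output, with the Aggregate
node on top elided):

Merge Join
  -> Append
       -> Index Only Scan on dupfest1 .. dupfest3
  -> Materialize
       -> Append
            -> Index Only Scan on dupfest1 .. dupfest3

versus

Merge Append
  -> Nested Loop  (one per partition)
       -> Index Only Scan on dupfestN
       -> Materialize
            -> Seq Scan on dupfestN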
A notable point here is that the joins take about the same amount of
time in both plans. In the EXPLAIN ANALYZE output, we see the three
joins in the partitionwise plan taking a total of 24.6 seconds, and
the single join in the non-partitionwise plan taking 24 seconds
(exclusive of times for child nodes). However, the two Append nodes in
the non-partitionwise plan run for a total of 2.5 *milliseconds* while
the single Merge Append node in the partitionwise plan runs for 58.2
seconds (again, exclusive of times for child nodes). Obviously,
EXPLAIN ANALYZE distorts the runtime a lot, but I think the overall
point is nonetheless fairly clear: running a lot of tuples through a
Merge Append node is potentially expensive, and it can be worth
eschewing a partitionwise join to avoid that.
I also tried running the same test without the "order by t1.a". With
that change EXPLAIN ANALYZE took 24.3 seconds without partitionwise
join and 34.8 seconds with partitionwise join. The times without
EXPLAIN ANALYZE were quite close, around 15 seconds either way, but it
looks to me as though the partitionwise plan was probably still a bit
worse. What I think is happening here is that even running a large
number of tuples through Append can have enough overhead to matter in
extreme cases, but EXPLAIN ANALYZE significantly increases the cost of
entering and exiting nodes, so in that case the difference is much
easier to measure.
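For clarity, that variant is just the same query with the sort
removed:

select count(*) from (select * from dupfest t1, dupfest t2
  where t1.a = t2.a offset 0);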
I don't know whether the EDB customer problem that started this thread
was of the same type demonstrated here or not. It may well have been
something else. However, unless I've fouled up the test case shown
above in some way, which is not impossible, this does demonstrate that
it is possible, at least in corner cases, to run into scenarios where
a partitionwise join is worse than a non-partitionwise join. In this
example, the reason it's worse is that postponing the Merge Append
until a later stage results in the Merge Append seeing a much larger
number of rows.
--
Robert Haas
EDB: http://www.enterprisedb.com