Re: apply_scanjoin_target_to_paths and partitionwise join

From: Arne Roland <arne(dot)roland(at)malkut(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com>
Subject: Re: apply_scanjoin_target_to_paths and partitionwise join
Date: 2025-10-30 01:23:22
Message-ID: 527dee29-2fd1-4a97-b1bc-d16b9db120c9@malkut.net
Lists: pgsql-hackers


On 2025-10-29 19:55, Robert Haas wrote:
> On Wed, Oct 29, 2025 at 2:06 PM Arne Roland<arne(dot)roland(at)malkut(dot)net> wrote:
>> This virtually equivalent query issue occurs when the join condition is
>> (almost) unique. The different amount of tuples to process clearly
>> occurs when they are not.
> I'm having trouble interpreting this. If it's important, please
> clarify and show an example.

Thank you for asking. I hope my explanations are clear. If not, I am
happy to explain a particular thing in more detail.
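
For readers who want to follow along: something along these lines should
give plans of a comparable shape. It is only my sketch for illustration;
the exact tables and columns from the upthread example may differ.

create table dupfest (a text) partition by range (a);
create table dupfest1 partition of dupfest for values from ('0') to ('1');
create table dupfest2 partition of dupfest for values from ('1') to ('2');
create table dupfest3 partition of dupfest for values from ('2') to ('3');
-- 10001 copies of a single one-character value per partition
insert into dupfest
    select i::text from generate_series(0, 2) i, generate_series(1, 10001);
create index on dupfest (a);  -- creates dupfest1_a_idx etc. on the partitions
vacuum analyze dupfest;
set enable_partitionwise_join = on;  -- off for the second plan
explain
    select count(*) from (select * from dupfest t1, dupfest t2
                          where t1.a = t2.a order by t1.a offset 0);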

The main factor in your example is that the number of rows handled by
the (Merge) Append differs between the two plans.

In the partitionwise join we process a lot of rows, namely 300060003:

Aggregate  (cost=12130962.32..12130962.33 rows=1 width=8)
  ->  Merge Append  (cost=0.88..8380212.28 rows=300060003 width=34)
        Sort Key: t1.a
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_1.a = t2_1.a)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest1 t2_1  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_2.a = t2_2.a)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest2 t2_2  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_3.a = t2_3.a)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest3 t2_3  (cost=0.00..145.01 rows=10001 width=2)

In the non-partitionwise case the Appends handle fewer rows, because the
row explosion happens above them: the join produces more rows than either
of the partitioned tables had before, so each Append only processes
30003 rows.

Aggregate  (cost=8253191.53..8253191.54 rows=1 width=8) (actual time=64208.334..64208.337 rows=1 loops=1)
  ->  Merge Join  (cost=1.71..4502441.21 rows=300060025 width=34) (actual time=28.900..51731.558 rows=300060003 loops=1)
        Merge Cond: (t1.a = t2.a)
        ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.036..7.044 rows=30003 loops=1)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.034..1.436 rows=10001 loops=1)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.513 rows=10001 loops=1)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.408 rows=10001 loops=1)
        ->  Materialize  (cost=0.86..807.92 rows=30003 width=2) (actual time=0.014..13787.902 rows=300050003 loops=1)
              ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.011..4.225 rows=30003 loops=1)
                    ->  Index Only Scan using dupfest1_a_idx on dupfest1 t2_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.010..0.698 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest2_a_idx on dupfest2 t2_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.005..0.708 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest3_a_idx on dupfest3 t2_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.006..0.776 rows=10001 loops=1)

A very common case I have seen for partitionwise joins is joining along
some foreign key structure. There are several design paradigms that make
this common. If you join across a foreign key, the number of tuples
doesn't increase. Consider a table definition like

create table dupfest (
    a text,
    id bigint not null generated always as identity,
    primary key (a, id),
    foreign key (a, id) references dupfest (a, id)
) partition by range (a);

With the query

select count(*)
from (select * from dupfest t1, dupfest t2
      where t1.a = t2.a and t1.id = t2.id
      order by t1.a offset 0);

If we join along the foreign key from t1 to t2, we know that the join
can't contain more tuples than t1 did. This may seem like a very special
case, but as far as enable_partitionwise_join = true is concerned, it's
definitely a common use case.
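
To actually run these queries one also needs partitions and some data.
A sketch of what that could look like (the partition bounds and row
counts here are illustrative choices of mine, not the exact data set I
benchmarked):

create table dupfest_p0 partition of dupfest for values from ('0') to ('1');
create table dupfest_p1 partition of dupfest for values from ('1') to ('2');
create table dupfest_p2 partition of dupfest for values from ('2') to ('3');
-- heavily duplicated join keys in "a"; (a, id) stays unique via the identity column
insert into dupfest (a) select (i % 3)::text from generate_series(1, 30000) i;
vacuum analyze dupfest;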

If we remove the ORDER BY, these cases also show very similar performance
behavior; the difference in time is less than the noise threshold of my
benchmark, regardless of the amount of data, as long as work_mem is
sufficiently large:

select count(*)
from (select * from dupfest t1, dupfest t2
      where t1.a = t2.a and t1.id = t2.id offset 0);

If we load 10 times the data, we see only a marginal difference in the
cost (and similar performance differences, unless I run explain analyze,
which ruins the time readings again).
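
Roughly, as a sketch building on the population above:

-- about ten times the data
insert into dupfest (a) select (i % 3)::text from generate_series(1, 270000) i;
vacuum analyze dupfest;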

My second sentence just captured the mundane observation that, if the
join has significantly more tuples than any base relation, the placement
of the (Merge) Append might be more relevant. If I join everything with a
generate_series(1, 30000), I get more tuples to process.
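
As a sketch of what I mean (not the exact query I ran):

select count(*)
from (select * from dupfest t1, dupfest t2, generate_series(1, 30000) g
      where t1.a = t2.a and t1.id = t2.id
      order by t1.a offset 0);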

I'd like to make one more side note about this example: the planner
punishes the partitionwise join for having an extra node that emits N
rows (three hash joins + Append vs. two Appends + hash join). This plan
is chosen because of cpu_tuple_cost. I'm happy it picks the plan with the
smaller memory footprint, but in my real-world experience, for a
timing-based approach, the default cpu_tuple_cost tends to be too high to
get a fair comparison between partitionwise and non-partitionwise joins.
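
For illustration, one could experiment with something like

set cpu_tuple_cost = 0.005;  -- default is 0.01; the value here is only an illustration

and see how the comparison between the two plan shapes shifts.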

All the best
Arne
