| From: | Arne Roland <arne(dot)roland(at)malkut(dot)net> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com> | 
| Subject: | Re: apply_scanjoin_target_to_paths and partitionwise join | 
| Date: | 2025-10-30 01:23:22 | 
| Message-ID: | 527dee29-2fd1-4a97-b1bc-d16b9db120c9@malkut.net | 
| Lists: | pgsql-hackers | 
On 2025-10-29 19:55, Robert Haas wrote:
> On Wed, Oct 29, 2025 at 2:06 PM Arne Roland <arne(dot)roland(at)malkut(dot)net> wrote:
>> This virtually equivalent query issue occurs when the join condition is
>> (almost) unique. The different amount of tuples to process clearly
>> occurs when they are not.
> I'm having trouble interpreting this. If it's important, please
> clarify and show an example.
Thank you for asking. I hope my explanations are clear. If not, I am 
happy to explain any particular point in more detail.
The main factor in your example is that the number of rows handled by 
the (Merge) Append differs.
In the partitionwise join we process a lot of rows, namely 300060003:
Aggregate  (cost=12130962.32..12130962.33 rows=1 width=8)
  ->  Merge Append  (cost=0.88..8380212.28 rows=300060003 width=34)
        Sort Key: t1.a
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_1.a = t2_1.a)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest1 t2_1  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_2.a = t2_2.a)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest2 t2_2  (cost=0.00..145.01 rows=10001 width=2)
        ->  Nested Loop  (cost=0.29..1500664.33 rows=100020001 width=34)
              Join Filter: (t1_3.a = t2_3.a)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2)
              ->  Materialize  (cost=0.00..195.01 rows=10001 width=2)
                    ->  Seq Scan on dupfest3 t2_3  (cost=0.00..145.01 rows=10001 width=2)
In the non-partitioned case the Appends handle far fewer rows, because 
they sit below the join: the join still produces *more* rows than either 
of the partitioned tables had before, but that multiplication happens 
above the Append, which only processes 30003 rows.
Aggregate  (cost=8253191.53..8253191.54 rows=1 width=8) (actual time=64208.334..64208.337 rows=1 loops=1)
  ->  Merge Join  (cost=1.71..4502441.21 rows=300060025 width=34) (actual time=28.900..51731.558 rows=300060003 loops=1)
        Merge Cond: (t1.a = t2.a)
        ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.036..7.044 rows=30003 loops=1)
              ->  Index Only Scan using dupfest1_a_idx on dupfest1 t1_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.034..1.436 rows=10001 loops=1)
              ->  Index Only Scan using dupfest2_a_idx on dupfest2 t1_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.513 rows=10001 loops=1)
              ->  Index Only Scan using dupfest3_a_idx on dupfest3 t1_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.011..1.408 rows=10001 loops=1)
        ->  Materialize  (cost=0.86..807.92 rows=30003 width=2) (actual time=0.014..13787.902 rows=300050003 loops=1)
              ->  Append  (cost=0.86..732.91 rows=30003 width=2) (actual time=0.011..4.225 rows=30003 loops=1)
                    ->  Index Only Scan using dupfest1_a_idx on dupfest1 t2_1  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.010..0.698 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest2_a_idx on dupfest2 t2_2  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.005..0.708 rows=10001 loops=1)
                    ->  Index Only Scan using dupfest3_a_idx on dupfest3 t2_3  (cost=0.29..194.30 rows=10001 width=2) (actual time=0.006..0.776 rows=10001 loops=1)
A very common case I have seen for partitionwise joins is some foreign 
key structure; there are several design paradigms that make it common. 
If you join across a foreign key, the number of tuples doesn't 
increase. Think of a table definition like

create table dupfest (
    a text,
    id bigint not null generated always as identity,
    primary key (a, id),
    foreign key (a, id) references dupfest (a, id)
) partition by range (a);
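For completeness, a setup along these lines would match the shape of 
the plans above; the partition bounds, index, and data distribution are 
illustrative assumptions on my part, not the exact benchmark script:

create table dupfest1 partition of dupfest for values from ('a') to ('b');
create table dupfest2 partition of dupfest for values from ('b') to ('c');
create table dupfest3 partition of dupfest for values from ('c') to ('d');
create index on dupfest (a);  -- yields dupfest1_a_idx etc. on the partitions
insert into dupfest (a)       -- 10001 duplicates each of 'a', 'b' and 'c'
  select chr(ascii('a') + g / 10001) from generate_series(0, 30002) g;
analyze dupfest;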
With the query

select count(*) from
  (select * from dupfest t1, dupfest t2
    where t1.a = t2.a and t1.id = t2.id
    order by t1.a offset 0);
If we join along a foreign key from t1 to t2, we know that the join 
can't contain more tuples than t1 did. This may seem like a very special 
case, but as far as enable_partitionwise_join = true is concerned it is 
definitely a common use case.
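A quick sanity check along these lines should return true, since 
(a, id) is unique on the referenced side (the query is mine for 
illustration, not from the benchmark):

select (select count(*) from dupfest) =
       (select count(*) from dupfest t1 join dupfest t2 using (a, id));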
If we remove the order clause, these cases also show very similar 
performance behavior; the difference in time is below the noise 
threshold of my benchmark, regardless of the amount of data, as long as 
work_mem is sufficiently large:

select count(*) from
  (select * from dupfest t1, dupfest t2
    where t1.a = t2.a and t1.id = t2.id offset 0);
If we load ten times the data, we see only a marginal difference in the 
cost (and similarly marginal performance differences, unless I run 
explain analyze, which ruins the time readings again).
My second sentence just captured the mundane observation that, if the 
join has significantly more tuples than any base relation, the placement 
of the (Merge) Append becomes more relevant. If I join everything with a 
generate_series(1, 30000), I get more tuples to process.
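To make that concrete, a query of roughly this shape (illustrative, not 
the exact one I ran) blows up the row count that has to pass through 
the nodes above the join:

select count(*) from
  (select * from dupfest t1
     join generate_series(1, 30000) g(i) on true
    offset 0);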
I'd like to make one more side note about this example: the planner 
punishes the partitionwise join for having an extra node that emits N 
rows (three hash joins + Append vs. two Appends + hash join). This plan 
is chosen because of cpu_tuple_cost. I'm happy it picks the plan with 
the smaller memory footprint, but in my real-world experience the 
default cpu_tuple_cost tends to be too high for a fair, timing-based 
comparison between partitionwise and non-partitionwise joins.
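For anyone who wants to reproduce that effect, the parameter can be 
lowered per session; 0.005 is just an illustrative value, not a 
recommendation:

set cpu_tuple_cost = 0.005;  -- default is 0.01
explain select count(*) from
  (select * from dupfest t1, dupfest t2
    where t1.a = t2.a and t1.id = t2.id offset 0);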
All the best
Arne