| From: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
|---|---|
| To: | Arne Roland <arne(dot)roland(at)malkut(dot)net> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com> | 
| Subject: | Re: apply_scanjoin_target_to_paths and partitionwise join | 
| Date: | 2025-10-30 19:57:39 | 
| Message-ID: | CA+TgmoafrY-=M=Dgr-sVkETCefa2yLFAVADTVLa=JKBwpjtLoA@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Thu, Oct 30, 2025 at 11:52 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Right. Although that's the main thing here, I am inclined to suspect
> there are other ways to hit this problem, maybe ways that are more
> likely to happen in the real world, because...
And just like that, I found another way that this can happen. Consider
this query from the regression tests:
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a
AND t1.a = t2.b ORDER BY t1.a, t2.b;
Each of prt1 and prt2 have three partitions. Since t1.a = t2.a and
t1.a = t2.b, the planner deduces that t2.a = t2.b. The only rows from
t2 where a = b are in the first partition. The planner therefore
estimates that if it just does a Merge Join between the two
partitioned tables, the Merge Join will stop early. There are actually
nine rows in t2 where a = b, but the planner's estimate is just three
rows, so it's understandable that it will very quickly run out of rows
on the t2 side of the join. Thus, in the non-partitionwise plan, while
its estimated cost to scan t1 is 44.83 and its estimated cost to scan
t2 is 5.55, the estimated cost of the join is only 7.99, reflecting
the fact that it doesn't anticipate having to actually finish the scan
of t1.
Now, if it does a partitionwise join, it still picks a Merge Join for
the prt1_p1/prt2_p1 sub-join, and that can still stop early. But for
the prt1_p2/prt2_p2 and prt1_p3/prt2_p3 joins, it picks hash joins,
which as far as the planner knows can't stop early, so there's no
opportunity to get a "discount" on the cost of scanning any of those
tables. As a result, the estimated cost of this plan ends up being
11.53, clearly more than the non-partitionwise estimated cost.
In this case, the planner's methodology doesn't really make a lot of
sense when you stop to think about it. If the planner is correct that
the non-partitionwise join will stop early, then the hash joins that
are chosen in the partitionwise scan will stop early, because the
non-parallel hashjoin code notices when the hash table built for the
inner side is empty and bails out without finishing the outer scan.
But the planner code is understandably reluctant to bet on a table
being completely empty for costing purposes. Performing a
non-partitionwise join allows the planner to make an end run around
this restriction: it can bet on the combined inner table ending up
with no rows contributed by the second or third child tables without
actually betting on any relation being completely empty.
Consequently, it doesn't seem like this type of case can account for
the original report of a massive real-world run time regression. The
planner's mental gymnastics here cause it to think that a
non-partitionwise plan will be faster, but as far as I can tell,
there's no real reason to expect that it actually will be.
-- 
Robert Haas
EDB: http://www.enterprisedb.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Robert Haas | 2025-10-30 20:05:19 | Re: another autovacuum scheduling thread | 
| Previous Message | Masahiko Sawada | 2025-10-30 19:10:42 | Re: Add uuid_to_base32hex() and base32hex_to_uuid() built-in functions |