| From: | Arne Roland <arne(dot)roland(at)malkut(dot)net> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com> | 
| Subject: | Re: apply_scanjoin_target_to_paths and partitionwise join | 
| Date: | 2025-10-30 22:07:50 | 
| Message-ID: | d6920c52-0163-4c1f-a4cd-c0ca666f922e@malkut.net | 
| Lists: | pgsql-hackers | 
On 2025-10-30 16:52, Robert Haas wrote:
> Have you localized the problem to cpu_tuple_cost specifically, vs.
> cpu_index_tuple_cost or cpu_operator_cost? I've generally found that I
> need to reduce random_page_cost and seq_page_cost significantly to
> avoid getting sequential scans when index scans would be more
> reasonable, but that goes in the opposite direction as what you
> suggest here, in that it brings the I/O and CPU costs closer together,
> whereas your suggestion would push them further apart. I remember that
> Kevin Grittner used to say that the default value of this parameter
> was bad, too, but he recommended *raising* it:
In the particular cases I looked at, cpu_tuple_cost is the main issue, 
yes. I am not saying that cpu_tuple_cost is the real culprit in general. 
I tune random_page_cost and seq_page_cost frequently, but you rarely 
need to worry about the various cpu_* parameters. I remember one 
database where I lowered it back to the default, but that was a few 
years ago. I am no expert on cpu_tuple_cost generally.
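To make concrete what I mean by isolating the knob, something like this 
is usually enough (schema and numbers are purely illustrative, not taken 
from any real workload):

CREATE TEMP TABLE cost_demo AS
    SELECT g AS a, md5(g::text) AS b FROM generate_series(1, 100000) g;
CREATE INDEX ON cost_demo (a);
ANALYZE cost_demo;

EXPLAIN SELECT * FROM cost_demo WHERE a < 1000;  -- default cpu_tuple_cost = 0.01

BEGIN;
SET LOCAL cpu_tuple_cost = 0.05;                 -- or lower it, whichever direction you suspect
EXPLAIN SELECT * FROM cost_demo WHERE a < 1000;  -- same query; compare costs and plan shape
ROLLBACK;

With everything else left at its default, that usually shows quickly 
whether cpu_tuple_cost alone moves the plan, or whether it only moves 
once random_page_cost/seq_page_cost change as well.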
On 2025-10-30 20:57, Robert Haas wrote:
> On Thu, Oct 30, 2025 at 11:52 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Right. Although that's the main thing here, I am inclined to suspect
>> there are other ways to hit this problem, maybe ways that are more
>> likely to happen in the real world, because...
> And just like that, I found another way that this can happen. Consider
> this query from the regression tests:
>
> SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.a
> AND t1.a = t2.b ORDER BY t1.a, t2.b;
>
> Each of prt1 and prt2 have three partitions. Since t1.a = t2.a and
> t1.a = t2.b, the planner deduces that t2.a = t2.b. The only rows from
> t2 where a = b are in the first partition. The planner therefore
> estimates that if it just does a Merge Join between the two
> partitioned tables, the Merge Join will stop early. There are actually
> nine rows in t2 where a = b, but the planner's estimate is just three
> rows, so it's understandable that it will very quickly run out of rows
> on the t2 side of the join. Thus, in the non-partitionwise plan, while
> its estimated cost to scan t1 is 44.83 and its estimated cost to scan
> t2 is 5.55, the estimated cost of the join is only 7.99, reflecting
> the fact that it doesn't anticipate having to actually finish the scan
> of t1.
>
> Now, if it does a partitionwise join, it still picks a Merge Join for
> the prt1_p1/prt2_p1 sub-join, and that can still stop early. But for
> the prt1_p2/prt2_p2 and prt1_p3/prt2_p3 joins, it picks hash joins,
> which as far as the planner knows can't stop early, so there's no
> opportunity to get a "discount" on the cost of scanning any of those
> tables. As a result, the estimated cost of this plan ends up being
> 11.53, clearly more than the non-partitionwise estimated cost.
>
> In this case, the planner's methodology doesn't really make a lot of
> sense when you stop to think about it. If the planner is correct that
> the non-partitionwise join will stop early, then the hash joins that
> are chosen in the partitionwise scan will stop early, because the
> non-parallel hashjoin code notices when the hash table built for the
> inner side is empty and bails out without finishing the outer scan.
> But the planner code is understandably reluctant to bet on a table
> being completely empty for costing purposes. Performing a
> non-partitionwise join allows the planner to make an end run around
> this restriction: it can bet on the combined inner table ending up
> with no rows contributed by the second or third child tables without
> actually betting on any relation being completely empty.
>
> Consequently, it doesn't seem like this type of case can account for
> the original report of a massive real-world run time regression. The
> planner's mental gymnastics here cause it to think that a
> non-partitionwise plan will be faster, but as far as I can tell,
> there's no real reason to expect that it actually will be.
100%. I think such cases are common. I once had a dirty hack that 
generated several equivalent formulations of SQL queries (most of them 
not even partitioning related), planned each of them and executed the 
one with the lowest estimated cost. I ran it on my OLAP benchmark back 
then and noticed that the average overall execution time (not counting 
the gazillion times more planning work) suffered from this. If we give 
the planner a lot of options, it has more chances to make bad estimates.
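Reduced to its core, the hack was not much more than this (a rough 
sketch; the real thing generated such rewrites mechanically, and 
prt1/prt2 here are just the regression tables from Robert's example):

-- two provably equivalent formulations of the same question
EXPLAIN SELECT t1.a, t1.c FROM prt1 t1
WHERE t1.a IN (SELECT t2.b FROM prt2 t2);

EXPLAIN SELECT t1.a, t1.c FROM prt1 t1
JOIN (SELECT DISTINCT b FROM prt2) s ON s.b = t1.a;

-- read the top-level cost estimate from each plan and execute only the
-- cheaper formulation

The planner is free to cost the two versions quite differently, even 
though they must return the same rows.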
Quite often the estimates for provably cardinality-equivalent sub-paths 
are vastly different. It would be immensely helpful if we could reason 
about that a bit better, but I have no idea how to do that.
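The mis-estimate in Robert's example is already visible if you run it 
with ANALYZE against the partition_join regression tables and compare 
estimated with actual row counts per node:

EXPLAIN (ANALYZE)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2
WHERE t1.a = t2.a AND t1.a = t2.b ORDER BY t1.a, t2.b;

Per Robert's numbers, the scan of t2 should be estimated at three rows 
where a = b while the executor actually finds nine, and whether the 
early stop of the merge join pays off hinges entirely on estimates of 
that kind.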
Regards
Arne