Re: Pathify RHS unique-ification for semijoin planning

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
Cc: Álvaro Herrera <alvherre(at)kurilemu(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andy Fan <zhihuifan1213(at)163(dot)com>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>
Subject: Re: Pathify RHS unique-ification for semijoin planning
Date: 2025-07-31 04:08:06
Message-ID: CAMbWs4-yHUZypwzLaX1Y4k1x+nvQcvfZGYFrm5eFAO7NaSuSbA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
<alexandra(dot)wang(dot)oss(at)gmail(dot)com> wrote:
> Thanks for the patch! I applied your patch and played around with it.

Thanks for trying out this patch.

> I have a question about the following test case you added in
> subselect.sql:

> I was under the impression that we wanted Unique on top of a sorted
> node for the inner of the SIMI join. However, the expected output uses
> a HashAgg instead. Is this expected?

Hmm, I don't think we have such expectation that "Sort+Unique" should
be used for the unique-ification step in this query. This patch
considers both hash-based and sort-based implementations, and lets the
cost comparison determine which one is preferable. So I wouldn't be
surprised if the planner ends up choosing the hash-based
implementation in the final plan.

> While looking at the code, I also had a question about the following
> changes in costsize.c:
>
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
> * The whole issue is moot if we are working from a unique-ified outer
> * input, or if we know we don't need to mark/restore at all.
> */
> - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> + if (IsA(outer_path, UniquePath) ||
> + IsA(outer_path, AggPath) ||
> + path->skip_mark_restore)
>
> and
>
> @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
> * because we avoid contaminating the cache with a value that's wrong for
> * non-unique-ified paths.
> */
> - if (IsA(inner_path, UniquePath))
> + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
>
> I'm curious why AggPath was added in these two cases.

Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
some special cases when the inner or outer relation has been
unique-ified. Previously, it was sufficient to check whether the path
was a UniquePath, since both hash-based and sort-based implementations
were represented that way. However, with this patch, UniquePath now
only represents the sort-based implementation, so we also need to
check for AggPath to account for the hash-based case.

> The second diff looks fine to me. However, the first diff in the
> subselect test happened to be the test that I asked about.
>
> Here's the comparison of the two EXPLAIN ANALYZE results:

> While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> patch is slower in this case.

This is a very interesting observation. In fact, with the original v5
patch, you can produce both plans by setting enable_hashagg on and
off.

set enable_hashagg to on;
Incremental Sort (cost=91.95..277.59 rows=2500 width=16)
(actual time=31.960..147.040 rows=90000.00 loops=1)

set enable_hashagg to off;
Merge Join (cost=70.14..294.34 rows=2500 width=16)
(actual time=4.303..71.891 rows=90000.00 loops=1)

This seems to be another case where the planner chooses a suboptimal
plan due to inaccurate cost estimates.

Thanks
Richard

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2025-07-31 05:45:23 Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Previous Message Michael Paquier 2025-07-31 04:01:20 Re: track generic and custom plans in pg_stat_statements