Quick Links

Re: Pathify RHS unique-ification for semijoin planning

From:	Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>
To:	Richard Guo <guofenglinux(at)gmail(dot)com>
Cc:	Álvaro Herrera <alvherre(at)kurilemu(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andy Fan <zhihuifan1213(at)163(dot)com>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>
Subject:	Re: Pathify RHS unique-ification for semijoin planning
Date:	2025-08-01 14:58:01
Message-ID:	CAK98qZ3qGjCEZAyeOCwQFY-Y4CMkPjN83xkgij1yx76R1=vBvw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Richard,

On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:

> On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
> <alexandra(dot)wang(dot)oss(at)gmail(dot)com> wrote:
> > Thanks for the patch! I applied your patch and played around with it.
>
> Thanks for trying out this patch.
>
> > I have a question about the following test case you added in
> > subselect.sql:
>
> > I was under the impression that we wanted Unique on top of a sorted
> > node for the inner of the SIMI join. However, the expected output uses
> > a HashAgg instead. Is this expected?
>
> Hmm, I don't think we have such expectation that "Sort+Unique" should
> be used for the unique-ification step in this query. This patch
> considers both hash-based and sort-based implementations, and lets the
> cost comparison determine which one is preferable. So I wouldn't be
> surprised if the planner ends up choosing the hash-based
> implementation in the final plan.
>
> > While looking at the code, I also had a question about the following
> > changes in costsize.c:
> >
> > --- a/src/backend/optimizer/path/costsize.c
> > +++ b/src/backend/optimizer/path/costsize.c
> > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath
> *path,
> > * The whole issue is moot if we are working from a unique-ified outer
> > * input, or if we know we don't need to mark/restore at all.
> > */
> > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> > + if (IsA(outer_path, UniquePath) ||
> > + IsA(outer_path, AggPath) ||
> > + path->skip_mark_restore)
> >
> > and
> >
> > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath
> *path,
> > * because we avoid contaminating the cache with a value that's wrong
> for
> > * non-unique-ified paths.
> > */
> > - if (IsA(inner_path, UniquePath))
> > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
> >
> > I'm curious why AggPath was added in these two cases.
>
> Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
> some special cases when the inner or outer relation has been
> unique-ified. Previously, it was sufficient to check whether the path
> was a UniquePath, since both hash-based and sort-based implementations
> were represented that way. However, with this patch, UniquePath now
> only represents the sort-based implementation, so we also need to
> check for AggPath to account for the hash-based case.
>
> > The second diff looks fine to me. However, the first diff in the
> > subselect test happened to be the test that I asked about.
> >
> > Here's the comparison of the two EXPLAIN ANALYZE results:
>
> > While both runs are fast (28ms vs. 17ms), the plan generated by the v5
> > patch is slower in this case.
>
> This is a very interesting observation. In fact, with the original v5
> patch, you can produce both plans by setting enable_hashagg on and
> off.
>
> set enable_hashagg to on;
> Incremental Sort (cost=91.95..277.59 rows=2500 width=16)
> (actual time=31.960..147.040 rows=90000.00 loops=1)
>
> set enable_hashagg to off;
> Merge Join (cost=70.14..294.34 rows=2500 width=16)
> (actual time=4.303..71.891 rows=90000.00 loops=1)
>
> This seems to be another case where the planner chooses a suboptimal
> plan due to inaccurate cost estimates.
>

Thanks for explaining! It makes sense now!

While reviewing the code, the following diff concerns me:

/*
* If the joinrel is parallel-safe, we may be able to consider a partial
- * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the
- * outer path will be partial, and therefore we won't be able to properly
- * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT
- * and JOIN_RIGHT_ANTI, because they can produce false null extended rows.
+ * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and
+ * JOIN_RIGHT_ANTI, because they can produce false null extended rows.
* Also, the resulting path must not be parameterized.
*/
if (joinrel->consider_parallel &&
- save_jointype != JOIN_UNIQUE_OUTER &&
- save_jointype != JOIN_FULL &&
- save_jointype != JOIN_RIGHT &&
- save_jointype != JOIN_RIGHT_ANTI &&
+ jointype != JOIN_FULL &&
+ jointype != JOIN_RIGHT &&
+ jointype != JOIN_RIGHT_ANTI &&
outerrel->partial_pathlist != NIL &&
bms_is_empty(joinrel->lateral_relids))

What has changed that enabled JOIN_UNIQUE_OUTER to handle partial
paths? Or is it no longer possible for the outer path to be partial?

Best,
Alex

In response to

Re: Pathify RHS unique-ification for semijoin planning at 2025-07-31 04:08:06 from Richard Guo

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Zhijie Hou (Fujitsu)	2025-08-01 15:03:26	RE: Add support for specifying tables in pg_createsubscriber.
Previous Message	Tender Wang	2025-08-01 14:45:20	Re: A little cosmetic to convert_VALUES_to_ANY()