Quick Links

Re: Pathify RHS unique-ification for semijoin planning

From:	wenhui qiu <qiuwenhuifx(at)gmail(dot)com>
To:	Richard Guo <guofenglinux(at)gmail(dot)com>
Cc:	Alexandra Wang <alexandra(dot)wang(dot)oss(at)gmail(dot)com>, Álvaro Herrera <alvherre(at)kurilemu(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andy Fan <zhihuifan1213(at)163(dot)com>
Subject:	Re: Pathify RHS unique-ification for semijoin planning
Date:	2025-07-31 12:49:38
Message-ID:	CAGjGUALMmJySHX=PdbNooCZ0u-VO6aiDHgodS0YPxMQ1xdn-aw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> This is a very interesting observation. In fact, with the original v5
> patch, you can produce both plans by setting enable_hashagg on and
> off.

> set enable_hashagg to on;
> Incremental Sort (cost=91.95..277.59 rows=2500 width=16)
> (actual time=31.960..147.040 rows=90000.00 loops=1)
>
> set enable_hashagg to off;
> Merge Join (cost=70.14..294.34 rows=2500 width=16)
> (actual time=4.303..71.891 rows=90000.00 loops=1)
>
> This seems to be another case where the planner chooses a suboptimal
> plan due to inaccurate cost estimates.
Agree ,I increased some rows , set enable_hashagg to on and off ,There's no
difference in execution time. The execution plan is the same.

> #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
> ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI
&& \
> bms_equal((sjinfo)->syn_righthand, (rel)->relids))
>
>... and then the check in final_cost_hashjoin() becomes:
>
> if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
> path->jpath.jointype))
> {
> innerbucketsize = 1.0 / virtualbuckets;
> innermcvfreq = 0.0;
> }
>
>Would this be a better approach? Any thoughts?

This approach does indeed more accurately capture the fact that the
relation has been unique-ified, especially in cases where a semi join has
been transformed into an inner join. Compared to the current heuristic
checks in costsize.c that rely on inner_path->rows, this method is more
semantically meaningful and robust.

On Thu, Jul 31, 2025 at 4:58 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:

> On Thu, Jul 31, 2025 at 1:08 PM Richard Guo <guofenglinux(at)gmail(dot)com>
> wrote:
> > On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang
> > <alexandra(dot)wang(dot)oss(at)gmail(dot)com> wrote:
> > > While looking at the code, I also had a question about the following
> > > changes in costsize.c:
> > >
> > > --- a/src/backend/optimizer/path/costsize.c
> > > +++ b/src/backend/optimizer/path/costsize.c
> > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root,
> MergePath *path,
> > > * The whole issue is moot if we are working from a unique-ified outer
> > > * input, or if we know we don't need to mark/restore at all.
> > > */
> > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore)
> > > + if (IsA(outer_path, UniquePath) ||
> > > + IsA(outer_path, AggPath) ||
> > > + path->skip_mark_restore)
> > >
> > > and
> > >
> > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath
> *path,
> > > * because we avoid contaminating the cache with a value that's wrong
> for
> > > * non-unique-ified paths.
> > > */
> > > - if (IsA(inner_path, UniquePath))
> > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath))
> > >
> > > I'm curious why AggPath was added in these two cases.
>
> > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have
> > some special cases when the inner or outer relation has been
> > unique-ified. Previously, it was sufficient to check whether the path
> > was a UniquePath, since both hash-based and sort-based implementations
> > were represented that way. However, with this patch, UniquePath now
> > only represents the sort-based implementation, so we also need to
> > check for AggPath to account for the hash-based case.
>
> BTW, maybe a better way to determine whether a relation has been
> unique-ified is to check that the nominal jointype is JOIN_INNER and
> sjinfo->jointype is JOIN_SEMI, and the relation is exactly the RHS of
> the semijoin. This approach is mentioned in a comment in joinpath.c:
>
> * Path cost estimation code may need to recognize that it's
> * dealing with such a case --- the combination of nominal jointype INNER
> * with sjinfo->jointype == JOIN_SEMI indicates that.
>
> ... but it seems we don't currently apply it in costsize.c.
>
> To be concrete, I'm imagining a check like the following:
>
> #define IS_UNIQUEIFIED_REL(rel, sjinfo, nominal_jointype) \
> ((nominal_jointype) == JOIN_INNER && (sjinfo)->jointype == JOIN_SEMI
> && \
> bms_equal((sjinfo)->syn_righthand, (rel)->relids))
>
> ... and then the check in final_cost_hashjoin() becomes:
>
> if (IS_UNIQUEIFIED_REL(inner_path->parent, extra->sjinfo,
> path->jpath.jointype))
> {
> innerbucketsize = 1.0 / virtualbuckets;
> innermcvfreq = 0.0;
> }
>
> Would this be a better approach? Any thoughts?
>
> Thanks
> Richard
>

In response to

Re: Pathify RHS unique-ification for semijoin planning at 2025-07-31 08:58:28 from Richard Guo

Responses

Re: Pathify RHS unique-ification for semijoin planning at 2025-08-01 01:53:37 from Richard Guo

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2025-07-31 13:08:13	Convert varatt.h macros to static inline functions
Previous Message	Hayato Kuroda (Fujitsu)	2025-07-31 12:00:16	RE: POC: enable logical decoding when wal_level = 'replica' without a server restart