Re: Pathify RHS unique-ification for semijoin planning

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pathify RHS unique-ification for semijoin planning
Date: 2025-07-01 02:57:44
Message-ID: CAMbWs49+7XSEEBdP3Lyu34FUjp1jYLd3k7vfPSqg-GzJZ5H8Lw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 3, 2025 at 4:52 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> Here is an updated version of the patch, which is ready for review.
> I've fixed a cost estimation issue, improved some comments, and added
> a commit message. Nothing essential has changed.

This patch does not apply anymore, and here is a new rebase.

It may be argued that this patch introduces additional planning
overhead by considering multiple unique-ification paths for the RHS.
While that is true to some extent, I don't think this is a problem.
Please bear with me a moment.

* The additional path generation only occurs in specific semijoin
cases where one input rel is exactly the RHS. Queries without such
semijoins are not affected.

* This patch only considers the cheapest total path and presorted
paths from the original RHS. These are typically few in number, and
each has a high likelihood of contributing to a lower overall cost for
the final plan. I think the cost-benefit trade-off is worthwhile.

* This patch follows the convention in joinpath.c of exploring
alternative join input paths, rather than introducing novel overhead.
For example, when planning (A SEMIJOIN B), the planner considers
multiple paths from B, including its cheapest total path and any paths
with useful sort orders. There is no clear reason why, in the
analogous case of (A INNERJOIN unique-ified(B)), we should restrict
ourselves to only one path from the unique-ified RHS.

* On the other hand, if we insist on considering only a single path
from the unique-ified RHS, we face a dilemma when the hash-based
implementation has a cheaper total cost, but the sort-based
implementation has a better sort order. In such cases, what should
our selection criteria be? Currently, create_unique_path() simply
compares total_cost to choose the cheaper one (even without applying a
fuzz factor), and ignores sort order entirely. I don't think this
approach makes sense. It is also inconsistent with the general
pathification framework, where we rely on add_path() to retain the
best path or set of paths based on cost and other metrics, rather than
using such simple heuristics.

Another point I'd like to mention is that this patch removes the
UNIQUE_PATH_NOOP related code along the way, because I think it's dead
code. If the RHS rel is provably unique, the semijoin should have
already been simplified to a plain inner join by analyzejoins.c.
However, I might be overlooking something, and I'd appreciate any
feedback or corrections.

Thanks
Richard

Attachment Content-Type Size
v3-0001-Pathify-RHS-unique-ification-for-semijoin-plannin.patch application/octet-stream 105.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2025-07-01 03:17:54 Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Previous Message Etsuro Fujita 2025-07-01 02:55:29 Problem with transition tables on partitioned tables with foreign-table partitions