Pathify RHS unique-ification for semijoin planning

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Pathify RHS unique-ification for semijoin planning
Date: 2025-05-22 07:05:55
Message-ID: CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I came across a query where the plan includes some unnecessary Sort
nodes. Here's an example that shows the issue.

create table t(a int, b int);
insert into t select i%100, i from generate_series(1,10000)i;
create index on t(a);
vacuum analyze t;

set enable_hashagg to off;

explain (costs off)
select * from t t1 where t1.a in
(select a from t t2 where a < 10)
order by t1.a;
QUERY PLAN
---------------------------------------------------------------
Merge Join
Merge Cond: (t1.a = t2.a)
-> Index Scan using t_a_idx on t t1
-> Sort
Sort Key: t2.a
-> Unique
-> Sort
Sort Key: t2.a
-> Index Only Scan using t_a_idx on t t2
Index Cond: (a < 10)
(10 rows)

I believe the two Sort nodes are unnecessary.

After some digging, it seems that this is related to one of the
approaches we use to implement semijoins: unique-ifying the RHS and
then doing a regular join. The unique-ification is handled in
create_unique_path(), which considers both hash-based and sort-based
implementations. However, in the case of sort-based implementation,
this function pays no attention to the subpath's pathkeys or the
pathkeys of the resulting output.

In addition to this specific issue, it seems to me that there are
other potential issues in create_unique_path().

* Currently, we only consider the cheapest_total_path of the RHS when
unique-ifying it. I think this may cause us to miss some optimization
opportunities. For example, there might be a path with a better sort
order that isn't the cheapest-total one. Such a path could help avoid
a sort at a higher level, potentially resulting in a cheaper overall
plan.

* In create_unique_path(), we currently rely on heuristics to decide
whether to use a hash-based or sort-based method. I think a better
approach would be to create paths for both methods and let add_path()
determine which one is better, similar to how we handle path selection
in other parts of the planner.

Therefore, I'm thinking that maybe we could create a new RelOptInfo
for the RHS rel to represent its unique-ified version, and then
generate all worthwhile paths for it, similar to how it's done in
create_distinct_paths(). Since this is likely to be called repeatedly
on the same rel, we can cache the new RelOptInfo in the rel struct,
just like how we cache cheapest_unique_path currently.

To be concrete, I'm envisioning something like the following:

if (bms_equal(sjinfo->syn_righthand, rel2->relids) &&
- create_unique_path(root, rel2, rel2->cheapest_total_path,
- sjinfo) != NULL)
+ (rel2_unique = create_unique_rel(root, rel2, sjinfo)) != NULL)

...

- add_paths_to_joinrel(root, joinrel, rel1, rel2,
- JOIN_UNIQUE_INNER, sjinfo,
+ add_paths_to_joinrel(root, joinrel, rel1, rel2_unique,
+ JOIN_INNER, sjinfo,
restrictlist);
- add_paths_to_joinrel(root, joinrel, rel2, rel1,
- JOIN_UNIQUE_OUTER, sjinfo,
+ add_paths_to_joinrel(root, joinrel, rel2_unique, rel1,
+ JOIN_INNER, sjinfo,
restrictlist);

In addition, by changing the join from "rel1" and "rel2" using
JOIN_UNIQUE_OUTER or JOIN_UNIQUE_INNER to a join between "rel1" and
"rel2_unique" using a standard JOIN_INNER, we might be able to get
rid of the JOIN_UNIQUE_OUTER and JOIN_UNIQUE_INNER jointypes. This
could simplify a lot of logic in joinpath.c, where we're increasingly
adding special-case handling for these two jointypes.

One last point, I doubt that the code related to UNIQUE_PATH_NOOP is
reachable in practice. create_unique_path() checks whether the input
rel is an RTE_RELATION or RTE_SUBQUERY and is provably unique, and
creates a UNIQUE_PATH_NOOP UniquePath if so. However, in that case,
the semijoin should have already been simplified to a plain inner join
by analyzejoins.c.

Any thoughts?

Thanks
Richard

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2025-05-22 07:08:46 Re: pg_dump does not dump domain not-null constraint's comments
Previous Message Michael Paquier 2025-05-22 05:54:22 Retiring some encodings?