Re: pg_plan_advice

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Lukas Fittl <lukas(at)fittl(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Dian Fay <di(at)nmfay(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_plan_advice
Date: 2026-01-15 12:28:30
Message-ID: CAKZiRmznM4cc_N5qQBgEdDj9F5J8=zTJbWm9erZzhp9jP7LMbg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 14, 2026 at 11:11 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Hi Robert,

> On Wed, Jan 14, 2026 at 6:02 AM Jakub Wartak
> <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> > a) q4.sql (please see attached file for repro). More or less: right
> > after import I get a hard failure if the earlier recommended advice is
> > enabled (smells like a bug to me: we shouldn't get any errors even if
> > advice is bad). This can be solved by ANALYZE, but brought up back by
> > truncating pg_statistics
> > ERROR: unique semijoin found for relids (b 3) but not observed during planning
> > STATEMENT: explain (timing off, costs off, settings off, memory off)
>
> Hmm, so the plan tree walker thinks that we did a semijoin between
> lineitem and orders by making lineitem unique on the join column and
> then performing a regular join. That appears to be correct. But
> pgpa_join_path_setup never created a pgpa_join_path_setup for that
> possibility, or created one that doesn't actually match up properly to
> what was found in the plan tree. Can you check whether a
> pgpa_sj_unique_rel gets created in pgpa_join_path_setup, and with what
> contents?

OK, so today, on just barebone v9 (even without any fixes from this $subthread),
I couldn't get it to reproduce right out of the box right on the fresh
cluster. It
appears to another missing piece of the puzzle was to have
max_parallel_workers_per_gather=0 (in addition to TRUNCATING pg_statistic),
because otherwise it did not want to generate advice out of the box that would
generate this specific ERROR.

Maybe I'm big rookie here (OR just dumb), but it took me some time to realize
why we emit SEMIJOIN_UNIQUE() there, clearly the plan without parallelism
has "Nested Loop", not like "Nested Loop Semi Join"
(with max_parallel_workers_per_gather = 2). Yet it emits that and somehow
later the query feature walker->query_features[PGPAQF_SEMIJOIN_UNIQUE]
also is there, so that "unique semijoin found for.." error could be thrown.

As per VERBOSE explain, one can spot this SemiJoin transformation is being
applied (Nested Loop/Inner Unique: true), however sj_unique_rtis is empty (?!):

[..]
NOTICE: jointype=outer pps_NULL?=0
NOTICE: added SEMIJOIN_UNIQUE
NOTICE: pgpa_plan_walker: walking over SEMIJOIN_UNIQUE features: 3,
sj_unique_rtis=<> sj_unique_rels=<>
ERROR: unique semijoin found for relids (b 3) but not observed during planning

Only *after* this, I've realized how hard all of that is reading comments nearby
`typedef struct pgpa_sj_unique_rel`.

Anyway it appears that pgpa_plan_walker()/pgpa_planner_walker() is not
having proper
input information to begin with about SJs? It looks there is just one
single place that
sets pps->sj_unique_rels (lappend() in pgpa_join_path_setup()), but
that's code path is
only being launched when requesting explain is asking for advice:

explain (costs off, plan_advice) SELECT
[..]
NOTICE: jointype=inner pps_NULL?=0
NOTICE: found=0
NOTICE: not a duplicate, appending "(b 3)" to pps->sj_unique_rels
NOTICE: jointype=outer pps_NULL?=0
NOTICE: found=true! (ur->plan_name=(null) bms_ur_relids=3)
NOTICE: found=1
NOTICE: jointype=inner pps_NULL?=0
NOTICE: found=true! (ur->plan_name=(null) bms_ur_relids=3)
NOTICE: found=1
NOTICE: jointype=outer pps_NULL?=0
NOTICE: found=true! (ur->plan_name=(null) bms_ur_relids=3)
NOTICE: found=1
NOTICE: added SEMIJOIN_UNIQUE
WARNING: could not dump unrecognized node type: 0 // ignore?
NOTICE: pgpa_plan_walker: walking over SEMIJOIN_UNIQUE features: 3,
sj_unique_rtis=((b 3)) sj_unique_rels=({})
(+ no error!)

while with basic EXPLAIN (and advices planner/advises touching SJ
transforms), I'm getting:

dbt3=# explain (costs off) SELECT
[..]
NOTICE: jointype=inner pps_NULL?=0
NOTICE: jointype=outer pps_NULL?=0
NOTICE: jointype=inner pps_NULL?=0
NOTICE: jointype=outer pps_NULL?=0
NOTICE: added SEMIJOIN_UNIQUE
NOTICE: pgpa_plan_walker: walking over SEMIJOIN_UNIQUE features: 3,
sj_unique_rtis=<> sj_unique_rels=<>
ERROR: unique semijoin found for relids (b 3) but not observed during planning

So we have started v9 with:
if (pps->generate_advice_string) { -- but that's wrong due to
potential crash in -02 builds + asan complaints

we fixed that above bug with:
if (pps != NULL && pps->generate_advice_string) { -- but that's
wrong due to not initializing SJ for normal explains

so we end up doing simply this?
if (pps != NULL) {

The last one seems to pass all my tests (with already provided fixup
from yesterday), but I'm absolutely not sure if that's the proper way to
address that).

> > a) q8.sql (please see attached file for demo). It is even more
> > bizarre, happens right after import , fixed by ANALYZE, but even
> > TRUNCATING pg_statistic doesnt bring back the problem. Pinpointed that
> > additional pg_clear_relation_stats() triggers the problem back.
>
> I found this one. I now think that
> pgpa_planner_apply_join_path_advice() shouldn't added anything to
> jo_permit_indexes when a join method hint implicitly permits a join
> order. I simplified your test case to this:
>
> set pg_plan_advice.advice = 'JOIN_ORDER(n1 region customer)
> NESTED_LOOP_PLAIN(region)';
> explain (costs off, plan_advice)
> SELECT
> n1.n_name AS nation
> FROM
> customer,
> nation n1,
> region
> WHERE
> c_nationkey = n1.n_nationkey
> AND n1.n_regionkey = r_regionkey
> AND r_name = 'AMERICA';
>
> What was happening here is that when we considered a join between
> {customer, nation} and region, pgpa_planner_apply_join_path_advice()
> said, well, according to the JOIN_ORDER advice, this join order is not
> allowed, which is correct. And, according to the NESTED_LOOP_PLAIN
> advice, this join order is allowed, which is also correct, because
> NESTED_LOOP_PLAIN(region) denies join orders where region is the
> driving table, since those would make it impossible to respect the
> advice, and this join order doesn't do that. Then, it concludes that
> because one piece of advice says the join order is OK and the other
> says it isn't, the advice conflicts. This is where I think it's going
> off the rails: the NESTED_LOOP_PLAIN() advice should only be allowed
> to act as a negative constraint, not a positive one. So what I did is:

Yes! 3 three lines patches seems to help (and causes no other problems to
best of my knowledge). The simplified test case results seem to be only
changing like below, but it really fixes the Q8 NL->HJ.

Supplied Plan Advice:
- JOIN_ORDER(n1 region customer) /* matched, conflicting */
- NESTED_LOOP_PLAIN(region) /* matched, conflicting */
+ JOIN_ORDER(n1 region customer) /* matched */
+ NESTED_LOOP_PLAIN(region) /* matched */

BTW: I have found also that it fixes another (not yet here disclosed bug,
because I've found it just today :): when running without stats, and
with enable_nestloop=OFF (globally) Q5 was failing too due some
sequence of HJ/Parallel HJ
and slightly different Hash Cond) - nvm, wIth this patch it does NOT misbehave.

Maybe it would be good to include that into tests inside 0005, for that
small tiny query above?

> > 3b) XXX - marker:I was looking for a solution and apparently cfbot
> > farm has those options, so they should be testing it anyway. And this
> > brings me to a fact, that it maybe could be detected by cfbot, however
> > the $thread is not registered so cfbot had no chance to see what's
> > more there? (I'm mainly thinking about any cross-platform issues, if
> > any).
>
> I mean, there is https://commitfest.postgresql.org/patch/6184/

Whoops, mea culpa, I was looking for PG-4 commitfest for some reason
(so I should
be looking on https://cfbot.cputube.org/next.html not just under "/"
[main] one).

-J.

Attachment Content-Type Size
q4_errors_hard_without_stats_and_no_parallel_v2.txt text/plain 24.4 KB
q4_debugging_patch_aid.txt text/plain 4.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message zengman 2026-01-15 13:02:42 [PATCH] backup: Fix trivial typo and error message issues
Previous Message Bertrand Drouvot 2026-01-15 12:18:13 Re: Flush some statistics within running transactions