| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
| Cc: | Lukas Fittl <lukas(at)fittl(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Dian Fay <di(at)nmfay(dot)com>, Matheus Alcantara <matheusssilv97(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: pg_plan_advice |
| Date: | 2026-01-14 11:02:11 |
| Message-ID: | CAKZiRmz4zrZfx_wgzkTD88HzKiA22HKPUnubR3is0bjKra6Utg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Mon, Jan 12, 2026 at 6:13 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
[..]
> Here's v9. Changes:
OK, so I was thinking v9 is going to be pretty slick ride, however got
some issues inside 0005 :
1) with cassert/debug builds (meson setup build --prefix=/usr/pgsql19
--buildtype=debug -Dcassert=true) I've started getting nonalways
non-deterministic failures for TPC-H Q4 and Q8. That was a somewhat
self-dissolving error (happened fresh after data load when the testing
suite was launched rapidly afterwards), so I've tracked it down to the
autoanalyze gathering stats. So if load the the whole suite of data,
do not run analyze and stay with autovacuum=off (to avoid autoanalyze)
and run the testing query suite, it identified this failure to force
NL instead of HJ in q8 but also *uncovered* runtime error in q4 that
happens with no stats
a) q4.sql (please see attached file for repro). More or less: right
after import I get a hard failure if the earlier recommended advice is
enabled (smells like a bug to me: we shouldn't get any errors even if
advice is bad). This can be solved by ANALYZE, but brought up back by
truncating pg_statistics
ERROR: unique semijoin found for relids (b 3) but not observed during planning
STATEMENT: explain (timing off, costs off, settings off, memory off)
a) q8.sql (please see attached file for demo). It is even more
bizarre, happens right after import , fixed by ANALYZE, but even
TRUNCATING pg_statistic doesnt bring back the problem. Pinpointed that
additional pg_clear_relation_stats() triggers the problem back.
2) Somewhat in default buildtype debugoptimized (plain "meson setup
build --prefix=/usr/pgsql19") I'm getting crashes with v9 in
contrib/pg_plan_advice/pgpa_planner.c:pgpa_join_path_setup line 460,
full stack trace attached.
2026-01-14 10:54:04.718 CET [97138] LOG: client backend (PID 97408)
was terminated by signal 11: Segmentation fault
2026-01-14 10:54:04.718 CET [97138] DETAIL: Failed process was
running: explain (timing off, costs off, settings off, memory off)
SELECT
s_name,
s_address
[..]
AND s_nationkey = n_nationkey
AND n_name = 'CANADA'
ORDER BY
s_name;
2026-01-14 10:54:04.718 CET [97138] LOG: terminating any other active
server processes
To me it looks like "pps" is NULL and hits "if
(pps->generate_advice_string)" because
GetPlannerGlobalExtensionState() returns NULL because
root->glob->extension_state_allocated is 0 (while planner_extension_id
is also 0). Crash is only happening for q20 and q4, till I've tried
the below fixup which seems to solve it (?) - it's just based on the
fact that all other uses of GetPlannerGlobalExtensionState() seem to
check for NULL:
- if (pps->generate_advice_string)
+ if (pps != NULL && pps->generate_advice_string)
3) Also so I went ahead runnning the full suite (without and with
ANALYZE statistics) with asan, so with CFLAGS="-O2 -g -ggdb
-fno-sanitize-recover=all -fsanitize=address" and
ASAN_OPTIONS=detect_leaks=0:abort_on_error=1:print_stacktrace=1:disable_coredump=0:strict_string_checks=1:check_initialization_order=1:strict_init_order=1:detect_stack_use_after_return=0
(the last option seem to be critical to avoid hitting max_stack_depth
issues on my gcc - XXX marker here). So it did catch previous issue,
e.g:
[..all queries running fine..]
q19.sql
q1.sql
q20.sql
AddressSanitizer:DEADLYSIGNAL
=================================================================
==153072==ERROR: AddressSanitizer: SEGV on unknown address
0x000000000008 (pc 0x7de84f7822cc bp 0x7ffda327e420 sp 0x7ffda327e130
T0)
==153072==The signal is caused by a READ memory access.
==153072==Hint: address points to the zero page.
#0 0x7de84f7822cc in pgpa_join_path_setup
../contrib/pg_plan_advice/pgpa_planner.c:460
#1 0x64bff4d5966d in add_paths_to_joinrel
../src/backend/optimizer/path/joinpath.c:180
#2 0x64bff4d60a8a in populate_joinrel_with_paths
../src/backend/optimizer/path/joinrels.c:1197
#3 0x64bff4d63377 in make_join_rel
../src/backend/optimizer/path/joinrels.c:774
[..]
but with the above fixup all seems to clean in multiple scenarios
(stats, no stats) and including basic "ninja test".
3b) XXX - marker:I was looking for a solution and apparently cfbot
farm has those options, so they should be testing it anyway. And this
brings me to a fact, that it maybe could be detected by cfbot, however
the $thread is not registered so cfbot had no chance to see what's
more there? (I'm mainly thinking about any cross-platform issues, if
any).
-J.
| Attachment | Content-Type | Size |
|---|---|---|
| q4_errors_hard_without_stats.txt | text/plain | 7.2 KB |
| q8_nl_instead_of_hj.txt | text/plain | 14.0 KB |
| pgpa_join_path_setup_crash.txt | text/plain | 14.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Jakub Wartak | 2026-01-14 11:14:35 | Re: [PATCH] check kernel version for io_method |
| Previous Message | Hannu Krosing | 2026-01-14 10:52:54 | Re: Patch: dumping tables data in multiple chunks in pg_dump |