ubsan

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: ubsan
Date: 2022-03-23 17:35:37
Message-ID: 20220323173537.ll7klrglnp4gn2um@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I tried to run postgres with ubsan to debug something. I ran into two main
issues:

1) Despite Tom's recent efforts in [1], I see two ubsan failures in
HEAD.

These are easy enough to fix as per the attached, although the fix for the
GetConfigOptionByNum() isn't great - we should probably pass a nulls array
to GetConfigOptionByNum() as well, but that'd have a bunch of followup
changes. So I'm inclined to go with the minimal for now.

2) When debugging issues I got very confused by the fact that *sometimes*
UBSAN_OPTIONS takes effect and sometimes not. I was trying to have ubsan
generate backtraces as well as a coredump with
UBSAN_OPTIONS="print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2"

After a lot of debugging I figured out that the options took effect in
postmaster, but not in any children. Which in turn is because
set_ps_display() breaks /proc/$pid/environ - it's empty in all postmaster
children for me.

The sanitizer library use /proc/$pid/environ (from [2] to [3]), because
they don't want to use getenv() because it wants to work without libc
(whether that's the right way, i have my doubts). When just using
undefined and alignment sanitizers, the sanitizer library is only
initialized when the first error occurs, by which time we've often already
called set_ps_display().

Note that ps_status.c breaks /proc/$pid/environ even if the
update_process_title GUC is set to false, because init_ps_display() ignores
that. Yes, that confused me for a while last night.

The reason that /proc/$pid/environ is borken is fairly obvious: We
overwrite it in set_ps_display() in the PS_USE_CLOBBER_ARGV case. The
kernel just looks at the range set up originally, which we'll overwrite
with zeroes.

We could try telling the kernel about the new location of environ using
prctl() PR_SET_MM_ENV_START/END but the restrictions around that sound
problematic.

I've included a hacky workaround: Define __ubsan_default_options, a weak
symbol libsanitizer uses to get defaults from the application, and return
getenv("UBSAN_OPTIONS"). But only if main already was reached, so that we
don't end up relying on a not-yet-working getenv().

This also lead me to finally figure out why /proc/$pid/cmdline of postgres
children includes a lot of NULL bytes. I'd noticed this because tools like
pidstat -l started to print long long status strings at some point, before
it got fixed on the pidstat side.
The way we overwrite doesn't trigger this special case in the kernel:
https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L296
therefore the original size of the commandline arguments are printed, with
a lot of boring zeroes.

A test run of this in ci, with the guc.c intentionally reintroduced, shows the
failure both via core dump
https://api.cirrus-ci.com/v1/task/6543164315009024/logs/cores.log
and
in postmaster's log: https://api.cirrus-ci.com/v1/artifact/task/6543164315009024/log/src/test/regress/log/postmaster.log

although that part is a bit annoying to read, because the error is
interspersed with other log messages:

guc.c:9801:15: runtime error: null pointer passed as argument 2, which is declared to never be null
==19253==Using libbacktrace symbolizer.
2022-03-23 17:20:35.412 UTC [19258][client backend] [pg_regress/alter_operator][14/429:0] ERROR: must be owner of operator ===
...
2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user2;
#0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801
#1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statistics object alt_stat3
...
2022-03-23 17:20:35.601 UTC [19254][client backend] [pg_regress/alter_generic][10/1569:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user2;
#0 0x5626562b2ab4 in GetConfigOptionByNum /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:9801
#1 0x5626562b3fd5 in show_all_settings /tmp/cirrus-ci-build/src/backend/utils/misc/guc.c:10137
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] ERROR: must be owner of statistics object alt_stat3
2022-03-23 17:20:35.604 UTC [19254][client backend] [pg_regress/alter_generic][10/1576:0] STATEMENT: ALTER STATISTICS alt_stat3 RENAME TO alt_stat4;
#2 0x562655c0ea86 in ExecMakeTableFunctionResult /tmp/cirrus-ci-build/src/backend/executor/execSRF.c:234
#3 0x562655c3f8be in FunctionNext /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:95
2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] ERROR: must be owner of statistics object alt_stat3
2022-03-23 17:20:35.605 UTC [19254][client backend] [pg_regress/alter_generic][10/1578:0] STATEMENT: ALTER STATISTICS alt_stat3 OWNER TO regress_alter_generic_user2;
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] ERROR: must be member of role "regress_alter_generic_user3"
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1579:0] STATEMENT: ALTER STATISTICS alt_stat2 OWNER TO regress_alter_generic_user3;
#4 0x562655c10175 in ExecScanFetch /tmp/cirrus-ci-build/src/backend/executor/execScan.c:133
#5 0x562655c10653 in ExecScan /tmp/cirrus-ci-build/src/backend/executor/execScan.c:199
#6 0x562655c3f643 in ExecFunctionScan /tmp/cirrus-ci-build/src/backend/executor/nodeFunctionscan.c:270
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] ERROR: must be owner of statistics object alt_stat3
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1580:0] STATEMENT: ALTER STATISTICS alt_stat3 SET SCHEMA alt_nsp2;
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] ERROR: statistics object "alt_stat2" already exists in schema "alt_nsp2"
2022-03-23 17:20:35.606 UTC [19254][client backend] [pg_regress/alter_generic][10/1581:0] STATEMENT: ALTER STATISTICS alt_stat2 SET SCHEMA alt_nsp2;
#7 0x562655c09bc9 in ExecProcNodeFirst /tmp/cirrus-ci-build/src/backend/executor/execProcnode.c:463
#8 0x562655bf7580 in ExecProcNode ../../../src/include/executor/executor.h:259
#9 0x562655bf7580 in ExecutePlan /tmp/cirrus-ci-build/src/backend/executor/execMain.c:1633
#10 0x562655bf78b9 in standard_ExecutorRun /tmp/cirrus-ci-build/src/backend/executor/execMain.c:362

Greetings,

Andres Freund

[1] https://postgr.es/m/CALNJ-vT9r0DSsAOw9OXVJFxLENoVS_68kJ5x0p44atoYH+H4dg@mail.gmail.com
[2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/ubsan/ubsan_flags.cpp;h=9a66bd37518b3a0606049b761ffdd7ddf3c3c714;hb=refs/heads/master#l68
[2] https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libsanitizer/sanitizer_common/sanitizer_linux.cpp;h=aa59d9718ca89cc554bdf677df3e64ddd233ca59;hb=refs/heads/master#l559

Attachment Content-Type Size
v3-0001-configure-check-for-dlsym-not-just-dlopen.patch text/x-diff 2.8 KB
v3-0002-Hack-up-compatibility-between-ubsan-and-ps_status.patch text/x-diff 1.3 KB
v3-0003-Fix-two-ubsan-violations.patch text/x-diff 1.4 KB
v3-0004-ci-use-fsanitize-undefined-alignment-in-linux-tas.patch text/x-diff 1.9 KB

Responses

  • Re: ubsan at 2022-03-23 17:54:50 from Tom Lane

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-03-23 17:36:29 Re: Parameter for planner estimate of recursive queries
Previous Message alvherre@alvh.no-ip.org 2022-03-23 17:24:49 Re: [BUG] Panic due to incorrect missingContrecPtr after promotion