Re: Strange failure on mamba

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange failure on mamba
Date: 2022-11-30 01:44:34
Message-ID: 1254382.1669772674@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> mamba has been showing intermittent failures in various replication
>>> tests since day one.

>> I wonder if it's a runtime variant of the other problem. We do
>> load_file("libpqwalreceiver", false) before unblocking signals but
>> maybe don't resolve the symbols until calling them, or something like
>> that...

> Yeah, that or some other NetBSD bug could be the explanation, too.
> Without a stack trace it's hard to have any confidence about it,
> but I've been unable to reproduce the problem outside the buildfarm.

Thanks to commit 51b5834cd I've now been able to capture some info
from mamba's last couple of failures [1][2]. Sure enough, what is
happening is that postmaster children are getting stuck in recursive
rtld symbol resolution. A couple of the stack traces I collected are

#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
#3 0x01e54ed0 in internal_load_library (
libname=libname(at)entry=0xfd74cc88 "/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so") at dfmgr.c:239
#4 0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156
#5 0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
#6 0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype(at)entry=WalReceiverProcess) at auxprocess.c:161
#7 0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310
#8 0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
#9 MaybeStartWalReceiver () at postmaster.c:5468
#10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
#11 <signal handler called>
#12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee4ba4 in _rtld_exit () from /usr/libexec/ld.elf_so
#3 0xfd54ea74 in __cxa_finalize () from /usr/lib/libc.so.12
#4 0xfd54e354 in exit () from /usr/lib/libc.so.12
#5 0x01c963c0 in proc_exit (code=code(at)entry=0) at ipc.c:152
#6 0x01c056e4 in AutoVacLauncherShutdown () at autovacuum.c:853
#7 0x01c071dc in AutoVacLauncherMain (argv=0x0, argc=0) at autovacuum.c:800
#8 0x01c07694 in StartAutoVacLauncher () at autovacuum.c:416
#9 0x01c11d3c in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3038
#10 <signal handler called>
#11 0xfdee6f64 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#12 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#13 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#14 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#15 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#16 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

which is pretty much just the same thing we were seeing before
commit 8acd8f869 :-(

Now, we certainly cannot think that these are occurring early in
postmaster startup. In the wake of 8acd8f869, we should expect
that there's no further need to call rtld_bind at all in the
postmaster, but seemingly that's not so. It's very frustrating
that the backtrace stops where it does :-(. It's also strange
that we're apparently running with signals enabled whereever
it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so
that one postmaster signal handler is interrupting another?

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-24%2021%3A45%3A29
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-29%2020%3A50%3A36

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2022-11-30 01:51:31 Re: New docs chapter on Transaction Management and related changes
Previous Message Melanie Plageman 2022-11-30 01:12:47 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)