Re: Strange failure on mamba

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange failure on mamba
Date: 2022-11-30 23:33:06
Message-ID: 1467663.1669851186@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-11-30 00:55:42 -0500, Tom Lane wrote:
>> Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
>> it should have an effect on NetBSD?

> I'm not at all sure it does, but I did see it listed in
> https://man.netbsd.org/ld.elf_so.1
> LD_BIND_NOW If defined immediate binding of Procedure Link Table
> (PLT) entries is performed instead of the default lazy
> method.

I checked the source code, and learned that (1) yes, rtld does pay
attention to this, and (2) the documentation lies: it has to be not
only defined, but nonempty, to get any effect.

Also, I dug into my stuck processes some more, and I have to take
back the claim that this is happening later than postmaster startup.
All the stuck children are ones that either are launched on request
from the startup process, or are launched as soon as we get the
termination report for the startup process. So it's plausible that
the problem is happening during the postmaster's first select()
wait. I then got dirty with the assembly code, and found out that
where the stack trace stops is an attempt to resolve this call:

0xfd6f7a48 <__select50+76>: bl 0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>

which is inside libpthread.so and is trying to call something in libc.so.
So we successfully got to the select() function from PostmasterMain, but
that has a non-prelinked call to someplace else, and kaboom.

In short, looks like Andres' theory is right. It means that 8acd8f869
didn't actually fix anything, though it reduced the probability of the
failure by reducing the number of vulnerable PLT-indirect calls.

I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
I've verified that that causes the call inside __select50
to get resolved before we reach main(), so I'm hopeful that
it will cure the issue. But it'll probably be a few weeks
before we can be sure.

Don't have a good idea about a non-band-aid fix. Perhaps we
should revert 8acd8f869 altogether, but then what? Even if
somebody comes up with a rewrite to avoid doing interesting
stuff in the postmaster's signal handlers, we surely wouldn't
risk back-patching it.

It's possible that doing nothing is okay, at least in the
short term. It's probably nigh impossible to hit this
issue on modern multi-CPU hardware. Or perhaps we could revive
the idea of having postmaster.c do one dummy select() call
before it unblocks signals.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-11-30 23:36:47 Re: Non-decimal integer literals
Previous Message David Rowley 2022-11-30 23:11:56 Re: Non-decimal integer literals