Re: Strange failure on mamba

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Strange failure on mamba
Date: 2022-12-01 00:19:57
Message-ID: 20221201001957.htscqgtd3fftnuf4@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-30 18:33:06 -0500, Tom Lane wrote:
> Also, I dug into my stuck processes some more, and I have to take
> back the claim that this is happening later than postmaster startup.
> All the stuck children are ones that either are launched on request
> from the startup process, or are launched as soon as we get the
> termination report for the startup process. So it's plausible that
> the problem is happening during the postmaster's first select()
> wait. I then got dirty with the assembly code, and found out that
> where the stack trace stops is an attempt to resolve this call:
>
> 0xfd6f7a48 <__select50+76>: bl 0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>
>
> which is inside libpthread.so and is trying to call something in libc.so.
> So we successfully got to the select() function from PostmasterMain, but
> that has a non-prelinked call to someplace else, and kaboom.

This whole area just seems quite broken in NetBSD :(.

We're clearly doing stuff in a signal handler that we really shouldn't, but
not being able to call any function implemented in libc, even an
async-signal-safe one (as e.g. select() is), means signals are basically not
usable. AFAICT signals are *never* safe on NetBSD as long as a signal handler
contains even a single external function call.
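
To make that constraint concrete, here is a minimal sketch (not PostgreSQL
code) of about the only kind of handler that contains no external function
call at all: it just stores to a volatile sig_atomic_t flag. The names
got_signal, flag_only_handler and install_flag_only_handler are made up for
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag; the handler does nothing except store to it. */
static volatile sig_atomic_t got_signal = 0;

/* No function calls here, so nothing needs to be resolved lazily. */
static void
flag_only_handler(int signo)
{
    (void) signo;
    got_signal = 1;
}

/* Installs the handler from normal (non-handler) context. Returns 0 on success. */
static int
install_flag_only_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = flag_only_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

Something has to poll got_signal later, which is exactly why real handlers
usually also want a wakeup call such as write() or kill().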

> I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
> I've verified that that causes the call inside __select50
> to get resolved before we reach main(), so I'm hopeful that
> it will cure the issue. But it'll probably be a few weeks
> before we can be sure.
>
> Don't have a good idea about a non-band-aid fix.

It's also a band aid, but perhaps a bit more reliable: We could link
statically to libc and libpthread.

Another approach could be to iterate over the loaded shared libraries during
postmaster startup and force symbols to be resolved. IIRC there are functions
that would allow that. But it seems like a lot of work to work around an OS bug.
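
A rough sketch of only the "iterate over the loaded shared libraries" half,
using dl_iterate_phdr(), which exists on glibc and the BSDs. The step that
would actually force resolution is deliberately left out, because it is
platform-specific and not verified here. report_loaded_object and
count_loaded_objects are hypothetical names.

```c
#define _GNU_SOURCE             /* glibc needs this for dl_iterate_phdr() */
#include <link.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Callback invoked once per loaded object. A real workaround would force
 * symbol resolution for the object here; this sketch only counts and
 * reports it.
 */
static int
report_loaded_object(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;

    (void) size;
    (*count)++;
    fprintf(stderr, "loaded object %d: %s\n", *count,
            (info->dlpi_name && info->dlpi_name[0])
            ? info->dlpi_name : "(main program or unnamed)");
    return 0;                   /* zero means: keep iterating */
}

/* Walks all objects currently mapped into the process; returns how many. */
static int
count_loaded_objects(void)
{
    int count = 0;

    dl_iterate_phdr(report_loaded_object, &count);
    return count;
}
```

Calling count_loaded_objects() once early in startup shows which objects
would need to be handled.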

> Perhaps we should revert 8acd8f869 altogether, but then what?

FWIW, I think we should consider using those flags everywhere for the backend:
they make copy-on-write more effective and decrease connection overhead a
bit, because otherwise each backend process does the same symbol resolutions
again and again, dirtying memory post-fork.

> Even if somebody comes up with a rewrite to avoid doing interesting stuff in
> the postmaster's signal handlers, we surely wouldn't risk back-patching it.

Would that actually fix anything, given NetBSD's brokenness? If we used a
latch-like mechanism, the signal handler would still use functions in libc. So
postmaster could deadlock, at least during the first execution of a signal
handler? So I think 8acd8f869 continues to be important...
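
For illustration, a minimal self-pipe sketch of what a latch-like handler
tends to look like. This is not PostgreSQL's actual latch code, and
wakeup_pipe, latch_is_set, latch_handler, init_latch and drain_latch are
invented names. The point is that even this stripped-down handler still
calls write() in libc.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wakeup_pipe[2] = {-1, -1};
static volatile sig_atomic_t latch_is_set = 0;

static void
latch_handler(int signo)
{
    /* On many platforms even "errno" expands to a libc function call. */
    int     save_errno = errno;
    ssize_t rc;

    (void) signo;
    latch_is_set = 1;
    /* Async-signal-safe, but still an external call into libc. */
    rc = write(wakeup_pipe[1], "x", 1);
    (void) rc;
    errno = save_errno;
}

/* Creates the non-blocking pipe and installs the handler. Returns 0 on success. */
static int
init_latch(int signo)
{
    struct sigaction sa;

    if (pipe(wakeup_pipe) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(wakeup_pipe[i], F_GETFL);

        if (flags < 0 || fcntl(wakeup_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = latch_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Empties the pipe and resets the flag; returns the number of bytes drained. */
static int
drain_latch(void)
{
    char    buf[16];
    int     drained = 0;
    ssize_t n;

    while ((n = read(wakeup_pipe[0], buf, sizeof(buf))) > 0)
        drained += (int) n;
    latch_is_set = 0;
    return drained;
}
```

A main loop would wait on wakeup_pipe[0] and call drain_latch() on wakeup.
If write() is still lazily bound when the first signal arrives, the handler
has to go through the dynamic linker to reach it.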

Greetings,

Andres Freund
