Reinitialize stack base after fork (for the benefit of rr)?

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Reinitialize stack base after fork (for the benefit of rr)?
Date: 2020-03-27 18:22:17
Message-ID: 20200327182217.ubrrl32lyfhxfwk5@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I've found rr [1] very useful to debug issues in postgres. The ability
to hit a bug, and then e.g. identify a pointer with problematic
contents, set a watchpoint on its contents, and reverse-continue is
extremely powerful.

Unfortunately, when running postgres, it currently occasionally triggers
spurious stack-too-deep errors. That turns out to be because it has to
use an alternative stack in some corner cases (IIUC when a signal
arrives while already in a signal handler). That corner case can
unfortunately be hit from within postmaster, and at least can lead to
sigusr1_handler() being called with an alternative stack set.

Unfortunately that means that processes that postmaster fork()s while
using that alternative stack will continue their live using that
alternative stack. Which then subsequently means that our stack depth
checks always trigger.

I've not seen this trigger for normal backends (which makes sense,
they're not started from a signal handler), but for bgworkers. In
particular parallel workers are prone to hit the issue.

I've locally fixed the issue by computing the stack base address anew
for postmaster children. Currently in InitPostmasterChild().

I'd like to get that change upstream. The rr hackers have fixed a number
of other issues that could be hit with postgres, but they couldn't see a
good way to address the potential for a different signal stack in this
edge case. And it doesn't seem crazy to me to compute the stack base
again in postmaster children: It's cheap enough and it's extremely
unlikely that postmaster uses up a crazy amount of stack.

I also don't find it too crazy to guard against forks in signal handlers
leading to a different stack base address. It's a pretty odd thing to
do.

Tom, while imo not a fix of the right magnitude here: Are you planning /
hoping to work again on your postmaster latch patch? I think it'd be
really good if we could restructure the postmaster code to do far far
less in signal handlers. And the postmaster latch patch seems like a big
step in that direction. I think we mostly dropped it due to the release
schedule last time round?

Greetings,

Andres Freund

[1] https://github.com/mozilla/rr/

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hamid Akhtar 2020-03-27 18:22:18 Re: BUG #16040: PL/PGSQL RETURN QUERY statement never uses a parallel plan
Previous Message Robert Haas 2020-03-27 18:13:17 Re: backup manifests