Re: Can a child process detect postmaster death when in pg_usleep?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Can a child process detect postmaster death when in pg_usleep?
Date: 2021-04-14 23:58:04
Message-ID: CA+hUKGKBycea2RuabeY4xAt9OERo9vq47x7iU=PwYYJO2Aawcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Bharath,

On Thu, Apr 15, 2021 at 2:06 AM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> 1) Is it really harmful to use pg_usleep in a postmaster child process
> as it doesn't let the child process detect postmaster death?

Yeah, that's a bad idea. Any long-term waiting (including short waits
in a loop) should ideally be done with the latch infrastructure.

One interesting and unusual case is recovery: it can run for a very
long time without reaching a waiting primitive of any kind (other than
LWLock, which doesn't count), because it can be busy applying records
for hours at a time. In that case, we take special measures and
explicitly check if the postmaster is dead in the redo loop. In
theory, you could do the same in any loop containing pg_usleep() (we
used to have several loops doing that, especially around replication
code), but it'd be better to use the existing wait-event-multiplexing
technology we have, and keep improving that.

Some people have argued that long running queries should *also* react
faster when the PM exits, a bit like recovery ... which leads to the
next point...

> 2) Can pg_usleep() always detect signals? I see the caution in the
> pg_usleep function definition in pgsleep.c, saying the signal handling
> is platform dependent. We have code blocks like below in the code. Do
> we actually process interrupts before going to sleep with pg_usleep()?
> while/for loop
> {
> ......
> ......
> CHECK_FOR_INTERRUPTS();
> pg_usleep();
> }
> and
> if (PostAuthDelay)
> pg_usleep();

CHECK_FOR_INTERRUPTS() has nothing to do with postmaster death
detection, currently, so that'd be for dealing with interrupts, not
for that. Also, there would be a race: a signal on its own isn't
enough on systems where we have them and where select() is guaranteed
to wake up, because the signal might arrive between CFI() and
pg_usleep(100 years). latch.c knows how to void such problems.

There may be an argument that CFI() *should* be a potential
postmaster-death-exit point, instead of having WaitLatch() (or its
caller) handle it directly, but it's complicated. At the time the
postmaster pipe system was invented we didn't have a signals for this
so it wasn't even a candidate for treatment as an "interrupt". On
systems that have postmaster death signals today (Linux + FreeBSD, but
I suspect we can extend this to every Unix we support, see CF #3066,
and a solution for Windows has been mentioned too), clearly the signal
handler could set a new interrupt flag PostmasterLost +
InterruptPending, and then CHECK_FOR_INTERRUPTS() could see it and
exit. The argument against this is that exiting isn't always the
right thing! In a couple of places, we do something special, such as
printing a special error message (examples: sync rep and the main FEBE
client read). Look for WL_POSTMASTER_DEATH (as opposed to
WL_EXIT_ON_PM_DEATH). So I guess you'd need to reverse those
decisions and standardise on "exit immediately, no message", or
invented a way to suppress that behaviour in code regions.

> 3) Is it intentional to use pg_usleep in some places in the code? If
> yes, what are they? At least, I see one place where it's intentional
> in the wait_pid function which is used while running the regression
> tests.

There are plenty of places that do a short sleep for various reasons,
more like a deliberate stall or backoff or auth thing, and it's
probably OK if they're shortish and not really a condition polling
loop with an obvious latch/CV-based replacement. Note also that
LWLock waits are similar.

> 4) Are there any places where we need to replace pg_usleep with
> WaitLatch/equivalent of pg_sleep to detect the postmaster death
> properly?

We definitely have replaced a lot of sleeps with latch.c primitives
over the past few years, since we got WL_EXIT_ON_PM_DEATH and
condition variables. There may be many more to improve... You
mentioned autovacuum... yeah, Stephen fixed one of these with commit
4753ef37, but yeah it's not great to have those others in there...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-04-15 00:08:10 Re: New IndexAM API controlling index vacuum strategies
Previous Message Mark Dilger 2021-04-14 22:18:48 Re: Converting contrib SQL functions to new style