Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
Date: 2020-09-18 18:07:09
Message-ID: f36d13f5-1701-2299-065b-086892b194d9@oss.nttdata.com
Lists: pgsql-hackers

On 2020/09/18 9:30, Thomas Munro wrote:
> On Thu, Sep 17, 2020 at 10:47 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> Hmm, so for speedy response to postmaster death, you're relying on the
>> loops to have other postmaster-death checks besides
>> HandleStartupProcInterrupts(), in the form of WL_EXIT_ON_PM_DEATH. That
>> seems a bit fragile, at the very least it needs a comment in
>> HandleStartupProcInterrupts() to call it out.
>
> Surely that's the direction we want to go in, though, no? Commit
> cfdf4dc4 was intended to standardise the way we react to postmaster
> death where waiting is involved. I updated the comment in
> HandleStartupProcInterrupts() to highlight that the
> PostmasterIsAlive() check in there is only for the benefit of
> CPU-bound loops.
>
>> Note that there's one more loop in ShutdownWalRcv() that uses pg_usleep().
>
> Updating that one required me to invent a new wait_event for
> pg_stat_activity, which seems like progress.
>
> Unfortunately, while I was doing that I realised that WaitLatch()
> without WL_LATCH_SET was broken by commit 3347c982bab (in master
> only), in a way not previously reached by the tests. So 0001 is a
> patch to fix that.

- pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
- pg_usleep(1000000L); /* 1000 ms */
- pgstat_report_wait_end();
+ WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 1000,
+ WAIT_EVENT_RECOVERY_PAUSE);

Doesn't this change mean that a standby promotion request issued during
a WAL replay pause could be delayed by up to one second? It's only one
second, but I'd like to avoid this (unnecessary) wait to keep failover
time as short as possible. So what about using WL_LATCH_SET here?

But with WL_LATCH_SET, one concern is that walreceiver could wake up the
startup process too frequently, even during a WAL replay pause. Is that
also a problem? I'm OK with it, but if it is, keeping pg_usleep() as in
the original code might be better.
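
To make the trade-off concrete, here is a minimal standalone sketch (not
PostgreSQL code) that models a latch with a POSIX pipe and poll(). All
names in it (ToyLatch, toy_wait, and so on) are hypothetical and only
mimic the idea of waiting with and without a latch: a timeout-only wait
always sleeps for the full timeout, while a latch-aware wait returns as
soon as someone sets the latch.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical result codes, loosely analogous to a latch wakeup vs. a timeout. */
enum { TOY_WOKEN_BY_LATCH = 1, TOY_WOKEN_BY_TIMEOUT = 2 };

/* A toy latch: a pipe whose read end becomes readable once the latch is set. */
typedef struct ToyLatch
{
	int		fds[2];		/* fds[0] = read end, fds[1] = write end */
} ToyLatch;

/* Create the pipe backing the latch; returns false on failure. */
bool
toy_latch_init(ToyLatch *latch)
{
	assert(latch != NULL);
	return pipe(latch->fds) == 0;
}

/* Set the latch by writing one byte, which wakes any poll() on the read end. */
void
toy_latch_set(ToyLatch *latch)
{
	char		byte = 0;
	ssize_t		written = write(latch->fds[1], &byte, 1);

	(void) written;		/* errors are ignored in this sketch */
}

/* Reset the latch by draining every pending byte without blocking. */
void
toy_latch_reset(ToyLatch *latch)
{
	struct pollfd pfd = {.fd = latch->fds[0], .events = POLLIN};
	char		byte;

	while (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
	{
		if (read(latch->fds[0], &byte, 1) <= 0)
			break;
	}
}

/*
 * Wait for up to timeout_ms milliseconds.  With latch == NULL this is a
 * timeout-only wait and always sleeps the full period.  With a latch it
 * returns early if the latch is (or becomes) set.  An interrupted poll()
 * is reported as a timeout to keep the sketch short.
 */
int
toy_wait(ToyLatch *latch, int timeout_ms)
{
	struct pollfd pfd;

	if (latch == NULL)
	{
		(void) poll(NULL, 0, timeout_ms);
		return TOY_WOKEN_BY_TIMEOUT;
	}

	pfd.fd = latch->fds[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	return poll(&pfd, 1, timeout_ms) > 0 ? TOY_WOKEN_BY_LATCH
										 : TOY_WOKEN_BY_TIMEOUT;
}
```

In this model, a pause loop built on toy_wait(&latch, 1000) would react
to a promotion request immediately, but it would also run one extra
iteration for every toy_latch_set() call made for unrelated reasons,
which is the frequent-wakeup concern raised above.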

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
