Re: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: "Chen, Yan-Jack (NSB - CN/Hangzhou)" <yan-jack(dot)chen(at)nokia-sbell(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.
Date: 2018-06-27 16:28:25
Message-ID: CAHGQGwERcjEORNh7NmdC8Theg+GCziUcFvi6nZACW9PK-JadhQ@mail.gmail.com
Lists: pgsql-hackers

>> We encounter one problem happened while we try to promote standby
>> postgres(version 9.6.9) server to active. From the trace(we triggered the
>> process abort). We can see the process was hang in syslog() while handling
>> SIGTERM. According to below article. Looks it is risky to write syslog in
>> signal handling. Any idea to avoid it?

ISTM that this issue can happen if ereport() is called before the
WalRcvImmediateInterruptOK flag is disabled, as in the following code.
In that case, if SIGTERM arrives while the log message is being written,
the signal handler calls another ereport() because the
WalRcvImmediateInterruptOK flag is still enabled.
Then walreceiver gets stuck...

    EnableWalRcvImmediateExit();
    wrconn = walrcv_connect(conninfo, false, "walreceiver", &err);
    if (!wrconn)
        ereport(ERROR,
                (errmsg("could not connect to the primary server: %s", err)));
    DisableWalRcvImmediateExit();
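
For comparison, here is a minimal standalone sketch of the "handler only sets a flag" pattern, where the log message is written later from normal context so that logging is never re-entered from the handler. This is not PostgreSQL code: shutdown_requested, handle_sigterm(), process_pending_interrupts() and demo_deferred_logging() are hypothetical names, and fprintf() is only a stand-in for ereport().

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Hypothetical names for illustration; this is not PostgreSQL code. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler only records the request; it never logs. */
void handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Called from normal (non-signal) context at points where logging is safe.
 * Returns 1 if a shutdown was requested, 0 otherwise.
 */
int process_pending_interrupts(void)
{
    if (!shutdown_requested)
        return 0;
    /* Stand-in for ereport(FATAL, ...): runs outside the handler. */
    fprintf(stderr, "terminating due to SIGTERM\n");
    return 1;
}

/* Usage: deliver SIGTERM to ourselves and observe the deferred handling. */
int demo_deferred_logging(void)
{
    shutdown_requested = 0;
    if (signal(SIGTERM, handle_sigterm) == SIG_ERR)
        return -1;
    assert(process_pending_interrupts() == 0);
    raise(SIGTERM);     /* the handler runs before raise() returns */
    return process_pending_interrupts();
}
```

The point of the sketch is only that the message is emitted after the handler has returned, so a signal arriving in the middle of logging cannot start a second logging call.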

On Tue, Jun 26, 2018 at 5:12 PM, Chen, Yan-Jack (NSB - CN/Hangzhou)
<yan-jack(dot)chen(at)nokia-sbell(dot)com> wrote:
> Hi,
> Well, if you agree with do not write log in signal handling function in any circumstance? I see in many cases in postgresql signal handling function just set one flag which triggers its main process to handling the progress.
> How about simply remove the code lines?
>
> --- walreceiver_old.c
> +++ walreceiver.c
> @@ -816,10 +816,6 @@
>
> SetLatch(&WalRcv->latch);
>
> - /* Don't joggle the elbow of proc_exit */
> - if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
> - ProcessWalRcvInterrupts();
> -
> errno = save_errno;
> }

This change seems to cause another hang. Imagine the case
where SIGTERM is sent while libpqrcv_connect() is waiting on the latch
(i.e., in WaitLatchOrSocket()). In this case, SIGTERM causes
libpqrcv_connect() to wake up, call ResetLatch() and
CHECK_FOR_INTERRUPTS(), and then go back to waiting on the latch.
That is, walreceiver can get stuck in libpqrcv_connect().
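
The scenario above can be modeled with a toy loop. This is only a sketch under simplifying assumptions: ToyLatch, toy_sigterm_handler(), toy_check_for_interrupts() and toy_connect_wait() are made-up stand-ins, not the real latch or libpqrcv code, and the max_waits bound stands in for "waits forever".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
typedef struct ToyLatch
{
    bool is_set;
} ToyLatch;

static ToyLatch latch;
static bool got_sigterm = false;

/* Handler as it would behave with the lines removed: flag plus latch only. */
void toy_sigterm_handler(void)
{
    got_sigterm = true;
    latch.is_set = true;
}

/* Stand-in for CHECK_FOR_INTERRUPTS(): it never looks at got_sigterm. */
void toy_check_for_interrupts(void)
{
}

/*
 * Model of the connect wait loop. The connection never becomes ready here,
 * so the only way out is the max_waits bound.
 * Returns how many times the loop went through a wait.
 */
int toy_connect_wait(int max_waits)
{
    int waits = 0;

    assert(max_waits > 0);
    while (waits < max_waits)
    {
        /* WaitLatchOrSocket() would block here until the latch is set. */
        waits++;
        if (latch.is_set)
        {
            latch.is_set = false;           /* ResetLatch() */
            toy_check_for_interrupts();     /* finds nothing to do */
        }
        /* Socket still not ready: loop and wait again. */
    }
    return waits;
}
```

Calling toy_sigterm_handler() and then toy_connect_wait(5) runs all five waits even though the termination request is pending, which is the stuck behavior in miniature.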

One idea to fix the above problem is to change CHECK_FOR_INTERRUPTS()
so that it calls ProcessWalRcvInterrupts() and then ereport(FATAL)
immediately if WalRcvImmediateInterruptOK is true. This would cause
walreceiver to ereport(FATAL) as soon as libpqrcv_connect() wakes up
from the wait on the latch.
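
A rough model of that idea, again only as a sketch: immediate_interrupt_ok stands in for WalRcvImmediateInterruptOK, a true return value from toy_check_for_interrupts() stands in for ereport(FATAL), and every toy_* name is hypothetical rather than an actual PostgreSQL function.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; the names mirror the discussion but are not the real code. */
static bool got_sigterm = false;
static bool immediate_interrupt_ok = false; /* models WalRcvImmediateInterruptOK */
static bool latch_is_set = false;

/* The handler only sets the flag and the latch. */
void toy_sigterm_handler(void)
{
    got_sigterm = true;
    latch_is_set = true;
}

/*
 * Models the proposed check: returns true when the caller must stop,
 * which stands in for ereport(FATAL).
 */
bool toy_check_for_interrupts(void)
{
    return immediate_interrupt_ok && got_sigterm;
}

/*
 * Returns the wait number at which the loop exited, or -1 if it reached
 * the max_waits bound, which stands in for being stuck.
 */
int toy_connect_wait(int max_waits)
{
    assert(max_waits > 0);
    for (int waits = 1; waits <= max_waits; waits++)
    {
        if (latch_is_set)
        {
            latch_is_set = false;           /* ResetLatch() */
            if (toy_check_for_interrupts())
                return waits;
        }
    }
    return -1;
}

/* Usage: send the toy SIGTERM, then run the wait loop. */
int toy_demo(bool immediate_ok)
{
    got_sigterm = false;
    latch_is_set = false;
    immediate_interrupt_ok = immediate_ok;
    toy_sigterm_handler();
    return toy_connect_wait(5);
}
```

In this model toy_demo(true) leaves the loop on the first wakeup, while toy_demo(false) runs into the bound, so the exit depends on the check honoring the flag and not on the handler doing any logging itself.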

Regards,

--
Fujii Masao
