Delay of standby shutdown

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Delay of standby shutdown
Date: 2020-09-17 13:49:03
Message-ID: 9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

When I shut down the standby server just after setting up replication
environment and shutting down the primary, I found that it took
five seconds to complete the shutdown of the standby server. Note that
no WAL was generated between the setup of replication and the server
shutdown, in this case. That is, basically neither WAL replay nor
shutdown restart point didn't cause this issue, I think. Here are
the log messages in that standby. I have observed this issue some times
so far.

2020-09-17 21:29:54 JST [postmaster] sby1 LOG: received fast shutdown request
2020-09-17 21:29:54 JST [postmaster] sby1 LOG: aborting any active transactions
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG: shutting down
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG: restartpoint starting: shutdown immediate
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.014 s, sync=0.000 s, total=0.022 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
2020-09-17 21:29:59 JST [checkpointer] sby1 LOG: recovery restart point at 0/5000060
2020-09-17 21:29:59 JST [postmaster] sby1 LOG: database system is shut down

Per my analysis, this issue could happen in the following way.

1. In WaitForWALToBecomeAvailable(), the startup proesss receives
the shutdown request during sleeping wal_retrieve_retry_interval.
2. In ReadRecord(), the startup process thinks no valid record available
and loops back to retry reading WAL from stream.
3. In WaitForWALToBecomeAvailable(), the startup proesss tries to
start up walreceiver and then waits for recoveryWakeupLatch.
4. After the timeout (5 seconds is used for that wait for recoveryWakeupLatch)
passes, HandleStartupProcInterrupts() is called and the shutdown request
is processed.

So waiting at #3 for recoveryWakeupLatch caused the delay of
the standby shutdown.

To avoid this issue, I think that ReadRecord() should call
HandleStartupProcInterrupts() whenever looping back to retry.
Patch attached. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
shutdown_in_standby_v1.patch text/plain 738 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2020-09-17 13:56:27 Re: [HACKERS] [PATCH] Generic type subscripting
Previous Message Jakub Wartak 2020-09-17 13:18:55 Re: Parallelize stream replication process