Re: Crash by targetted recovery

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Crash by targetted recovery
Date: 2020-03-05 10:51:11
Message-ID: 0db8f83f-5b66-75e9-6d99-93f19fc33165@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/03/05 12:08, Kyotaro Horiguchi wrote:
> At Mon, 2 Mar 2020 20:54:04 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>>> And random access during StandbyMode ususally (always?) lets RecPtr go
>>> back. I'm not sure WaitForWALToBecomeAvailable works correctly if we
>>> don't have a file in pg_wal and the REDO point is far back by more
>>> than a segment from the initial checkpoint record.
>>
>> It works correctly. This is why WaitForWALToBecomeAvailable() uses
>> fetching_ckpt argument.
>
> Hmm. You're right. We start streaming from RedoStartLSN when
> fetching_ckpt. So that doesn't happen.
>
>>> If we go back to XLOG_FROM_ARCHIVE by random access, it correctly
>>> re-connects to the primary for the past segment.
>>
>> But this can lead to unnecessary restart of walreceiver. Since
>> fetching_ckpt ensures that the WAL file containing the REDO
>> starting record exists in pg_wal, there is no need to reconnect
>> to the primary when reading the REDO starting record.
>>
>> Is there other case where we need to go back to XLOG_FROM_ARCHIVE
>> by random access?
>
> I understand that the reconnection for REDO record is useless. Ok I
> take the !StandbyMode way.
>
> The attached is the test script that is changed to count the added
> test, and the slight revised main patch.

Thanks for the patch!

+ /* Wal receiver should not active when entring XLOG_FROM_ARCHIVE */
+ Assert(!WalRcvStreaming());

+1 to add this assertion check.

Isn't it better to always check this while trying to read WAL from
archive or pg_wal? So, what about the following change?

{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
+ /*
+ * WAL receiver should not be running while trying to
+ * read WAL from archive or pg_wal.
+ */
+ Assert(!WalRcvStreaming());
+
/* Close any old file we might have open. */
if (readFile >= 0)

+ lastSourceFailed = false; /* We haven't failed on the new source */

Is this really necessary? Since ReadRecord() always reset
lastSourceFailed to false, it seems not necessary.

- else if (currentSource == 0)
+ else if (currentSource == 0 ||

Though this is a *separate topic*, 0 should be XLOG_FROM_ANY?
There are some places where 0 is used as the value of currentSource.
IMO they should be updated so that XLOG_FROM_ANY is used instead of 0.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-03-05 11:16:11 Re: Some problems of recovery conflict wait events
Previous Message Darafei Komяpa Praliaskouski 2020-03-05 10:16:22 Re: Berserk Autovacuum (let's save next Mandrill)