Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, zxwsbg12138(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
Date: 2023-02-21 03:51:51
Message-ID: Y/Q/17rpYS7YGbIt@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Feb 01, 2023 at 07:32:52AM -0800, Andres Freund wrote:
> I might be missing something, but I suspect the problem here is that we
> shouldn't have been creating a restart point. Afaict, the setup
> instructions provided don't configure a recovery.signal, so we'll just
> perform crash recovery.
>
> And I don't think it'd ever make sense to create a restart point during
> crash recovery?

A restart point should never be created during crash recovery until we
reach a consistent state, because there could be a risk of seeing
inconsistent pages, for one, no? SwitchIntoArchiveRecovery() would
be the portion of the code doing the switch from crash to archive
recovery if it was requested, as called in the callback to read
records in ReadRecord().

> Except that in this case, it's not pure crash recovery, it's restoring
> from a backup label. Due to which it actually might make sense to create
> restart points?

Hmm. I'd like to think that these days requesting replay from a
backup_label file without a recovery.signal or a standby.signal is
just asking for trouble. Still, the case of the OP, by just having a
backup label, is equivalent to restoring from a self-contained backup,
which is something that should work on its own even if there is no
recovery.signal.

> If you're doing PITR or such you don't really gain
> anything by doing checkpoints until you've reached consistency, unless
> you want to optimize for the case that you might need to start/stop the
> instance multiple times?

Yes, that could help in some cases. The PITR target could be far away
from the consistent point, and the user could have just copied WAL
segments in pg_wal/ rather than relying on a recovery.signal with at
least a restore_command. That would be fancy, still being able to
start/stop and retry targets may make sense..

So, yes, it seems to me that the correct answer here would be just to
skip restart points as long as SharedRecoveryState is in
RECOVERY_STATE_CRASH, because the checkpointer relies on
RecoveryInProgress() and it cannot make the difference between crash
recovery and archive recovery. But now the checkpointer is started
during recovery to ease the end-of-recovery checkpoint process.
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kyotaro Horiguchi 2023-02-21 04:36:23 Re: BUG #17789: process_pgfdw_appname() fails for autovacuum workers
Previous Message Charles 2023-02-21 02:45:24 Re: Query run in 27s with 15.2 vs 37ms with 14.6