Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, andres(at)anarazel(dot)de, zxwsbg12138(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
Date: 2023-02-25 20:38:01
Message-ID: CA+hUKGJWOsxT-kkMmLwvFn-VVmZ3vnYBeFw7=_Aq5txdFdGUEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, Feb 25, 2023 at 4:39 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> By the way, I think that I have put my finger on something.. The
> startup process happily sets InArchiveRecovery to true after
> read_backup_label() if finding one, with a comment telling that
> archive recovery is requested, still ArchiveRecoveryRequested is never
> set to true unless there is a recovery.signal or a standby.signal.

Yeah, it's all a bit confusing. That particular fact is the reason we
can't just rely on RECOVERY_STATE_CRASH to suppress restart points in
RecoveryRestartPoint(), which was my first naive idea before I started
working through scenarios... Another thing that is a little confusing
is that commit 7ff23c6 didn't change the behaviour of hot standbys in
this respect: they already started the checkpointer at the start of
recovery, not at the start of hot standby. Which leads to Bowen's
suggestion to make recovery-from-backup work the same.

As for what we're actually going to do about this, and how to test
that it's working the way we want, here's my current idea, which I'm
planning to work up experimental patches for on Monday:

(1) Since it's quite hard to hit the assertion failure without
generating a megaton of data, or hacking checkpoint_timeout to allow 1
second etc, I think we should test the startup process's behaviour
instead (ie rather than the checkpointer's). We could have
RecoveryRestartPoint() log when it's recording a restart point in
shared memory (= a potential restart point, whether or not the
checkpoint chooses to act on it), and also when it's choosing to skip.
At DEBUG2 level. Then we could then have a test 035_restart_points.pl
for all the various situations (actual crash, backup, backup +
archives, pitr, with standby enabled, etc etc), and make assertions
about which messages we log. That gives us observability and nails
down the differences between these modes.

(2) Then I think we should simply prevent restart points that 14
wouldn't have allowed, just a couple of back-patchable lines, maybe
(yet another) new variable.

Then for master I think we should (in later patches, for 16 or maybe
17) consider relaxing that. There doesn't seem to be a technical
reason why a very long running PITR shouldn't benefit from restart
points, and that would be in line with the direction of making
recovery modes more similar, but given we/I missed this subtlety in
the past, the more conservative choice seems more appropriate for 15.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2023-02-25 20:55:50 Re: BUG #17800: ON CONFLICT DO UPDATE fails to detect incompatible fields that leads to a server crash
Previous Message Andres Freund 2023-02-25 20:19:42 Re: BUG #17800: ON CONFLICT DO UPDATE fails to detect incompatible fields that leads to a server crash