Re: PANIC during crash recovery of a recently promoted standby

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PANIC during crash recovery of a recently promoted standby
Date: 2018-05-24 07:57:07
Message-ID: 20180524075707.GE15445@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote:
> Looks like I didn't understand Alvaro's comment when he mentioned it to me
> off-list. But I now see what Michael and Alvaro mean and that indeed seems
> like a problem. I was thinking that the test for (ControlFile->state ==
> DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
> after the standby is promoted. While that's true for a DB_IN_PRODUCTION, the
> RestartPoint may finish after we have written end-of-recovery record, but
> before we're in production and thus the minRecoveryPoint may again be set.

Yeah, this has been something I considered as well first, but I was not
confident enough that setting up minRecoveryPoint to InvalidXLogRecPtr
was actually a safe thing for timeline switches.

So I have spent a good portion of today testing and playing with it to
be confident enough that this was right, and I have finished with the
attached. The patch adds a new flag to XLogCtl which marks if the
control file has been updated after the end-of-recovery record has been
written, so as minRecoveryPoint does not get updated because of a
restart point running in parallel.

I have also reworked the test case you sent, removing the manuals sleeps
and replacing them with correct wait points. There is also no point to
wait after promotion as pg_ctl promote implies a wait. Another
important thing is that you need to use wal_log_hints = off to see a
crash, which is something that allows_streaming actually enables.

Comments are welcome.
--
Michael

Attachment Content-Type Size
recovery-panic-michael.patch text/x-diff 7.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Maxim Boguk 2018-05-24 09:38:03 Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
Previous Message Thomas Munro 2018-05-24 07:15:23 Re: PG11 jit failing on ppc64el