Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thunder <thunder1(at)126(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node
Date: 2019-11-29 02:39:48
Message-ID: 20191129023948.GE2505@paquier.xyz
Lists: pgsql-hackers

On Thu, Oct 03, 2019 at 05:54:40PM +0900, Fujii Masao wrote:
> On Thu, Oct 3, 2019 at 1:57 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > On Thu, Oct 03, 2019 at 01:49:34PM +0900, Fujii Masao wrote:
> > > But this can cause subsequent recovery to always fail with invalid-pages error
> > > and the server not to start up. This is bad. So, to alleviate the situation,
> > > I'm thinking it would be worth adding something like ignore_invalid_pages
> > > developer parameter. When this parameter is set to true, the startup process
> > > always ignores invalid-pages errors. Thoughts?
> >
> > That could be helpful.
>
> So attached patch adds new developer GUC "ignore_invalid_pages".
> Setting ignore_invalid_pages to true causes the system
> to ignore the failure (but still report a warning), and continue recovery.
>
> I will add this to next CommitFest.

No actual objections against this patch from me as a dev option.

+ Detection of WAL records having references to invalid pages during
+ recovery causes <productname>PostgreSQL</productname> to report
+ an error, aborting the recovery. Setting
Well, that's not really an error. This triggers a PANIC, i.e. it crashes
the server. And in this case the actual problem is that you may not
be able to move on with recovery when restarting the server again,
unless luck is on your side, because you would hit the same PANIC
on every restart.
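
To make the PANIC-versus-warning distinction concrete, here is a minimal
standalone sketch. It is not the code from the patch: the names
sketch_level, invalid_pages_level and check_invalid_pages are hypothetical,
and the only thing taken from this thread is the behavior described above
(PANIC by default, a warning and continued recovery when
ignore_invalid_pages is true).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for server log levels; not PostgreSQL's definitions. */
typedef enum
{
    SKETCH_WARNING,
    SKETCH_PANIC
} sketch_level;

/* Stand-in for the proposed developer GUC; off by default. */
static bool ignore_invalid_pages = false;

/* Choose the severity used when WAL still references invalid pages. */
static sketch_level
invalid_pages_level(bool guc_enabled)
{
    return guc_enabled ? SKETCH_WARNING : SKETCH_PANIC;
}

/*
 * Returns true if recovery may continue, false if the server would
 * crash (the PANIC case).  n_invalid_refs is an assumed count of
 * unresolved references to invalid pages.
 */
static bool
check_invalid_pages(int n_invalid_refs)
{
    if (n_invalid_refs == 0)
        return true;            /* nothing to report */

    if (invalid_pages_level(ignore_invalid_pages) == SKETCH_WARNING)
    {
        fprintf(stderr, "WARNING: WAL contains references to invalid pages\n");
        return true;            /* report, then keep replaying */
    }

    fprintf(stderr, "PANIC: WAL contains references to invalid pages\n");
    return false;               /* a real server would abort here */
}
```

With the default setting, the same WAL is replayed and fails the same way
on each restart, which is why the docs should not describe this as a
plain error.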

+ recovery. This behavior may <emphasis>cause crashes, data loss,
+ propagate or hide corruption, or other serious problems</emphasis>.
Nit: indentation on the second line here.

+ However, it may allow you to get past the error, finish the recovery,
+ and cause the server to start up.
For consistency here I would suggest the second part of the sentence
to be "TO finish recovery, and TO cause the server to start up".

+ The default setting is off, and it can only be set at server start.
Nit^2: Missing a <literal> markup for "off"?
--
Michael