RE: Stronger safeguard for archive recovery not to miss data

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Laurenz Albe' <laurenz(dot)albe(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Stronger safeguard for archive recovery not to miss data
Date: 2021-01-21 11:49:07
Message-ID: OSBPR01MB488879E8A1807F595AD82B27EDA10@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

On Wed, Jan 20, 2021 9:03 PM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> wrote:
>
> On Wed, 2021-01-20 at 13:10 +0900, Fujii Masao wrote:
> > + errhint("Run recovery again from a
> > + new base backup taken after setting wal_level higher than
> > + minimal")));
> >
> > Isn't it impossible to do this in normal archive recovery case? In
> > that case, since the server crashed and the database got corrupted,
> > probably we cannot take a new base backup.
> >
> > Originally even when users accidentally set wal_level to minimal, they
> > could start the server from the backup by disabling hot_standby and salvage
> the data.
> > But with the patch, we lost the way to do that. Right? I was wondering
> > that WARNING was used intentionally there for that case.
OK. I understand that this WARNING is necessary to give user a chance
to start up the server again and salvage his/her data,
when hot_standby=off and the server goes through a period of wal_level=minimal
during an archive recovery.

> I would argue that if you set your "wal_level" to minimal, do some work, reset it
> to "replica" and recover past that, two things could happen:
>
> 1. Since "archive_mode" was off, you are missing some WAL segments and
> cannot recover past that point anyway.
>
> 2. You are missing some relevant WAL records, and your recovered
> database is corrupted. This is very likely, because you probably
> switched to "minimal" with the intention to generate less WAL.
>
> Now I see your point that a corrupted database may be better than no database
> at all, but wouldn't you agree that a warning in the log (that nobody reads) is
> too little information?
>
> Normally we don't take such a relaxed attitude towards data corruption.
Yeah, we thought we needed stronger protection for that reason.

> Perhaps there could be a GUC "recovery_allow_data_corruption" to override this
> check, but I'd say that a warning is too little.
>
> > if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> > ereport(ERROR,
> > (errmsg("hot standby is not
> possible because wal_level was not set to \"replica\" or higher on the primary
> server"),
> > errhint("Either set wal_level
> > to \"replica\" on the primary, or turn off hot_standby here.")));
> >
> > With the patch, we never reach the above code?
>
> Right, that would have to go. I didn't notice that.
Adding a condition to check if "recovery_allow_data_corruption" is 'on' around the end of
CheckRequiredParameterValues() sounds safer for me too, although
implementing a new GUC parameter sounds bigger than what I expected at first.
The default of the value should be 'off' to protect users from getting the corrupted server.
Does everyone agree with this direction ?

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-01-21 11:53:24 Re: Is Recovery actually paused?
Previous Message Dean Rasheed 2021-01-21 11:11:55 Re: PoC/WIP: Extended statistics on expressions