Re: Stronger safeguard for archive recovery not to miss data

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: osumi(dot)takamichi(at)fujitsu(dot)com, david(at)pgmasters(dot)net, pgsql-hackers(at)lists(dot)postgresql(dot)org, laurenz(dot)albe(at)cybertec(dot)at
Subject: Re: Stronger safeguard for archive recovery not to miss data
Date: 2021-04-05 12:16:04
Message-ID: d9eaa61f-1854-b259-1957-c9bf94f1ab22@oss.nttdata.com
Lists: pgsql-hackers

On 2021/04/05 16:13, Kyotaro Horiguchi wrote:
> At Mon, 5 Apr 2021 12:34:53 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>>
>>
>> On 2021/04/04 11:58, osumi(dot)takamichi(at)fujitsu(dot)com wrote:
>>>> IMO it's better to comment why this server restart is necessary.
>>>> As far as I understand correctly, this is necessary to ensure the WAL
>>>> file
>>>> containing the record about the change of wal_level (to minimal) is
>>>> archived,
>>>> so that the subsequent archive recovery will be able to replay it.
>>> OK, added some comments. Further, I felt the way I wrote this part was
>>> not good at all and self-evident
>>> and developers who read this test would feel uneasy about that point.
>>> So, a little bit fixed that test so that we can get clearer conviction
>>> for wal archive.
>>
>> LGTM. Thanks for updating the patch!
>>
>> Attached is the updated version of the patch. I applied the following
>> changes.
>
> + errhint("Use a backup taken after setting wal_level to higher than minimal "
> + "or recover to the point in time before wal_level was changed to minimal even though it may cause data loss.")));
>
> Looking the HINT message, I thought that it's hard to find where up to
> I should recover.

Yes. And what's worse, when archive recovery finds WAL generated with
wal_level=minimal and fails, "minimal" is saved in pg_control's wal_level.
This means that, in that case, every subsequent archive recovery fails at
the beginning of recovery (before entering the main WAL replay loop).
So even if recovery_target_lsn is specified, archive recovery fails before
checking it. No recovery target setting has any effect in that case.

Maybe we can avoid this, for example, by changing xlog_redo() so that
it calls CheckRequiredParameterValues() before UpdateControlFile().
But I'm not sure whether this change is safe. We probably need more time to
consider it, but there is not much time left at this stage.
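
To make the ordering idea concrete, here is a toy model in C. It is only a
sketch: every type and function name in it (ToyControlFile,
toy_redo_parameter_change, and so on) is a hypothetical stand-in and not the
real xlog_redo(), CheckRequiredParameterValues() or UpdateControlFile() code,
and it says nothing about whether the reordering is safe in the server.

```c
#include <stdbool.h>

/* Toy model only: hypothetical stand-ins, not the real PostgreSQL code. */
typedef enum { TOY_WAL_LEVEL_MINIMAL, TOY_WAL_LEVEL_REPLICA } ToyWalLevel;

typedef struct
{
    ToyWalLevel wal_level;  /* models the wal_level stored in pg_control */
} ToyControlFile;

/* Models the parameter check: archive recovery rejects wal_level=minimal. */
bool
toy_check_required(ToyWalLevel level)
{
    return level != TOY_WAL_LEVEL_MINIMAL;
}

/*
 * Models replaying a parameter-change record during archive recovery.
 * Returns false when recovery fails.
 *
 * check_first = false: persist the new wal_level, then check it.
 * check_first = true:  check first, and persist only if the check passes.
 */
bool
toy_redo_parameter_change(ToyControlFile *cf, ToyWalLevel new_level,
                          bool check_first)
{
    if (check_first)
    {
        if (!toy_check_required(new_level))
            return false;       /* fail before anything is persisted */
        cf->wal_level = new_level;
        return true;
    }

    cf->wal_level = new_level;  /* "minimal" is persisted before the check */
    return toy_check_required(new_level);
}

/* Models the check made at the beginning of a later archive recovery. */
bool
toy_recovery_can_start(const ToyControlFile *cf)
{
    return toy_check_required(cf->wal_level);
}
```

In this model, replaying a change to minimal fails in both orders. With
check_first = false the stored level is left at minimal, so
toy_recovery_can_start() also fails on the next attempt. With
check_first = true the stored level is untouched, so a later attempt can
at least start and get as far as looking at the recovery target.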

At least the HINT message "or recover to the point in time before wal_level
was changed to minimal even though it may cause data loss." should be
removed because it's not helpful at all...

OK, so if archive recovery finds WAL generated with wal_level=minimal and
fails, and there is also no backup taken after wal_level was set to higher
than minimal, we basically [1] lose the whole database. I think that those
who set wal_level to minimal understand that this setting can cause data
loss; for example, any data loaded with wal_level=minimal may be lost later.
But I'm afraid that they might not understand the risk of losing the whole
database.

Even if they take a new backup just after setting wal_level to higher than
minimal, the risk of losing the whole database remains until that backup is
completed.

This makes me think that we should document this risk. Thoughts?

[1]
BTW, one very tricky way to recover from this situation seems to be to
copy all required WAL files from the archive to pg_wal and force
crash recovery from the backup. Since crash recovery doesn't
check wal_level, we can avoid the issue by doing that. But this is
very tricky.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
