Re: Stronger safeguard for archive recovery not to miss data

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: masao(dot)fujii(at)oss(dot)nttdata(dot)com
Cc: david(at)pgmasters(dot)net, osumi(dot)takamichi(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, laurenz(dot)albe(at)cybertec(dot)at
Subject: Re: Stronger safeguard for archive recovery not to miss data
Date: 2021-03-31 05:23:26
Message-ID: 20210331.142326.1894636699406040375.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Wed, 31 Mar 2021 02:11:48 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>
>
> On 2021/03/26 22:14, David Steele wrote:
> > On 3/25/21 9:23 PM, Fujii Masao wrote:
> >>
> >> On 2021/03/25 23:21, David Steele wrote:
> >>> On 1/25/21 3:55 AM, Laurenz Albe wrote:
> >>>> On Mon, 2021-01-25 at 08:19 +0000, osumi(dot)takamichi(at)fujitsu(dot)com wrote:
> >>>>>> I think you should pst another patch where the second, now
> >>>>>> superfluous, error
> >>>>>> message is removed.
> >>>>>
> >>>>> Updated. This patch showed no failure during regression tests
> >>>>> and has been aligned by pgindent.
> >>>>
> >>>> Looks good to me.
> >>>> I'll set it to "ready for committer" again.
> >>>
> >>> Fujii, does the new patch in [1] address your concerns?
> >>
> >> No. I'm still not sure if this patch is good idea... I understand
> >> why this safeguard is necessary. OTOH I'm afraid it increases
> >> a bit the risk that users get unstartable database, i.e., lose whole
> >> database.
> >> But maybe I'm concerned about rare case and my opinion is minority
> >> one.
> >> So I'd like to hear more opinions about this patch.
> > After reviewing the patch I am +1. I think allowing corruption in
> > recovery by default is not a good idea. There is currently a warning
> > but I don't think that is nearly strong enough and is easily missed.
> > Also, "data may be missing" makes this sound like an acceptable
> > situation. What is really happening is corruption is being introduced
> > that may make the cluster unusable or at the very least lead to errors
> > during normal operation.
>
> Ok, now you, Osumi-san and Laurenz agree to this change
> while I'm only the person not to like this. So unless we can hear
> any other objections to this change, probably we should commit the
> patch.

Sorry for being late, but FWIW, I'm confused.

This patch checks if archive recovery is requested after shutting down
while in the minimal mode. Since we officially reject to take a
backup while wal_level=minimal, the error is not supposed to be raised
while recoverying from a past backup. If the cluster was running in
minimal mode, the archive recovery does nothing substantial (would end
with running a crash recvoery at worst). I'm not sure how the
concrete curruption mode caused by the operation looks like. I agree
that we should reject archive recovery on a wal_level=minimal cluster,
but the reason is not that the operation is breaking the past backup,
but that it's just nonsentical.

On the other hand, I don't think this patch effectively protect
archive from getting a stopping gap. Just starting the server without
archive recovery succeeds and the archive is successfully broken for
the backups in the past. In other words, if we are intending to let
users know that their backups have gotten a stopping gap, this patch
doesn't seem to work in that sense. I would object to reject starting
with wal_level=replica and without archive recovery after shutting
down in minimal mode. That's obviously an overkill.

So, what I think we should do for the objective is to warn if
archiving is enabled but backup is not yet taken. Yeah, (as I
remember that such discussion has been happened) I don't think we have
a reliable way to know that.

I apologize in advance if I'm missing something.

> > If we want to allow recovery to play past this point I think it would
> > make more sense to have a GUC (like ignore_invalid_pages [1]) that
> > allows recovery to proceed and emits a warning instead of fatal.
> > Looking the patch, I see a few things:
> > 1) Typo in the tests:
> > This protection shold apply to recovery mode
> > should be:
> > This protection should apply to recovery mode
> > 2) I don't think it should be the job of this patch to restructure the
> > if conditions, even if it is more efficient. It just obscures the
> > purpose of the patch.
>
> +1

So, I don't think even this is needed.

> > So, I would revert all the changes in xlog.c except changing the
> > warning to an error:
> > -        ereport(WARNING,
> > -                (errmsg("WAL was generated with wal_level=minimal,
> > -data may be missing"),
> > -                 errhint("This happens if you temporarily set
> > -wal_level=minimal without taking a new base backup.")));
> > +            ereport(FATAL,
> > +                    (errmsg("WAL was generated with
> > wal_level=minimal, cannot continue recovering"),
> > +                     errdetail("This happens if you temporarily set
> > wal_level=minimal on the server."),
> > +                     errhint("Run recovery again from a new base
> > backup taken after setting wal_level higher than minimal")));
> I guess that users usually encounter this error because they have not
> taken base backups yet after setting wal_level to higher than minimal
> and have to use the old base backup for archive recovery. So I'm not
> sure
> how much only this HINT is helpful for them. Isn't it better to append
> something like "If there is no such backup, recover to the point in
> time
> before wal_level is set to minimal even though which cause data loss,
> to start the server." into HINT?
> The following error will never be thrown because of the patch.
> So IMO the following code should be removed.
>
> if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> ereport(ERROR,
> (errmsg("hot standby is not possible because wal_level was
> not set to \"replica\" or higher on the primary server"),
> errhint("Either set wal_level to \"replica\" on the
> primary, or turn off hot_standby here.")));

It seems to be alredy removed in the latest patch?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-03-31 05:36:24 Re: Stronger safeguard for archive recovery not to miss data
Previous Message Etsuro Fujita 2021-03-31 05:12:07 Re: Asynchronous Append on postgres_fdw nodes.