Re: Possible corruption by CreateRestartPoint at promotion

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: nathandbossart(at)gmail(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, masao(dot)fujii(at)oss(dot)nttdata(dot)com
Subject: Re: Possible corruption by CreateRestartPoint at promotion
Date: 2022-04-27 01:43:53
Message-ID: 20220427.104353.691219597727808197.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote in
> On Wed, Mar 16, 2022 at 10:24:44AM +0900, Kyotaro Horiguchi wrote:
> > While discussing on additional LSNs in checkpoint log message,
> > Fujii-san pointed out [2] that there is a case where
> > CreateRestartPoint leaves unrecoverable database when concurrent
> > promotion happens. That corruption is "fixed" by the next checkpoint
> > so it is not a severe corruption.
>
> I suspect we'll start seeing this problem more often once end-of-recovery
> checkpoints are removed [0]. Would you mind creating a commitfest entry
> for this thread? I didn't see one.

I'm not sure the patch makes any change here, because restart points
don't run while crash recovery, since no checkpoint records seen
during a crash recovery. Anyway the patch doesn't apply anymore so
rebased, but only the one for master for the lack of time for now.

> > AFAICS since 9.5, no check(/restart)pionts won't run concurrently with
> > restartpoint [3]. So I propose to remove the code path as attached.
>
> Yeah, this "quick hack" has been around for some time (2de48a8), and I
> believe much has changed since then, so something like what you're
> proposing is probably the right thing to do.

Thanks for checking!

> > /* Also update the info_lck-protected copy */
> > SpinLockAcquire(&XLogCtl->info_lck);
> > - XLogCtl->RedoRecPtr = lastCheckPoint.redo;
> > + XLogCtl->RedoRecPtr = RedoRecPtr;
> > SpinLockRelease(&XLogCtl->info_lck);
> >
> > /*
> > @@ -6984,7 +6987,10 @@ CreateRestartPoint(int flags)
> > /* Update the process title */
> > update_checkpoint_display(flags, true, false);
> >
> > - CheckPointGuts(lastCheckPoint.redo, flags);
> > + CheckPointGuts(RedoRecPtr, flags);
>
> I don't understand the purpose of these changes. Are these related to the
> fix, or is this just tidying up?

The latter, since the mixed use of two not-guaranteed-to-be-same
variables at the same time for the same purpose made me perplexed (but
I feel the change can hardly incorporated alone). However, you're
right that it is irrelevant to the fix, so removed including other
instances of the same.

> [0] https://postgr.es/m/CA%2BTgmoY%2BSJLTjma4Hfn1sA7S6CZAgbihYd%3DKzO6srd7Ut%3DXVBQ%40mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v5-0001-Correctly-update-contfol-file-at-the-end-of-archi.patch text/x-patch 4.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-04-27 01:47:53 Re: pgsql: Add contrib/pg_walinspect.
Previous Message Michael Paquier 2022-04-27 01:25:40 Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL