Re: Corruption during WAL replay

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: ibrar(dot)ahmad(at)gmail(dot)com, tejeswarm(at)hotmail(dot)com, andres(at)anarazel(dot)de, hlinnaka(at)iki(dot)fi, masahiko(dot)sawada(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org, hexexpert(at)comcast(dot)net
Subject: Re: Corruption during WAL replay
Date: 2021-09-27 08:30:36
Message-ID: 20210927.173036.115679524750553023.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for the comments! (Sorry for the late resopnse.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> > used each other, but MarkBufferDirtyHint which delays checkpoint start
> > is called in RelationTruncate while delaying checkpoint completion.
> > That is not a strange nor harmful behavior. I changed delayChkpt to a
> > bitmap integer from an enum so that both barrier are separately
> > triggered.
> >
> > I'm not sure this is the way to go here, though. This fixes the issue
> > of a crash during RelationTruncate, but the issue of smgrtruncate
> > failure during RelationTruncate still remains (unless we treat that
> > failure as PANIC?).
>
> I like this patch. As I understand it, we're currently cheating by
> allowing checkpoints to complete without necessarily flushing all of
> the pages that were dirty at the time we fixed the redo pointer out to
> disk. We think this is OK because we know that those pages are going
> to get truncated away, but it's not really OK because when the system
> starts up, it has to replay WAL starting from the checkpoint's redo
> pointer, but the state of the page is not the same as it was at the
> time when the redo pointer was the end of WAL, so redo fails. In the
> case described in
> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
> modifications are made to the page before the redo pointer is fixed
> and those changes never make it to disk, but the truncation also never
> makes it to the disk either. With this patch, that can't happen,
> because no checkpoint can intervene between when we (1) decide we're
> not going to bother writing those dirty pages and (2) actually
> truncate them away. So either the pages will get written as part of
> the checkpoint, or else they'll be gone before the checkpoint
> completes. In the latter case, I suppose redo that would have modified
> those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

> In RelationTruncate, I suggest that we ought to clear the
> delay-checkpoint flag before rather than after calling
> FreeSpaceMapVacuumRange. Since the free space map is not fully
> WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

> think it might be better if MarkBufferDirtyHint stays closer to the
> existing coding and just uses a Boolean and an if-test to decide
> whether to clear the bit, instead of inventing a new mechanism. I
> don't really see anything wrong with the new mechanism, but I think
> it's better to keep the patch minimal.

Yeah, that was a a kind of silly. Fixed.

> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate.

+ * Delay the concurrent checkpoint's completion until this truncation
+ * successfully completes, so that we don't establish a redo-point between
+ * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+ * file content against the WAL records after the REDO position and future
+ * recovery fails.

However, a problem for me for now is that I cannot reproduce the
problem.

To avoid further confusion, the attached is named as *.patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
delay_chkpt_cmpl_after_trunc_sucess_v3.patch text/x-patch 13.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-09-27 08:57:07 Re: Timeout failure in 019_replslot_limit.pl
Previous Message Kyotaro Horiguchi 2021-09-27 08:28:10 Re: Corruption during WAL replay