Re: Corruption during WAL replay

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, tejeswarm(at)hotmail(dot)com, Andres Freund <andres(at)anarazel(dot)de>, hlinnaka <hlinnaka(at)iki(dot)fi>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Daniel Wood <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2021-08-10 18:14:05
Message-ID: CA+Tgmobwc_Rdaw+6TupT4_g9z55JjL=vhwpphsQe=YmBN0OPDg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> used each other, but MarkBufferDirtyHint which delays checkpoint start
> is called in RelationTruncate while delaying checkpoint completion.
> That is not a strange nor harmful behavior. I changed delayChkpt to a
> bitmap integer from an enum so that both barrier are separately
> triggered.
>
> I'm not sure this is the way to go here, though. This fixes the issue
> of a crash during RelationTruncate, but the issue of smgrtruncate
> failure during RelationTruncate still remains (unless we treat that
> failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I
think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Meskes 2021-08-10 18:16:49 Re: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE
Previous Message Bruce Momjian 2021-08-10 18:07:00 Re: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE