Re: Corruption during WAL replay

From: Daniel Shelepanov <deniel1495(at)mail(dot)ru>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com
Cc: ibrar(dot)ahmad(at)gmail(dot)com, tejeswarm(at)hotmail(dot)com, andres(at)anarazel(dot)de, hlinnaka(at)iki(dot)fi, masahiko(dot)sawada(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org, hexexpert(at)comcast(dot)net
Subject: Re: Corruption during WAL replay
Date: 2022-01-24 20:33:20
Message-ID: 0c911453-a609-7814-f8de-f6c2cf9c5176@mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 27.09.2021 11:30, Kyotaro Horiguchi wrote:
> Thank you for the comments! (Sorry for the late resopnse.)
>
> At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
>> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
>> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>>> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
>>> used each other, but MarkBufferDirtyHint which delays checkpoint start
>>> is called in RelationTruncate while delaying checkpoint completion.
>>> That is not a strange nor harmful behavior. I changed delayChkpt to a
>>> bitmap integer from an enum so that both barrier are separately
>>> triggered.
>>>
>>> I'm not sure this is the way to go here, though. This fixes the issue
>>> of a crash during RelationTruncate, but the issue of smgrtruncate
>>> failure during RelationTruncate still remains (unless we treat that
>>> failure as PANIC?).
>> I like this patch. As I understand it, we're currently cheating by
>> allowing checkpoints to complete without necessarily flushing all of
>> the pages that were dirty at the time we fixed the redo pointer out to
>> disk. We think this is OK because we know that those pages are going
>> to get truncated away, but it's not really OK because when the system
>> starts up, it has to replay WAL starting from the checkpoint's redo
>> pointer, but the state of the page is not the same as it was at the
>> time when the redo pointer was the end of WAL, so redo fails. In the
>> case described in
>> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
>> modifications are made to the page before the redo pointer is fixed
>> and those changes never make it to disk, but the truncation also never
>> makes it to the disk either. With this patch, that can't happen,
>> because no checkpoint can intervene between when we (1) decide we're
>> not going to bother writing those dirty pages and (2) actually
>> truncate them away. So either the pages will get written as part of
>> the checkpoint, or else they'll be gone before the checkpoint
>> completes. In the latter case, I suppose redo that would have modified
>> those pages will just be skipped, thus dodging the problem.
> I think your understanding is right.
>
>> In RelationTruncate, I suggest that we ought to clear the
>> delay-checkpoint flag before rather than after calling
>> FreeSpaceMapVacuumRange. Since the free space map is not fully
>> WAL-logged, anything we're doing there should be non-critical. Also, I
> Agreed and fixed.
>
>> think it might be better if MarkBufferDirtyHint stays closer to the
>> existing coding and just uses a Boolean and an if-test to decide
>> whether to clear the bit, instead of inventing a new mechanism. I
>> don't really see anything wrong with the new mechanism, but I think
>> it's better to keep the patch minimal.
> Yeah, that was a a kind of silly. Fixed.
>
>> As you say, this doesn't fix the problem that truncation might fail.
>> But as Andres and Sawada-san said, the solution to that is to get rid
>> of the comments saying that it's OK for truncation to fail and make it
>> a PANIC. However, I don't think that change needs to be part of this
>> patch. Even if we do that, we still need to do this. And even if we do
>> this, we still need to do that.
> Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate.
>
> + * Delay the concurrent checkpoint's completion until this truncation
> + * successfully completes, so that we don't establish a redo-point between
> + * buffer deletion and file-truncate. Otherwise we can leave inconsistent
> + * file content against the WAL records after the REDO position and future
> + * recovery fails.
>
> However, a problem for me for now is that I cannot reproduce the
> problem.
>
> To avoid further confusion, the attached is named as *.patch.
>
> regards.
>
Hi. This is my first attempt to review a patch so feel free to tell me
if I missed something.

As of today's state of REL_14_STABLE
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible
using the script provided by Daniel Wood in this
(1335373813(dot)287510(dot)1573611814107(at)connect(dot)xfinity(dot)com) message. Also, the
latest patch seems not to be applicable and requires some minor tweaks.

Regards,

Daniel Shelepanov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-01-24 20:35:04 TAP tests, directories and parameters
Previous Message Robert Haas 2022-01-24 20:33:01 Re: CREATEROLE and role ownership hierarchies