Re: Corruption during WAL replay

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Teja Mupparti <tejeswarm(at)hotmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masahiko(dot)sawada(at)2ndquadrant(dot)com" <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "hexexpert(at)comcast(dot)net" <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2020-08-17 11:05:37
Message-ID: 7555d464-5bf7-7d6f-febd-0df9bcffe2a7@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 14/04/2020 22:04, Teja Mupparti wrote:
> Thanks Kyotaro and Masahiko for the feedback. I think there is a
> consensus on the critical-section around truncate,

+1

> but I just want to emphasize the need for reversing the order of the
> dropping the buffers and the truncation.
>
>  Repro details (when full page write = off)
>
>          1) Page on disk has empty LP 1, Insert into page LP 1
>          2) checkpoint START (Recovery REDO eventually starts here)
>          3) Delete all rows on the page (page is empty now)
>          4) Autovacuum kicks in and truncates the pages
>                  DropRelFileNodeBuffers - Dirty page NOT written, LP 1
> on disk still empty
>          5) Checkpoint completes
>          6) Crash
>          7) smgrtruncate - Not reached (this is where we do the
> physical truncate)
>
>  Now the crash-recovery starts
>
>          Delete-log-replay (above step-3) reads page with empty LP 1
> and the delete fails with PANIC (old page on disk with no insert)
>
> Doing recovery, truncate is even not reached, a WAL replay of the
> truncation will happen in the future but the recovery fails (repeatedly)
> even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly,
it requires scanning the buffers that are dropped twice, which can take
a long time. I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a
checkpoint, right after the redo point has been established. See
comments around the GetVirtualXIDsDelayingChkpt() function call in
CreateCheckPoint(). We could have a similar mechanism to wait out the
truncations before *completing* a checkpoint.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ibrar Ahmed 2020-08-17 11:26:40 Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
Previous Message Thomas Munro 2020-08-17 11:00:34 Optimising compactify_tuples()