Quick Links

Re: Corruption during WAL replay

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Teja Mupparti <tejeswarm(at)hotmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masahiko(dot)sawada(at)2ndquadrant(dot)com" <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "hexexpert(at)comcast(dot)net" <hexexpert(at)comcast(dot)net>
Subject:	Re: Corruption during WAL replay
Date:	2020-08-17 18:22:15
Message-ID:	20200817182215.pciaubofr6iujiuo@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> On 14/04/2020 22:04, Teja Mupparti wrote:
> > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > consensus on the critical-section around truncate,
>
> +1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

> > but I just want to emphasize the need for reversing the order of the
> > dropping the buffers and the truncation.
> >
> > Repro details (when full page write = off)
> >
> > 1) Page on disk has empty LP 1, Insert into page LP 1
> > 2) checkpoint START (Recovery REDO eventually starts here)
> > 3) Delete all rows on the page (page is empty now)
> > 4) Autovacuum kicks in and truncates the pages
> > DropRelFileNodeBuffers - Dirty page NOT written, LP 1
> > on disk still empty
> > 5) Checkpoint completes
> > 6) Crash
> > 7) smgrtruncate - Not reached (this is where we do the
> > physical truncate)
> >
> > Now the crash-recovery starts
> >
> > Delete-log-replay (above step-3) reads page with empty LP 1
> > and the delete fails with PANIC (old page on disk with no insert)
> >
> > Doing recovery, truncate is even not reached, a WAL replay of the
> > truncation will happen in the future but the recovery fails (repeatedly)
> > even before reaching that point.
>
> Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
> truncating the file would open a different issue:
>
> 1) Page on disk has empty LP 1, Insert into page LP 1
> 2) checkpoint START (Recovery REDO eventually starts here)
> 3) Delete all rows on the page (page is empty now)
> 4) Autovacuum kicks in and starts truncating
> 5) smgrtruncate() truncates the file
> 6) checkpoint writes out buffers for pages that were just truncated away,
> expanding the file again.
>
> Your patch had a mechanism to mark the buffers as io-in-progress before
> truncating the file to fix that, but I'm wary of that approach. Firstly, it
> requires scanning the buffers that are dropped twice, which can take a long
> time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.

> I remember that people have already complained that
> DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
> once.

But that's when dropping many relations, normally. E.g. at the end of a
regression test.

> More importantly, abusing the BM_IO_INPROGRESS flag for this seems
> bad. For starters, because you're not holding buffer's I/O lock, I
> believe the checkpointer would busy-wait on the buffers until the
> truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that makes io locks into
condition variables. Then we can fairly easily have many many buffers io
locked. Obviously there's some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.

> Perhaps a better approach would be to prevent the checkpoint from
> completing, until all in-progress truncations have completed. We have a
> mechanism to wait out in-progress commits at the beginning of a checkpoint,
> right after the redo point has been established. See comments around the
> GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
> have a similar mechanism to wait out the truncations before *completing* a
> checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund

In response to

Re: Corruption during WAL replay at 2020-08-17 11:05:37 from Heikki Linnakangas

Responses

Re: Corruption during WAL replay at 2020-10-30 16:34:12 from Anastasia Lubennikova
Re: Corruption during WAL replay at 2020-11-06 11:40:54 from Masahiko Sawada
Re: Corruption during WAL replay at 2021-01-06 08:33:27 from Kyotaro Horiguchi

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2020-08-17 18:23:57	Re: EDB builds Postgres 13 with an obsolete ICU version
Previous Message	Andres Freund	2020-08-17 18:11:47	Re: run pgindent on a regular basis / scripted manner