Re: Corruption during WAL replay

From: Teja Mupparti <tejeswarm(at)hotmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masahiko(dot)sawada(at)2ndquadrant(dot)com" <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "hexexpert(at)comcast(dot)net" <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2020-04-14 19:04:07
Message-ID: BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
Lists: pgsql-hackers

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical section around the truncation, but I just want to emphasize the need to reverse the order of dropping the buffers and performing the truncation.

Repro details (with full_page_writes = off)

1) Page on disk has an empty LP 1; an insert fills LP 1
2) Checkpoint STARTs (recovery REDO eventually starts here)
3) Delete all rows on the page (the page is now empty)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - the dirty page is NOT written, so LP 1 on disk is still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - not reached (this is where we do the physical truncate)

Now crash recovery starts.

Replay of the delete (step 3 above) reads the page with the empty LP 1, and the delete fails with a PANIC (the old page on disk never received the insert).

During recovery the truncation is not even reached; a WAL replay of the truncation would happen later, but recovery fails (repeatedly) before getting to that point.
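
To make the window explicit, here is a simplified sketch of steps 4-7 (illustration only, not the actual PostgreSQL source; the helper names are stand-ins for RelationTruncate's WAL logging, DropRelFileNodeBuffers, and smgrtruncate):

    /* Stand-in steps; the real code lives in storage.c / bufmgr.c / smgr.c */
    static void wal_log_truncate(void)   { /* XLOG_SMGR_TRUNCATE record   */ }
    static void drop_rel_buffers(void)   { /* discard buffers, no writes  */ }
    static void physical_truncate(void)  { /* shrink the file on disk     */ }

    static void
    vacuum_truncate_sketch(void)
    {
        wal_log_truncate();      /* step 4: WAL already says "truncated"   */
        drop_rel_buffers();      /* step 4: dirty page with the new LP 1
                                  * is thrown away; disk keeps empty LP 1  */

        /* steps 5-6: checkpoint completes, then CRASH happens here        */

        physical_truncate();     /* step 7: never reached, so the stale
                                  * page survives and delete replay PANICs */
    }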

Best regards,
Teja

________________________________
From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Sent: Monday, April 13, 2020 7:35 PM
To: masahiko(dot)sawada(at)2ndquadrant(dot)com <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: andres(at)anarazel(dot)de <andres(at)anarazel(dot)de>; tejeswarm(at)hotmail(dot)com <tejeswarm(at)hotmail(dot)com>; pgsql-hackers(at)postgresql(dot)org <pgsql-hackers(at)postgresql(dot)org>; hexexpert(at)comcast(dot)net <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote in
> On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm(at)hotmail(dot)com> wrote:
> > > /*
> > > * We WAL-log the truncation before actually truncating, which means
> > > * trouble if the truncation fails. If we then crash, the WAL replay
> > > * likely isn't going to succeed in the truncation either, and cause a
> > > * PANIC. It's tempting to put a critical section here, but that cure
> > > * would be worse than the disease. It would turn a usually harmless
> > > * failure to truncate, that might spell trouble at WAL replay, into a
> > > * certain PANIC.
> > > */
> >
> > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> > log something and then not perform the action. This leads to the primary
> > / replica getting out of sync, crash recovery potentially not completing
> > (because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by commit 3396000684, for 8.4, so it can be
called an oversight made when log-shipping was introduced.

The reason other operations like INSERT (which extend the underlying
file) are "safe" after an extension failure is that the subsequent
operations are performed in shared buffers as if the new page existed,
and then try to extend the file again. So if we wanted to continue
working after a truncation failure, we would need to make shared
buffers behave as if the truncated pages were already gone. But we
don't have room for another flag in the buffer header. For example,
BM_DIRTY && !BM_VALID might be usable as the state "this page should
have been truncated but the truncation has not succeeded yet", but I'm
not sure.

Anyway, I think the prognosis of a truncation failure is far more
hopeless than that of an extension failure in most cases, and I doubt
it is a good idea to introduce such a complex feature only to overcome
such a hopeless situation.

In short, I think we should PANIC in that case.

> > > As a second idea, I wonder if we can defer truncation until commit
> > > time like smgrDoPendingDeletes mechanism. The sequence would be:
> >
> > This is mostly an issue during [auto]vacuum partially truncating the end
> > of the file. We intentionally release the AEL regularly to allow other
> > accesses to continue.
> >
> > For transactional truncations we don't go down this path (as we create a
> > new relfilenode).
> >
> >
> > > At RelationTruncate(),
> > > 1. WAL logging.
> > > 2. Remember buffers to be dropped.
> >
> > You definitely cannot do that, as explained above.
>
> Ah yes, you're right.
>
> So it seems to me that what we can currently do for this issue is to
> enclose the truncation operation in a critical section. IIUC it's not
> enough just to reverse the order of dropping buffers and physical file
> truncation, because that cannot solve the problem of inconsistency on
> the standby. And as Horiguchi-san mentioned, there is no need to
> reverse that order if we wrap the truncation operation in a critical
> section, because we can recover page changes during crash recovery.
> The strategy of writing out all dirty buffers before dropping them,
> proposed as (a) in [1], also seems insufficient.

Agreed. Since neither WAL-logging-then-not-performing nor
performing-then-WAL-logging is acceptable, there is no other way than
to work as if the truncation succeeded (and retry it) even if it
actually failed. But that would be too complex.

Just making it a critical section seems the right thing here.
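
For concreteness, a minimal sketch of that, assuming the whole sequence
in RelationTruncate() is what gets wrapped (the WAL and smgr calls are
summarized as comments rather than spelled out):

    /*
     * Minimal sketch, not the actual source: while CritSectionCount > 0
     * (i.e. inside the critical section), any elog(ERROR) is promoted to
     * PANIC, so we can never WAL-log the truncation and then silently
     * fail to perform it.
     */
    START_CRIT_SECTION();

    /* 1. Emit the XLOG_SMGR_TRUNCATE record.                  */
    /* 2. Drop the to-be-truncated pages from shared buffers   */
    /*    (DropRelFileNodeBuffers).                            */
    /* 3. Physically truncate the file (smgrtruncate).         */

    END_CRIT_SECTION();

A failure at any of those points then takes the server down and lets
crash recovery redo the whole thing, instead of leaving WAL ahead of
the data files.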

> [1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
