Re: Corruption during WAL replay

From: Andres Freund <andres(at)anarazel(dot)de>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: Teja Mupparti <tejeswarm(at)hotmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "hexexpert(at)comcast(dot)net" <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2020-04-13 08:40:31
Message-ID: 20200413084031.krnrhh74qtywsz2f@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm(at)hotmail(dot)com> wrote:
> >
> > Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.
> >
>
> I briefly looked at the latest patch but I'm not sure it's the right
> thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> example, with the following code you changed, we will always end up
> with emitting a PANIC "failed to truncate the relation" regardless of
> the actual cause of the error.
>
> + PG_CATCH();
> + {
> + ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("failed to truncate the relation")));
> + }
> + PG_END_TRY();
>
> And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

> /*
> * We WAL-log the truncation before actually truncating, which means
> * trouble if the truncation fails. If we then crash, the WAL replay
> * likely isn't going to succeed in the truncation either, and cause a
> * PANIC. It's tempting to put a critical section here, but that cure
> * would be worse than the disease. It would turn a usually harmless
> * failure to truncate, that might spell trouble at WAL replay, into a
> * certain PANIC.
> */

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to to primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

> As a second idea, I wonder if we can defer truncation until commit
> time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

> At RelationTruncate(),
> 1. WAL logging.
> 2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

> At CommitTransaction(),
> 3. Revisit the remembered buffers to check if the buffer still has
> table data that needs to be truncated.
> 4-a, If it has, we mark it as IO_IN_PROGRESS.
> 4-b, If it already has different table data, ignore it.
> 5, Truncate physical files.
> 6, Mark the buffer we marked at #4-a as invalid.
>
> If an error occurs between #3 and #6 or in abort case, we revert all
> IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-04-13 08:49:24 Re: pg_basebackup, manifests and backends older than ~12
Previous Message tushar 2020-04-13 08:37:14 Re: [Proposal] Global temporary tables