Re: Sketch of a fix for that truncation data corruption issue

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Sketch of a fix for that truncation data corruption issue
Date: 2018-12-12 01:49:59
Message-ID: CA+Tgmoava0aCNObbT3OMti0c4hmokDr9GSc=uffjxn6Q9DTBMA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 12, 2018 at 6:08 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, if *you're* willing to entertain that possibility, I'm on board.
> That would certainly lead to a much simpler, and probably back-patchable,
> fix.

I think we should, then. Simple is good.

Just thinking about this a bit, the problem with truncating first and
then writing the WAL record is that if the WAL record never makes it
to disk, any physical standbys will end up out of sync with the
master, leading to disaster. But the problem with writing the WAL
record first is that the actual operation might fail, and then
standbys will end up out of sync with the master, leading to disaster.
The obvious way to finesse that latter problem is to just PANIC if
ftruncate() fails -- then we'll crash restart and retry, and if we
still can't do it, well, the DBA will have to fix that before the
system can come online. I'm not sure that's really all that bad --
if we can't truncate, we're kinda hosed. How, other than a
permissions problem, does that even happen?
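
To make the WAL-first ordering concrete, here is a toy Python sketch. It
is not PostgreSQL code: the Panic exception, the in-memory "wal" list and
the function names are all made up for illustration, and appending to the
list stands in for inserting and flushing a real WAL record.

```python
import os
import tempfile


class Panic(Exception):
    """Hypothetical stand-in for a PANIC (crash restart, then retry)."""


def truncate_wal_first(fd, new_size, wal, do_truncate=os.ftruncate):
    # Log the truncation first, so every standby will replay it.
    wal.append(("truncate", new_size))
    try:
        do_truncate(fd, new_size)
    except OSError as exc:
        # The record is already out; carrying on would leave us out of
        # sync with the standbys, so give up loudly instead.
        raise Panic(f"could not truncate to {new_size} bytes") from exc


def failing_truncate(fd, new_size):
    # Simulates ftruncate() failing, e.g. because of a permissions problem.
    raise OSError("simulated ftruncate failure")


def demo():
    wal = []
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 8192 * 4)
        f.flush()
        truncate_wal_first(f.fileno(), 8192, wal)
        size_after = os.fstat(f.fileno()).st_size
        try:
            truncate_wal_first(f.fileno(), 0, wal, do_truncate=failing_truncate)
            panicked = False
        except Panic:
            panicked = True
    return size_after, panicked, wal
```

In this model the record is in the log in both cases; the only question
is whether the local operation follows it or the process gives up.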

Your sketch upthread tries to fix it another way -- write a second
record that says essentially "never mind". But that leads to the
master and the standby not really being in quite equivalent states.
I'm not sure whether that's really OK. If any future operation on the
master depends on some aspect of the page state that wasn't recreated
exactly on the standby, then replay will run into trouble.

I wonder if we could get away with defining a truncation event as
setting all pages beyond the truncation point to all-zeroes, with the
number of those pages that actually exist at the filesystem level as
an accidental detail. So if the master can't ftruncate(), it's also
OK if it just zeroes all the buffers beyond that point. But once it
emits the WAL record, it must do one or the other, or else PANIC. The
standby has the same options.
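
A toy model of that definition, in Python and purely illustrative (the
list-of-pages representation and every name here are assumptions, not
anything in the tree): a relation is a list of pages, and a page past the
physical end reads as all-zeroes. Under that reading, a node that
truncated and a node that only zeroed end up logically equal.

```python
PAGE = 8192
ZERO = bytes(PAGE)


def read_page(rel, n):
    # Pages past the physical end of the file read as all-zeroes.
    return rel[n] if n < len(rel) else ZERO


def apply_truncate(rel, keep, can_ftruncate):
    if can_ftruncate:
        # Physical truncation: the pages stop existing.
        del rel[keep:]
    else:
        # Fallback: leave the file length alone, zero the pages instead.
        for n in range(keep, len(rel)):
            rel[n] = ZERO


def logically_equal(a, b):
    # Compare page by page, ignoring how many pages physically exist.
    n = max(len(a), len(b))
    return all(read_page(a, i) == read_page(b, i) for i in range(n))


def demo():
    master = [bytes([i + 1]) * PAGE for i in range(4)]
    standby = list(master)
    apply_truncate(master, 2, can_ftruncate=False)   # master could only zero
    apply_truncate(standby, 2, can_ftruncate=True)   # standby truncated
    return master, standby
```

The physical lengths differ (4 pages versus 2), but logically_equal()
treats that as the accidental detail described above.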

> > Truncating relations isn't that common of an
> > operation, and also, we could mitigate the impacts by having the scan
> > that identifies the truncation point also write any dirty buffers
> > after that point. We'd have to recheck after upgrading our relation
> > lock, but odds are good that in the normal case we wouldn't add much
> > to the time when we hold the stronger lock.
>
> Hm, not quite following this? We have to lock out writers before we
> try to identify the truncation point.

I thought we made a tentative identification of the truncation point,
upgraded the lock, and then rechecked.
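
Roughly this shape, as a toy Python sketch rather than the real code:
pages are lists of tuples, an empty list is an empty page, and the
concurrent_writer callback is a made-up hook that simulates a write
sneaking in before the stronger lock is taken.

```python
def pages_to_keep(pages):
    # One past the last non-empty page; everything after it is truncatable.
    keep = len(pages)
    while keep > 0 and not pages[keep - 1]:
        keep -= 1
    return keep


def truncate_with_recheck(pages, concurrent_writer=None):
    # Step 1: tentative scan, done while writers are still allowed.
    tentative = pages_to_keep(pages)

    # Writers may still modify the relation at this point.
    if concurrent_writer is not None:
        concurrent_writer(pages)

    # Step 2: (lock upgrade would happen here) recheck before truncating.
    final = pages_to_keep(pages)
    del pages[final:]
    return tentative, final
```

If a writer fills page 2 between the two scans, the tentative answer of
2 pages is corrected to 3 by the recheck and nothing live is cut off.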

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
