Re: new option to allow pg_rewind to run without full_page_writes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jérémie Grauer <jeremie(dot)grauer(at)cosium(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: new option to allow pg_rewind to run without full_page_writes
Date: 2022-11-08 00:34:20
Message-ID: 20221108003420.tvmo367pjwxynnf4@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-11-08 00:07:09 +0100, Jérémie Grauer wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> > > Currently pg_rewind refuses to run if full_page_writes is off. This is to
> > > prevent it to run into a torn page during operation.
> > >
> > > This is usually a good call, but some file systems like ZFS are naturally
> > > immune to torn page (maybe btrfs too, but I don't know for sure for this
> > > one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Unfortunately not relevant for pg_rewind due to the issues mentioned
subsequently.

> > Right now, that definitely allows for torn reads, because of the way
> > pg_read_binary_file() is implemented. We only ensure a 4k read size from the
> > view of our code, which obviously can lead to torn 8k page reads, no matter
> > what the filesystem guarantees.
> >
> > Also, for reasons I don't understand we use C streaming IO or
> > pg_read_binary_file(), so you'd also need to ensure that the buffer size used
> > by the stream implementation can't cause the reads to happen in smaller
> > chunks. Afaict we really shouldn't use file streams here, then we'd at least
> > have control over that aspect.
> >
> >
> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are neededI may be missing something
> > here: how does full_page_writes prevents

> short _reads_ ?

Yes.

> Presumably, if we do something like read the first 4K of a file, then change
> the file, then read the next 4K, the second 4K may be a torn read.

Correct.

> But I fail to see how full_page_writes prevents this since it only act on writes

It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.

I suspect to avoid the need for this we'd need to atomically read all the
pages involved in a WAL record (presumably by locking the pages against
IO). That'd then safely allow skipping replay of WAL records based on the LSN.a

A slightly easier thing would be to force-enable full page writes just for the
duration of a rewind, similar to what we do during base backups. But that'd
still require a bunch more work than done here.

> > This isn't an fundamental issue - we could have a version of
> > pg_read_binary_file() for relation data that prevents the page being written
> > out concurrently by locking the buffer page. In addition it could often avoid
> > needing to read the page from the OS / disk, if present in shared buffers
> > (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> > flush the WAL in those).
> > I agree, but this would need a differen patch, which may be beyond my
> skills.
> > Greetings,
> >
> > Andres Freund
> Anyway, ZFS will act like full_page_writes is always active, so isn't the
> proposed modification to pg_rewind valid?

No. This really isn't about the crash safety aspects of full page writes, so
the fact that ZFS is used is just not really relevant.

Regards,

Andres

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-11-08 00:34:27 Re: Allow single table VACUUM in transaction block
Previous Message Peter Smith 2022-11-08 00:18:36 Re: [DOCS] Stats views and functions not in order?