Re: new option to allow pg_rewind to run without full_page_writes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jérémie Grauer <jeremie(dot)grauer(at)cosium(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: new option to allow pg_rewind to run without full_page_writes
Date: 2022-11-06 02:38:19
Message-ID: 20221106023819.tpmvqa6kuy4cvtc7@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it to run into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn page (maybe btrfs too, but I don't know for sure for this
> one).

Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.

Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented. We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.
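
To make the failure mode concrete, here is a toy, single-threaded simulation
of a reader that fetches an 8k page as two 4k chunks while a full-page write
lands in between. Nothing here is PostgreSQL code: the names (write_page,
read_chunk, simulate_chunked_read, ...) and the "version byte" used to detect
tearing are made up for illustration only.

```c
#include <assert.h>
#include <string.h>

#define PG_PAGE_SIZE 8192   /* PostgreSQL's default block size */
#define CHUNK_SIZE   4096   /* the 4k read size mentioned above */

/* Overwrite the whole page with one byte value, standing in for a "version". */
void
write_page(char *page, char version)
{
    assert(page != NULL);
    memset(page, version, PG_PAGE_SIZE);
}

/* Copy one CHUNK_SIZE piece of the page into the reader's private copy. */
void
read_chunk(const char *page, char *copy, int chunk_no)
{
    memcpy(copy + chunk_no * CHUNK_SIZE,
           page + chunk_no * CHUNK_SIZE,
           CHUNK_SIZE);
}

/* In this toy model a copy is torn if its two halves carry different versions. */
int
page_is_torn(const char *copy)
{
    return copy[0] != copy[PG_PAGE_SIZE - 1];
}

/* Reader uses two 4k reads; the writer replaces the page between them. */
int
simulate_chunked_read(void)
{
    static char page[PG_PAGE_SIZE];
    static char copy[PG_PAGE_SIZE];

    write_page(page, 1);
    read_chunk(page, copy, 0);  /* first half: version 1 */
    write_page(page, 2);        /* concurrent write of the full page */
    read_chunk(page, copy, 1);  /* second half: version 2 */
    return page_is_torn(copy);
}

/* Reader copies the page in one step; the write happens before or after. */
int
simulate_single_read(void)
{
    static char page[PG_PAGE_SIZE];
    static char copy[PG_PAGE_SIZE];

    write_page(page, 1);
    memcpy(copy, page, PG_PAGE_SIZE);
    write_page(page, 2);
    return page_is_torn(copy);
}
```

The write itself is atomic in this model, yet the chunked reader still ends up
with a mixed page. That is the sense in which the filesystem's guarantees for
writes do not help here.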

Also, for reasons I don't understand we use C streaming IO for
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks. Afaict we really shouldn't use file streams here, then we'd at least
have control over that aspect.
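
As a rough sketch of the difference (not the actual pg_read_binary_file()
code), compare a stream-based page read with a descriptor-based one. The
function names and the PG_PAGE_SIZE constant are hypothetical. Note that a
single pread() only gives us control over the request size; whether the
kernel and filesystem serve it atomically with respect to concurrent writes
is a separate question.

```c
#define _XOPEN_SOURCE 700   /* for pread(), fseeko(), fileno() */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PG_PAGE_SIZE 8192   /* PostgreSQL's default block size */

/*
 * Stream-based read: we ask for one page, but the number and size of the
 * underlying read() calls are chosen by the stdio implementation and its
 * buffer, not by us. Returns 0 on success, -1 on error or short read.
 */
int
read_page_stream(FILE *fp, unsigned int blkno, char *buf)
{
    assert(fp != NULL && buf != NULL);

    if (fseeko(fp, (off_t) blkno * PG_PAGE_SIZE, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, PG_PAGE_SIZE, fp) == (size_t) PG_PAGE_SIZE ? 0 : -1;
}

/*
 * Descriptor-based read: exactly one pread() call requesting the full page.
 * A short read is reported to the caller instead of being completed by a
 * second call, since stitching two reads together is what allows tearing.
 * Returns 0 if the whole page arrived in one call, -1 otherwise.
 */
int
read_page_once(int fd, unsigned int blkno, char *buf)
{
    ssize_t n;

    assert(buf != NULL);

    n = pread(fd, buf, PG_PAGE_SIZE, (off_t) blkno * PG_PAGE_SIZE);
    return n == (ssize_t) PG_PAGE_SIZE ? 0 : -1;
}
```

With the second variant the caller at least sees every short read and can
decide what to do about it.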

Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.

This isn't a fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).
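
A very reduced model of that idea, assuming a single lock per cached page:
anything that modifies or writes out the page and anything that copies it
for the reader must hold the same lock, so the reader can never observe a
half-updated page. ToyBuffer and the toy_* functions are invented for this
sketch; the real buffer manager (content locks, pins, in-progress I/O) and
the WAL flushing question are not modeled at all.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define PG_PAGE_SIZE 8192   /* PostgreSQL's default block size */

/* Hypothetical stand-in for one page held in shared buffers. */
typedef struct ToyBuffer
{
    pthread_mutex_t lock;               /* protects data[] */
    char            data[PG_PAGE_SIZE];
} ToyBuffer;

/* Set up the lock and fill the page with an initial "version" byte. */
void
toy_buffer_init(ToyBuffer *buf, char version)
{
    assert(buf != NULL);
    pthread_mutex_init(&buf->lock, NULL);
    memset(buf->data, version, PG_PAGE_SIZE);
}

/* Writer side: the whole-page update happens with the lock held. */
void
toy_modify_page(ToyBuffer *buf, char version)
{
    pthread_mutex_lock(&buf->lock);
    memset(buf->data, version, PG_PAGE_SIZE);
    pthread_mutex_unlock(&buf->lock);
}

/*
 * Reader side: copy the page under the same lock, so the copy reflects the
 * state either before or after a concurrent toy_modify_page(), never a mix.
 * The page comes from memory, so no read from the OS / disk is needed.
 */
void
toy_read_page(ToyBuffer *buf, char *dst)
{
    pthread_mutex_lock(&buf->lock);
    memcpy(dst, buf->data, PG_PAGE_SIZE);
    pthread_mutex_unlock(&buf->lock);
}
```

A shared/exclusive lock would allow several readers at once; a plain mutex
just keeps the sketch short.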

Greetings,

Andres Freund
