Re: new option to allow pg_rewind to run without full_page_writes

From:	Jérémie Grauer <jeremie(dot)grauer(at)cosium(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: new option to allow pg_rewind to run without full_page_writes
Date:	2022-11-07 23:07:09
Message-ID:	e3184432-54b8-5420-f2a0-b26e7a4652e0@cosium.com
Lists:	pgsql-hackers

Hello,

First, thank you for reviewing.

ZFS writes files in increments of the recordsize configured for the
current filesystem dataset.

So with a recordsize that is a multiple of 8K, you can't get torn pages
on writes, which is why full_page_writes can be safely deactivated on
ZFS (the usual advice is to configure ZFS with a recordsize of 8K for
postgres, but on some workloads it can actually be beneficial to go to a
higher multiple of 8K).

On 06/11/2022 03:38, Andres Freund wrote:
> Hi,
>
> On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
>> Currently pg_rewind refuses to run if full_page_writes is off. This is to
>> prevent it to run into a torn page during operation.
>>
>> This is usually a good call, but some file systems like ZFS are naturally
>> immune to torn page (maybe btrfs too, but I don't know for sure for this
>> one).
>
> Note that this isn't about torn pages in case of crashes, but about reading
> pages while they're being written to.
Like I wrote above, ZFS will prevent torn pages on writes, like
full_page_writes does.
>
> Right now, that definitely allows for torn reads, because of the way
> pg_read_binary_file() is implemented. We only ensure a 4k read size from the
> view of our code, which obviously can lead to torn 8k page reads, no matter
> what the filesystem guarantees.
>
> Also, for reasons I don't understand we use C streaming IO or
> pg_read_binary_file(), so you'd also need to ensure that the buffer size used
> by the stream implementation can't cause the reads to happen in smaller
> chunks. Afaict we really shouldn't use file streams here, then we'd at least
> have control over that aspect.
>
>
> Does ZFS actually guarantee that there never can be short reads? As soon as
> they are possible, full page writes are needed
I may be missing something here: how does full_page_writes prevent
short _reads_?

Presumably, if we do something like read the first 4K of a file, then
change the file, then read the next 4K, the second 4K may be a torn
read. But I fail to see how full_page_writes prevents this, since it
only acts on writes.
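
To make that scenario concrete, here is a minimal sketch in C. It is
illustrative only and is not taken from pg_rewind, pg_read_binary_file()
or the attached patch. The names demo_torn_read, write_page,
DEMO_PAGE_SIZE and DEMO_CHUNK_SIZE are made up for this example, and the
"concurrent" writer is simulated by a write issued between the two reads
in the same process, so the outcome is deterministic.

```c
#define _XOPEN_SOURCE 700       /* for mkstemp, pread, pwrite */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE  8192    /* PostgreSQL's default block size */
#define DEMO_CHUNK_SIZE 4096    /* read size assumed in the discussion */

/* Overwrite the single page at offset 0 with one repeated byte value. */
static int
write_page(int fd, char fill)
{
    char page[DEMO_PAGE_SIZE];

    memset(page, fill, sizeof(page));
    return pwrite(fd, page, sizeof(page), 0) == (ssize_t) sizeof(page) ? 0 : -1;
}

/*
 * Read one 8K page as two 4K chunks, with a full-page rewrite happening
 * between the two reads.
 *
 * Returns 1 if the assembled page is torn (halves from different
 * versions), 0 if it is consistent, -1 on I/O error.
 */
int
demo_torn_read(void)
{
    char path[] = "/tmp/torn_read_demo_XXXXXX";
    char buf[DEMO_PAGE_SIZE];
    int  torn = -1;
    int  fd;

    assert(DEMO_CHUNK_SIZE * 2 == DEMO_PAGE_SIZE);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */

    if (write_page(fd, 'A') == 0 &&
        /* reader: first half of the page, old version */
        pread(fd, buf, DEMO_CHUNK_SIZE, 0) == DEMO_CHUNK_SIZE &&
        /* writer: whole page replaced, even if atomically */
        write_page(fd, 'B') == 0 &&
        /* reader: second half of the page, new version */
        pread(fd, buf + DEMO_CHUNK_SIZE, DEMO_CHUNK_SIZE,
              DEMO_CHUNK_SIZE) == DEMO_CHUNK_SIZE)
        torn = (buf[0] != buf[DEMO_PAGE_SIZE - 1]);

    close(fd);
    return torn;
}
```

Calling demo_torn_read() here yields a buffer whose first half is 'A'
and second half is 'B', even though each individual write replaced the
full 8K page. The sketch only shows the read-side interleaving; it says
nothing about what ZFS or any other filesystem guarantees for a single
8K read.
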
> This isn't an fundamental issue - we could have a version of
> pg_read_binary_file() for relation data that prevents the page being written
> out concurrently by locking the buffer page. In addition it could often avoid
> needing to read the page from the OS / disk, if present in shared buffers
> (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> flush the WAL in those).
>
I agree, but this would need a different patch, which may be beyond my
skills.
> Greetings,
>
> Andres Freund
Anyway, ZFS will act like full_page_writes is always active, so isn't
the proposed modification to pg_rewind valid?

You'll find attached a second version of the patch, which is cleaner
(removed double negation).

Regards,
Jérémie Grauer

Attachment	Content-Type	Size
v2-0001-adds-the-option-no-ensure-full-page-writes-to-pg_rew.patch	text/x-patch	9.1 KB

In response to

Re: new option to allow pg_rewind to run without full_page_writes at 2022-11-06 02:38:19 from Andres Freund

Responses

Re: new option to allow pg_rewind to run without full_page_writes at 2022-11-08 00:04:36 from Thomas Munro
Re: new option to allow pg_rewind to run without full_page_writes at 2022-11-08 00:34:20 from Andres Freund
