Re: new option to allow pg_rewind to run without full_page_writes

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jérémie Grauer <jeremie(dot)grauer(at)cosium(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: new option to allow pg_rewind to run without full_page_writes
Date: 2022-11-08 00:04:36
Message-ID: CA+hUKG+K-cc+LLn=Ys6ivf-+AqyHqd1ycsPHYRLo9oW3PbCDTQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer
<jeremie(dot)grauer(at)cosium(dot)com> wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it to run into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn page (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:

1. Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.

2. Here we are using full_page_writes to handle lack of atomicity
when there are concurrent reads and writes to the same file from
different threads. Basically, by turning on full_page_writes we say
that we don't trust any block that might have been written to during
the copying. Again, ZFS already handles that for itself: it uses
range locking in the read and write paths (see zfs_rangelock_enter()
in zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.
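
To make "aligned with PostgreSQL's pages" concrete, here is a minimal sketch of the arithmetic, not code from pg_rewind or ZFS. The names is_page_aligned() and expand_to_pages() are hypothetical, and BLCKSZ is assumed to be the default 8192. The idea is that a request only lines up with a page-sized range lock if it starts and ends on page boundaries, so an arbitrary byte range has to be widened to the pages that enclose it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default block size */

/* Hypothetical helper: does [offset, offset + len) cover only whole pages? */
int
is_page_aligned(off_t offset, size_t len)
{
    return offset % BLCKSZ == 0 && len > 0 && len % BLCKSZ == 0;
}

/*
 * Hypothetical helper: widen an arbitrary byte range to the smallest
 * page-aligned range that contains it.
 */
void
expand_to_pages(off_t offset, size_t len,
                off_t *aligned_offset, size_t *aligned_len)
{
    off_t       start;
    off_t       end;

    assert(offset >= 0 && len > 0);

    /* Round the start down and the end up to a BLCKSZ boundary. */
    start = offset - (offset % BLCKSZ);
    end = offset + (off_t) len;
    end = ((end + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;

    *aligned_offset = start;
    *aligned_len = (size_t) (end - start);
}
```

For example, a request for 8192 bytes at offset 4096 straddles two pages, so it would be widened to 16384 bytes at offset 0.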

Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't. That's what the
2nd use of full_page_writes protects against. I think it must be possible
to show the same breakage on ZFS too, *if* the file regions arriving in
system calls are NOT page-aligned. As Andres points out, <stdio.h>
buffered IO streams create a risk there: we have no idea what system
calls are reaching ZFS, so it doesn't seem safe to turn off full page
writes unless you also fix that.

> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.
>
> I may be missing something here: how does full_page_writes prevents
> short _reads_ ?

I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned. Then I think you'd be safe?
Because zfs_read() does:

    /*
     * Lock the range against changes.
     */
    zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
        zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);

So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.
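
A minimal sketch of what that paranoid read path might look like, assuming plain POSIX open()/pread() in place of <stdio.h> streams. This is not the pg_rewind code: open_unbuffered(), read_page_aligned() and MAX_PAGE_RETRIES are hypothetical names, and BLCKSZ is assumed to be 8192. On a short read it never fetches just the missing tail. It re-issues the whole page-aligned pread(), so every system call that reaches the filesystem covers exactly one page.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default block size */
#define MAX_PAGE_RETRIES 3      /* hypothetical bound on whole-page retries */

/* Open with open(), not fopen(), so we control which system calls are made. */
int
open_unbuffered(const char *path)
{
    return open(path, O_RDONLY);
}

/*
 * Read page 'blkno' into 'buf' (BLCKSZ bytes), one aligned pread() per
 * attempt.  Returns BLCKSZ for a full page, a smaller count if the file
 * still ends inside the page after all retries (the tail of 'buf' is
 * zero-filled), or -1 on error.
 */
ssize_t
read_page_aligned(int fd, off_t blkno, char *buf)
{
    off_t       offset = blkno * (off_t) BLCKSZ;
    ssize_t     n = 0;

    assert(buf != NULL);

    for (int attempt = 0; attempt < MAX_PAGE_RETRIES; attempt++)
    {
        n = pread(fd, buf, BLCKSZ, offset);
        if (n == BLCKSZ)
            return n;           /* got the whole page in one call */
        if (n < 0 && errno != EINTR)
            return -1;          /* real error */

        /*
         * Short read or EINTR: do not read only the remainder, loop and
         * re-read the whole page from its start.
         */
    }

    if (n < 0)
        return -1;

    /* Still short: treat as end of file and zero-fill the rest. */
    memset(buf + n, 0, BLCKSZ - (size_t) n);
    return n;
}
```

Whether a persistent short read should be treated as end of file or as an error is a policy choice this sketch does not settle. A caller that knows the expected file size could reject anything other than BLCKSZ.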

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B19bZKidSiWmMsDmgUVe%3D_rr0m57LfR%2BnAbWprVDd_cw%40mail.gmail.com
