From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Jérémie Grauer <jeremie(dot)grauer(at)cosium(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: new option to allow pg_rewind to run without full_page_writes |
Date: | 2022-11-08 00:04:36 |
Message-ID: | CA+hUKG+K-cc+LLn=Ys6ivf-+AqyHqd1ycsPHYRLo9oW3PbCDTQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer
<jeremie(dot)grauer(at)cosium(dot)com> wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it to run into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn page (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.
> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.
Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:
1. Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.
2. Here we are using full_page_writes to handle lack of atomicity
when there are concurrent reads and writes to the same file from
different threads. Basically, by turning on full_page_writes we say
that we don't trust any block that might have been written to during
the copying. Again, ZFS already handles that for itself: it uses
range locking in the read and write paths (see zfs_rangelock_enter()
in zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.
Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't. That's what the
2nd thing is protecting against. I think it must be possible to show
that breaking on ZFS too, *if* the file regions arriving into system
calls are NOT correctly aligned. As Andres points out, <stdio.h>
buffered IO streams create a risk there: we have no idea what system
calls are reaching ZFS, so it doesn't seem safe to turn off full page
writes unless you also fix that.
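
For anyone who wants to poke at the second phenomenon on their own
filesystem, here is a minimal sketch of one way to look for it. It is
not the test referenced in [1], just an illustration: a child process
keeps overwriting one page with all-'A' or all-'B' bytes while the
parent reads it back and counts pages that mix the two. The names
page_is_torn() and count_torn_reads() and the scratch path are made up
for this example, and BLCKSZ is assumed to be the default 8192.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */

/* A page counts as torn if it mixes bytes from two different writes. */
static int
page_is_torn(const char *page)
{
    for (size_t i = 1; i < BLCKSZ; i++)
        if (page[i] != page[0])
            return 1;
    return 0;
}

/*
 * Hypothetical helper: a child rewrites page 0 of "path" with uniform
 * pages while the parent reads it nreads times.  Returns the number of
 * torn pages observed, or -1 if the setup failed.
 */
static long
count_torn_reads(const char *path, long nreads)
{
    static char buf[BLCKSZ];
    long        torn = 0;
    pid_t       writer;
    int         fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 'A', BLCKSZ);
    if (pwrite(fd, buf, BLCKSZ, 0) != BLCKSZ || (writer = fork()) < 0)
    {
        close(fd);
        unlink(path);
        return -1;
    }

    if (writer == 0)
    {
        /* Child: alternate between all-'B' and all-'A' until killed. */
        for (char c = 'B';; c = (c == 'A') ? 'B' : 'A')
        {
            memset(buf, c, BLCKSZ);
            if (pwrite(fd, buf, BLCKSZ, 0) != BLCKSZ)
                _exit(1);
        }
    }

    /* Parent: every read is page-aligned and page-sized. */
    for (long i = 0; i < nreads; i++)
        if (pread(fd, buf, BLCKSZ, 0) == BLCKSZ && page_is_torn(buf))
            torn++;

    kill(writer, SIGKILL);
    waitpid(writer, NULL, 0);
    close(fd);
    unlink(path);
    return torn;
}
```

A non-zero count means the filesystem exposed a mixed page to an
aligned read. A count of zero proves nothing on its own: it may only
mean the race was not hit in that run.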
> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.
> I may be missing something here: how does full_page_writes prevents
> short _reads_ ?
I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned. Then I think you'd be safe?
Because zfs_read() does:
    /*
     * Lock the range against changes.
     */
    zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
        zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);
So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.
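
To make the "go back and pread() the whole page" idea concrete, here is
a rough sketch, not code from pg_rewind or from the patch. The names
read_whole_page() and count_whole_pages() and the retry limit are
invented for illustration, and BLCKSZ is assumed to be the default
8192. The point is only that every system call covers exactly one
aligned page, and a short read is thrown away and reissued for the
whole page rather than topped up with a second, unaligned read.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */
#define MAX_PAGE_RETRIES 5      /* hypothetical limit, for illustration only */

/*
 * Read page number blkno of fd into buf, which must hold BLCKSZ bytes.
 * Returns 1 if a whole page was read by a single pread(), 0 at end of
 * file, and -1 on error or if the read stayed short on every attempt
 * (for example a partial page at the end of the file).
 */
static int
read_whole_page(int fd, off_t blkno, char *buf)
{
    off_t       offset = blkno * (off_t) BLCKSZ;

    assert(blkno >= 0);

    for (int attempt = 0; attempt < MAX_PAGE_RETRIES; attempt++)
    {
        ssize_t     n = pread(fd, buf, BLCKSZ, offset);

        if (n == BLCKSZ)
            return 1;           /* one aligned call returned the page */
        if (n == 0)
            return 0;           /* nothing at this offset: end of file */
        if (n < 0 && errno != EINTR)
            return -1;          /* real I/O error */

        /*
         * Short read or EINTR: discard what we got and reissue the
         * read for the whole page, so no call is ever unaligned.
         */
    }
    return -1;
}

/*
 * Hypothetical caller: read a file page by page.  Returns the number
 * of whole pages, or -1 if any page could not be read in full.
 */
static long
count_whole_pages(const char *path)
{
    static char page[BLCKSZ];
    long        npages = 0;
    int         rc;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((rc = read_whole_page(fd, npages, page)) == 1)
        npages++;
    close(fd);
    return (rc == 0) ? npages : -1;
}
```

Whether that is enough still depends on the filesystem locking the
range for the duration of each call, as ZFS does above. The sketch
only takes care of the alignment half of the argument.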