Re: patch to allow disable of WAL recycling

From: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2019-03-26 19:24:47
Message-ID: CACPQ5Fp=aSnTKB3nZD=H=2-RUU8eSTy-7c+hcLQ5WQkfEtoYAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 7, 2019 at 6:26 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>
> wrote:
> > On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> wrote:
> >> My understanding is that it's not really the COW-ness that makes it
> >> not necessary, it's the fact that fdatasync() doesn't do anything
> >> different from fsync() on ZFS and there is no situation where
> >> fdatasync() succeeds, you lose power, you come back up and find that
> >> the file size is wrong or a hole in the middle of the file has come
> >> back from the dead, and you lost the data. The whole concept of "data
> >> sync" implies that file meta-data and file contents are cached and
> >> synchronized separately and you can deliberately ask for weaker
> >> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
> >> have, and couldn't benefit from because it's just going to write stuff
> >> in its tidy sequential log in the form of all-or-nothing transactions
> >> anyway. I don't know if that's also true for eg BTRFS or any other
> >> COW filesystem that might be out there, but I don't know why you'd
> >> want to mention COW instead of wal_sync_mode as the motivation when
> >> the source code comment know better.
> >
> >
> > Hopefully I am not misinterpreting your comment here, but I'm not sure I
> fully agree with that assessment. I can't speak for other filesystems, but
> for ZFS, none of the zero-filled blocks will actually be written to disk,
> but that determination happens fairly late in the process, after the
> thousands of write system calls have been processed. So on ZFS, these
> writes are basically useless, except as a way to increase the size of the
> file. No disk space is actually allocated. However, on any COW filesystem,
> any write to any of these zero-filled blocks will have to allocate a new
> block, so nothing about "preallocating space" has been accomplished by all
> of these system calls. At least, preallocating space is my understanding of
> why the zero-fill is currently being performed.
>
> It seems like you're focusing on the performance and I'm focusing on
> the safety. Obviously it's a complete waste of time to try to
> "preallocate" space on COW filesystems since they will not reuse that
> space anyway by definition. My point was that it may be unsafe to
> turn if off when configured to use fdatasync() for later writes to the
> file on filesystems that make fewer durability guarantees with
> fdatasync() than with a full fsync(), and that seems like another
> newsworthy angle on this for end users to know about. I dunno, maybe
> those things are so closely linked that it's OK to write just "only
> turn it off on COW filesystems", but I'm wondering why we don't
> mention the actual reason for the feature when we make that claim in
> the comments.
>
> Hmm... digging a bit further. So, those comments in xlog.c date from
> 2013/2014 when this stuff was going down:
>
> https://lkml.org/lkml/2012/9/3/83
>
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
> http://www.openldap.org/lists/openldap-devel/201411/msg00002.html
>
> So that was not actually the intended behaviour of fdatasync(), but
> rather a bug in ext3/4 that's been fixed now. POSIX says "File
> attributes that are not necessary for data retrieval (access time,
> modification time, status change time) need not be successfully
> transferred prior to returning to the calling process.", and the Linux
> man page says that it "does not flush modified metadata unless that
> metadata is needed in order to allow a subsequent data retrieval to be
> correctly handled", so... if there are no OS bugs, the comments in
> xlog.c are overly pessimistic and the only effect of using fdatasync()
> instead of fsync() to to avoid extra IOs for mtime etc.
>
> I still like the pessimism in the code. But OK, I withdraw my
> complaint about that sentence in the documentation for now! :-)

I haven't heard any additional feedback in the last couple of weeks, so I
wanted to see if there is anything else needed for this patch? I did update
the code to use pg_pwrite. A new version of the patch is attached. The only
difference from the previous version is this diff:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3283,8 +3283,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent,
bool use_lock)
*/
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
- fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t)
0 ||
- (int) write(fd, zbuffer.data, 1) != (int) 1;
+ fail = pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1)
!=
+ (ssize_t) 1;
pgstat_report_wait_end();
}

The latest patch is rebased, builds clean, and passes some basic testing.
Please let me know if there is anything else I could do on this.

Thanks,
Jerry

Attachment Content-Type Size
0001-wal_recycle-and-wal_init_zero.patch application/octet-stream 9.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2019-03-26 19:26:23 Re: [HACKERS] generated columns
Previous Message Peter Geoghegan 2019-03-26 19:17:01 Re: Enable data checksums by default