Re: patch to allow disable of WAL recycling

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch to allow disable of WAL recycling
Date: 2019-03-08 01:25:48
Message-ID: CA+hUKGKSYZTyyVO5scCSV1UD0E4c73kc6BnfqKr+d7D6barYYg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek <jerry(dot)jelinek(at)joyent(dot)com> wrote:
> On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> My understanding is that it's not really the COW-ness that makes it
>> not necessary, it's the fact that fdatasync() doesn't do anything
>> different from fsync() on ZFS and there is no situation where
>> fdatasync() succeeds, you lose power, you come back up and find that
>> the file size is wrong or a hole in the middle of the file has come
>> back from the dead, and you lost the data. The whole concept of "data
>> sync" implies that file meta-data and file contents are cached and
>> synchronized separately and you can deliberately ask for weaker
>> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
>> have, and couldn't benefit from because it's just going to write stuff
>> in its tidy sequential log in the form of all-or-nothing transactions
>> anyway. I don't know if that's also true for eg BTRFS or any other
>> COW filesystem that might be out there, but I don't know why you'd
>> want to mention COW instead of wal_sync_mode as the motivation when
>> the source code comment know better.
>
>
> Hopefully I am not misinterpreting your comment here, but I'm not sure I fully agree with that assessment. I can't speak for other filesystems, but for ZFS, none of the zero-filled blocks will actually be written to disk, but that determination happens fairly late in the process, after the thousands of write system calls have been processed. So on ZFS, these writes are basically useless, except as a way to increase the size of the file. No disk space is actually allocated. However, on any COW filesystem, any write to any of these zero-filled blocks will have to allocate a new block, so nothing about "preallocating space" has been accomplished by all of these system calls. At least, preallocating space is my understanding of why the zero-fill is currently being performed.

It seems like you're focusing on the performance and I'm focusing on
the safety. Obviously it's a complete waste of time to try to
"preallocate" space on COW filesystems since they will not reuse that
space anyway by definition. My point was that it may be unsafe to
turn if off when configured to use fdatasync() for later writes to the
file on filesystems that make fewer durability guarantees with
fdatasync() than with a full fsync(), and that seems like another
newsworthy angle on this for end users to know about. I dunno, maybe
those things are so closely linked that it's OK to write just "only
turn it off on COW filesystems", but I'm wondering why we don't
mention the actual reason for the feature when we make that claim in
the comments.

Hmm... digging a bit further. So, those comments in xlog.c date from
2013/2014 when this stuff was going down:

https://lkml.org/lkml/2012/9/3/83
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
http://www.openldap.org/lists/openldap-devel/201411/msg00002.html

So that was not actually the intended behaviour of fdatasync(), but
rather a bug in ext3/4 that's been fixed now. POSIX says "File
attributes that are not necessary for data retrieval (access time,
modification time, status change time) need not be successfully
transferred prior to returning to the calling process.", and the Linux
man page says that it "does not flush modified metadata unless that
metadata is needed in order to allow a subsequent data retrieval to be
correctly handled", so... if there are no OS bugs, the comments in
xlog.c are overly pessimistic and the only effect of using fdatasync()
instead of fsync() to to avoid extra IOs for mtime etc.

I still like the pessimism in the code. But OK, I withdraw my
complaint about that sentence in the documentation for now! :-)

--
Thomas Munro
https://enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-08 01:27:52 Re: pgsql: Removed unused variable, openLogOff.
Previous Message Michael Paquier 2019-03-08 01:23:24 Re: Tighten error control for OpenTransientFile/CloseTransientFile