Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-11-19 04:48:05
Message-ID: CAEepm=16aauN3LMHrVZ-uoqU8-k7aoSdGC3t7PghewVVsjUwtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Fri, Nov 9, 2018 at 7:07 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> > <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > > My plan is do a round of testing and review of this stuff next week
> > > once the dust is settled on the current minor releases (including
> > > fixing a few typos I just spotted and some word-smithing). All going
> > > well, I will then push the resulting patches to master and all
> > > supported stable branches, unless other reviews or objections appear.

...

On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I think these patches are looking good now. If I don't spot any other
> problems or hear any objections, I will commit them tomorrow-ish.

Hearing no objections, pushed to all supported branches.

Thank you to Craig for all his work getting to the bottom of this, to
Andres for his open source diplomacy, and the Linux guys for their
change "errseq: Always report a writeback error once" which came out
of that.

Some more comments:

* The promotion of errors from close() to PANIC may or may not be
effective considering that it doesn't have interlocking with
concurrent checkpoints. I'm not sure if it can really happen on local
file systems anyway... this may fall under the category of "making
PostgreSQL work reliably on NFS", a configuration that is not
recommended currently, and a separate project IMV.

* In 9.4 and 9.5 there is no checking of errors from
sync_file_range(), and I didn't add any for now. It was claimed that
sync_file_range() without BEFORE/AFTER can't consume errors[1].
Errors are promoted in 9.6+ for consistency because we already looked
at the return code, so we won't long rely on that knowledge in the
long term.

* I personally believe it is safe to run with data_sync_retry = on on
any file system on FreeBSD, and ZFS on any operating system... but I
see no need to make recommendations about that in the documentation,
other than that you should investigate the behaviour of your operating
system if you really want to turn it on.

* A PANIC (and possibly ensuing crash restart loop if the I/O error is
not transient) is of course a very unpleasant failure mode, but it is
one that we already had for the WAL and control file. So I'm not sure
I'd personally bother to run with the non-default setting even on a
system where I believe it to be safe (considering the low likelihood
that I/O failure is isolated to certain files); at best it probably
gives you a better experience if the fs underneath a non-default
tablespace dies.

* The GUC is provided primarily because this patch is so drastic in
its effect that it seems like we owe our users a way to disable it on
principle, and that seems to outweigh a desire not to add GUCs in
back-branches.

* If I/O errors happen, your system is probably toast and you need to
fail over or restore from backups, but at least we won't tell you any
lies about checkpoints succeeding. In rare scenarios, perhaps
involving a transient failure of virtualised storage with thin
provisioning as originally described by Craig, the system may actually
be able to continue running, and with this change we should now be
able to avoid data loss by recovering from the WAL.

* As noted the commit message, this isn't quite the end of the story.
See the fsync queue redesign thread[2], WIP for master only.

[1] https://www.postgresql.org/message-id/20180430160945.5s5qfoqryhtmugxl%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw(at)mail(dot)gmail(dot)com

--
Thomas Munro
http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2018-11-19 05:06:02 Re: WIP: Avoid creation of the free space map for small tables
Previous Message Michael Paquier 2018-11-19 04:39:58 Re: [HACKERS] Restricting maximum keep segments by repslots