Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Asim R P <apraveen(at)pivotal(dot)io>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-10-18 23:26:39
Message-ID: CAEepm=38QHcGYX0xMaD8-4cJ2zj_mpcKF8AaiFm3uBqUo=8=7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

Let's try to get this issue resolved. Here is my position on the
course of action we should take in back-branches:

1. I am -1 on back-patching the fd-transfer code. It's a significant
change, and even when sufficiently debugged (I don't think it's there
yet), we have no idea what will happen on all the kernels we support
under extreme workloads. IMHO there is no way we can spring this on
users in a point release.

2. I am +1 on back-patching Craig's PANIC-on-failure logic. Doing
nothing is not an option I like. I have some feedback and changes to
propose though; see attached.

Responses to a review from Robert:

On Thu, Jul 19, 2018 at 7:23 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 2. I don't like promote_ioerr_to_panic() very much, partly because the
> same pattern gets repeated over and over, and partly because it would
> be awkwardly-named if we discovered that another 2 or 3 errors needed
> similar handling (or some other variant handling). I suggest instead
> having a function like report_critical_fsync_failure(char *path) that
> does something like this:
>
> int elevel = ERROR;
> if (errno == EIO)
> elevel = PANIC;
> ereport(elevel,
> (errcode_for_file_access(),
> errmsg("could not fsync file \"%s\": %m", path);
>
> And similarly I'd add report_critical_close_failure. In some cases,
> this would remove wording variations (e.g. in twophase.c) but I think
> that's fine, and maybe an improvement, as discussed on another recent
> thread.

I changed it to look like data_sync_elevel(ERROR) and made it treat
all errnos the same. ENOSPC, EIO, EWOK, EIEIO, it makes no difference
to the level of faith I have that my data still exists.

> 3. slru.c calls pg_fsync() but isn't changed by the patch. That looks wrong.

Fixed.

> 4. The comment changes in snapbuild.c interact with the TODO that
> immediately follows. I think more adjustment is needed here.

I don't understand this.

> 5. It seems odd that you adjusted the comment for
> pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
> pg_fsync(). Either pg_fsync_writethrough() doesn't have the same
> problem, in which case, awesome, but let's add a comment, or it does,
> in which case it should refer to the other one. And I think
> pg_fsync() itself needs a comment saying that every caller must be
> careful to use promote_ioerr_to_panic() or
> report_critical_fsync_failure() or whatever we end up calling it
> unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in
scattering descriptions of this problem and references to specific
versions of Linux and -hackers archive links all over the place. I
added a comment in one place, and also added some user documentation
of the problem.

> 6. In md.c, there's a stray blank line added. But more importantly,
> the code just above that looks like this:
>
> if (!FILE_POSSIBLY_DELETED(errno) ||
> failures > 0)
> - ereport(ERROR,
> + ereport(promote_ioerr_to_panic(ERROR),
> (errcode_for_file_access(),
> errmsg("could not fsync file \"%s\": %m",
> path)));
> else
> ereport(DEBUG1,
> (errcode_for_file_access(),
> errmsg("could not fsync file \"%s\"
> but retrying: %m",
> path)));
>
> I might be all wet here, but it seems like if we enter the bottom
> branch, we still need the promote-to-panic behavior.

That case only is only reached if FILE_POSSIBLY_DELETED() on the first
time through the loop, and it detects an errno value not actually from
fsync(). It's from FileSync(), when it tries to reopen a virtual fd
and gets ENOENT, before calling fsync(). Code further down then
absorbs incoming requests before checking if that was expected,
closing a race. The comments could make that clearer, admittedly.

> 7. The comment adjustment for SyncDataDirectory mentions an
> "important" fact about fsync behavior, but then doesn't seem to change
> any logic on that basis. I think in general a number of these
> comments need a little more thought, but in this particular case, I
> think we also need to consider what the behavior should be (and the
> comment should reflect our considered judgement on that point, and the
> implementation should match).

I updated the comment. I don't think this is too relevant to the
fsync() failure case, because we'll be rewriting all changes from the
WAL again during recovery; I think this function is mostly useful for
switching from fsync = off to fsync = on and restarting, not coping
with previous fsync() failures by retrying (which we know to be
useless anyway). Someone could argue that if you restarted after
changing fsync from off to on, then this may be the first time you
learn that write-back failed, and then you're somewhat screwed whether
we panic or not, but I don't see any solution to that. Don't run
databases with fsync = off.

> 8. Andres suggested to me off-list that we should have a GUC to
> disable the promote-to-panic behavior in case it turns out to be a
> show-stopper for some user. I think that's probably a good idea.
> Adding many new ways to PANIC in a minor release without providing any
> way to go back to the old behavior sounds unfriendly. Surely, anyone
> who suffers much from this has really serious other problems anyway,
> but all the same I think we should provide an escape hatch.

+1. See the new GUC data_sync_retry, defaulting to false. If set to
true, we also need to fix the problem reported in [1], so here's the
patch for that too.

Other comments:

I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
pass here. Inspection of some version of the kernel might tell us it
can't advance the error counter and report failure, but what do we
gain by relying on that? Changed.

FD_DELETE_AT_CLOSE is not a good way to detect temporary files in
recent versions, as it doesn't detect the kind of shared temporary
files used by Parallel Hash; FD_TEMP_FILE_LIMIT is a better way.
Changed. (We could also just not bother exempting temporary files?)

I plan to continue working on the fd-transfer system as part of a
larger sync queue redesign project for 12. If we can get an agreement
that we can't possibly back-patch the fd-transfer logic, then we can
move all future discussion of that topic over to the other thread[2],
and this thread can be about consensus to back-patch the PANIC patch.
Thoughts?

[1] https://www.postgresql.org/message-id/flat/87y3i1ia4w(dot)fsf(at)news-spur(dot)riddles(dot)org(dot)uk
[2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw(at)mail(dot)gmail(dot)com

Attachment Content-Type Size
0001-Don-t-forget-about-failed-fsync-requests-v4.patch application/octet-stream 2.7 KB
0002-PANIC-on-fsync-failure-v4.patch application/octet-stream 15.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-10-18 23:50:46 Re: file cloning in pg_upgrade and CREATE DATABASE
Previous Message Peter Eisentraut 2018-10-18 21:59:00 Re: file cloning in pg_upgrade and CREATE DATABASE