Re: Postgres, fsync, and OSs (specifically linux)

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-04-28 12:00:25
Message-ID: CAMsr+YGz+9cEs88aq0hBbFvz1pFNGBVPNd1qsJNLkgJLhpDFag@mail.gmail.com
Lists: pgsql-hackers

On 28 April 2018 at 06:28, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> I thought I'd send this separately from [0] as the issue has become more
> general than what was mentioned in that thread, and it went off into
> various weeds.

Thanks very much for going and for the great summary.

> - Actually marking files as persistently failed would require filesystem
> changes, and filesystem metadata IO, far from guaranteed in failure
> scenarios.

Yeah, I've avoided suggesting anything like that because it seems way
too likely to lead to knock-on errors.

Like malloc'ing in an OOM path, just don't.

> The second major type of proposal was using direct-IO. That'd generally
> be a desirable feature, but a) would require some significant changes to
> postgres to be performant, b) isn't really applicable for the large
> percentage of installations that aren't tuned reasonably well, because
> at the moment the OS page cache functions as a memory-pressure aware
> extension of postgres' page cache.

Yeah. I've avoided advocating for O_DIRECT because it's a big job
(understatement). We'd also need to pay much more attention to the
details of storage layout if we couldn't rely as much on the kernel
neatly organising and queueing everything for us.

At the risk of displaying my relative ignorance of direct I/O: Does
O_DIRECT without O_SYNC even provide a strong guarantee that when you
close() the file, all I/O has reliably succeeded? It must've gone
through the FS layer, but don't FSes do various caching and
reorganisation too? Can the same issue arise in other ways unless we
also fsync() before close() or write O_SYNC?
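
To make my question concrete: even with O_DIRECT, I'd have assumed we
still need something like the following rather than trusting close()
to tell us anything. (Untested, Linux-specific sketch of my own, error
handling mostly elided.)

    /* Untested sketch: even with O_DIRECT, fdatasync() before close()
     * rather than relying on close() to report I/O failures. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char   *buf;
        int     fd;

        /* O_DIRECT typically needs block-aligned buffers and sizes */
        if (posix_memalign((void **) &buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 'x', 4096);

        fd = open("scratchfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, 4096) != 4096)
            perror("write");        /* can still fail, or be short */

        /* Without O_SYNC/O_DSYNC, this is where device/FS errors
         * should surface; close() alone gives no such guarantee. */
        if (fdatasync(fd) != 0)
            perror("fdatasync");

        if (close(fd) != 0)
            perror("close");
        return 0;
    }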

At one point I looked into using AIO instead. But last I looked it was
pretty spectacularly quirky when it comes to reliably flushing, and
outright broken on some versions. In any case, our multiprocessing
model would make tracking completions annoying, likely more so than
the sort of FD handoff games we've discussed.

> Another topic brought up in this thread was the handling of ENOSPC
> errors that aren't triggered on a filesystem level, but rather are
> triggered by thin provisioning. On Linux that currently apparently
> leads to page cache contents being lost (and errors "eaten") in a lot
> of places, including just when doing a write().

... wow.

Is that with lvm-thin?

The thin provisioning I was mainly concerned with is SAN-based thin
provisioning, which looks to Linux like a normal iSCSI target or a
normal LUN on an HBA. If it runs out of backing store, it starts
failing writes with a weird, potentially vendor-specific sense error.
How that's handled likely depends on the specific error, the driver,
which FS you use, etc. In the case I saw (multipath+lvm+xfs), it
resulted in lost writes and fsync() errors being reported only once,
per the start of the original thread.

> In a lot of cases it's
> pretty much expected that the file system will just hang or react
> unpredictably upon space exhaustion. My reading is that the block-layer
> thin provisioning code is still pretty fresh, and should only be used
> with great care. The only way to halfway reliably use it appears to be
> to change the configuration so space exhaustion blocks until admin
> intervention (at least dm-thinp allows that).

Seems like that should go in the OS-specific configuration part of the
docs, along with the advice I gave on the original thread about
configuring multipath's no_path_retry.

> There's some clear need to automate some more testing in this area so
> that future behaviour changes don't surprise us.

We don't routinely test ENOSPC (or memory exhaustion, or crashes) in
PostgreSQL even on bog standard setups.

Like the performance farm discussion, this is something I'd like to
pick up at some point. I'll need to talk to the team I work with about
time and resource allocation, but I think it's important that we make
such testing more of a routine thing.
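
For what it's worth, the kind of standalone check I have in mind is
roughly this toy (mine, not proposed for the tree): point it at a file
on a small or thin-provisioned test volume and see where the failure
actually shows up.

    /* Toy check of where write()/fsync()/close() errors surface when a
     * volume fills up. Run against a small or thin-provisioned FS. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";
        char        buf[8192];
        int         fd;

        memset(buf, 'x', sizeof(buf));

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Write until the volume runs out of space or errors out. */
        for (;;)
        {
            if (write(fd, buf, sizeof(buf)) < 0)
            {
                printf("write failed: %s\n", strerror(errno));
                break;
            }
        }

        /* The interesting part: is the failure (still) visible here? */
        if (fsync(fd) != 0)
            printf("first fsync failed: %s\n", strerror(errno));
        else
            printf("first fsync reported success\n");

        if (fsync(fd) != 0)
            printf("second fsync failed: %s\n", strerror(errno));
        else
            printf("second fsync reported success\n");

        if (close(fd) != 0)
            printf("close failed: %s\n", strerror(errno));
        return 0;
    }

Wrapped in a harness that sets up a small loopback or thin-provisioned
volume, something like that could presumably run as a routine test.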

> - Matthew Wilcox proposed (and posted a patch) that'd partially revert
> behaviour to the pre v4.13 world, by *also* reporting errors to
> "newer" file-descriptors if the error hasn't previously been
> reported. That'd still not guarantee that the error is reported
> (memory pressure could evict information without an open fd), but in most
> situations we'll again get the error in the checkpointer.
>
> This seems to be largely agreed upon. It's unclear whether it'll go into
> the stable backports for still-maintained >= v4.13 kernels.

That seems very sensible. In our case we're very unlikely to have some
other unrelated process come in and fsync() our files for us.

I'd want to be sure the report didn't get eaten by sync() or syncfs() though.

> - syncfs() will be fixed so it reports errors properly - that'll likely
> require passing it an O_PATH filedescriptor to have space to store the
> errseq_t value that allows discerning already reported and new errors.
>
> No patch has appeared yet, but the behaviour seems largely agreed
> upon.

Good, but as you noted, of limited use to us unless we want to force
users to manage space for temporary and unlogged relations completely
separately.

I wonder if we could convince the kernel to offer a file_sync_mode
xattr to control this? (Hint: I'm already running away in a mylar fire
suit).

> - Make per-filesystem error counts available in a uniform (i.e. same for
> every supporting fs) manner. Right now it's very hard to figure out
> whether errors occurred. There seemed general agreement that exporting
> knowledge about such errors is desirable. Quite possibly the syncfs()
> fix above will provide the necessary infrastructure. It's unclear as
> of yet how the value would be exposed. Per-fs /sys/ entries and an
> ioctl on O_PATH fds have been mentioned.
>
> These error counts would not vanish due to memory pressure, and they
> can be checked even without knowing which files in a specific
> filesystem have been touched (e.g. when just untar-ing something).
>
> There seemed to be fairly widespread agreement that this'd be a good
> idea. It's much less clear whether somebody would do the work.
>
> - Provide config knobs that allow defining the FS error behaviour in a
> consistent way across supported filesystems. XFS currently has various
> knobs controlling what happens in case of metadata errors [1] (retry
> forever, timeout, return up). It was proposed that this interface be
> extended to also deal with data errors, and moved into generic support
> code.
>
> While the timeline is unclear, there seemed to be widespread support
> for the idea. I believe Dave Chinner indicated that he at least has
> plans to generalize the code.

That's great. It sounds like this has revitalised some interest in the
error reporting and might yield some more general cleanups :)

> - Stop inodes with unreported errors from being evicted. This will
> guarantee that a later fsync (without an open FD) will see the
> error. The memory pressure concerns here are lower than with keeping
> all the failed pages in memory, and it could be optimized further.
>
> I read some tentative agreement behind this idea, but I think it's by
> far the most controversial one.

The main issue there would seem to be cases of whole-FS failure like
the USB-key-yank example. You're going to have to be able to get rid
of those inodes at some point.

> - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
> retry recovery. While ENODEV (underlying device went away) will be
> persistent, it probably makes sense to treat it the same or even just
> give up and shut down. One question I see here is whether we just
> want to continue crash-recovery cycles, or whether we want to limit
> that.

Right now we'll panic once, then panic again in redo if the error
persists, and then give up.

On some systems, and everywhere that Pg is directly user-managed with
pg_ctl, that'll leave Pg down until the operator intervenes. Some init
systems will restart the postmaster automatically. Some will give up
after a few tries. Some will back off retries over time. It depends on
the init system. I'm not sure that's a great outcome.

So rather than giving up if redo fails, we might want to offer a knob
to retry, possibly with pause/backoff. I'm sure people currently
expect PostgreSQL to try to stay up and recover, like it does after a
segfault or most other errors.

Personally I prefer to run Pg with restart_after_crash=off and let the
init system launch a new postmaster, but that's not an option unless
you have a sensible init.

> - We need more aggressive error checking on close(), for ENOSPC and
> EIO. In both cases afaics we'll have to trigger a crash recovery
> cycle. It's entirely possible to end up in a loop on NFS etc, but I
> don't think there's a way around that.
>
> Robert, on IM, wondered whether there'd be a race between some backend
> doing a close(), triggering a PANIC, and a checkpoint succeeding. I
> don't *think* so, because the error will only happen if there's
> outstanding dirty data, and the checkpoint would have flushed that out
> if it belonged to the current checkpointing cycle.

Even if it's possible (which it sounds like it probably isn't), it
might also be one of those corner-cases-of-corner-cases where we just
shrug and worry about bigger fish.

> - The outstanding fsync request queue isn't persisted properly [3]. This
> means that even if the kernel behaved the way we'd expected, we'd not
> fail a second checkpoint :(. It's possible that we don't need to deal
> with this because we'll henceforth PANIC, but I'd argue we should fix
> that regardless. Seems like a time-bomb otherwise (e.g. after moving
> to DIO somebody might want to relax the PANIC...).

Huh! Good find. That definitely merits fixing.

> - It might be a good idea to whitelist expected return codes for write()
> and PANIC on ones that we did not expect. E.g. when hitting an EIO we
> should probably PANIC, to get back to a known good state. Even though
> it's likely that we'd see that error again at fsync().
>
> - Docs.

Yep. Especially OS-specific configuration for known dangerous setups
(lvm-thin, multipath), etc. I imagine we can distill a lot of it from
the discussion and simplify a bit.
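
On the write() errno whitelist point above, I imagine something
roughly like the following (illustrative only, assumes the usual
backend ereport machinery; the function name is mine). Exactly which
errnos belong on the ERROR side versus the PANIC side is the debatable
part - given your thin-provisioning observation, ENOSPC on write()
might arguably need to be PANIC too.

    /* Illustrative only: treat unexpected write() errnos as PANIC so
     * we return to a known-good state via crash recovery, rather than
     * trusting the page cache to still hold the data. Assumes backend
     * context (postgres.h, ereport, etc). */
    static void
    handle_write_errno(int save_errno, const char *path)
    {
        errno = save_errno;     /* for %m */

        switch (save_errno)
        {
            case EINTR:
            case EAGAIN:
                /* transient; caller just retries the write() */
                break;

            case ENOSPC:
            case EDQUOT:
            case EFBIG:
                /* "expected" space-type failures */
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not write to file \"%s\": %m",
                                path)));
                break;

            default:
                /* EIO and anything else we didn't anticipate */
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not write to file \"%s\": %m",
                                path)));
        }
    }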

> I think we also need to audit a few codepaths. I'd be surprised if we
> PANICed appropriately on all fsyncs(), particularly around the SLRUs.

We _definitely_ do not; see the patch I sent on the other thread.

> Then there's the question of how we want to deal with kernels that
> haven't been updated with the aforementioned changes. We could say that
> we expect decent OS support and declare that we just can't handle this -
> given that at least various Linux versions, NetBSD, OpenBSD, and macOS
> just silently drop errors, and we'd need different approaches for
> dealing with that, that doesn't seem like an insane approach.

> What we could do:
>
> - forward file descriptors from backends to checkpointer (using
> SCM_RIGHTS) when marking a segment dirty. That'd require some
> optimizations (see [4]) to avoid doing so repeatedly. That'd
> guarantee correct behaviour in all linux kernels >= 4.13 (possibly
> backported by distributions?), and I think it'd also make it vastly
> more likely that errors are reported in earlier kernels.

It'd be interesting to see if other platforms that support fd passing
will give us the desired behaviour too. But even if it only helps on
Linux, that's a huge majority of the PostgreSQL deployments these
days.
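
For anyone following along who hasn't used it, fd passing over an
AF_UNIX socket looks roughly like this; send_one_fd()/recv_one_fd()
are just illustrative names of mine, not anything in the tree or in
your proposal.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass an open fd to another process over an AF_UNIX socket. */
    static int
    send_one_fd(int sock, int fd_to_send)
    {
        struct msghdr   msg = {0};
        struct iovec    iov;
        char            byte = 0;
        char            cbuf[CMSG_SPACE(sizeof(int))] = {0};
        struct cmsghdr *cmsg;

        iov.iov_base = &byte;       /* must carry at least one data byte */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    /* Receive an fd sent with send_one_fd(); returns -1 on failure. */
    static int
    recv_one_fd(int sock)
    {
        struct msghdr   msg = {0};
        struct iovec    iov;
        char            byte;
        char            cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;
        int             fd = -1;

        iov.iov_base = &byte;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        if (recvmsg(sock, &msg, 0) <= 0)
            return -1;

        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        return fd;
    }

Since the received descriptor refers to the same open file description
as the sender's, a later fsync() on it in the checkpointer should see
whatever error state the kernel attached to that file, which is the
point of the exercise.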

> - Add a pre-checkpoint hook that checks for filesystem errors *after*
> fsyncing all the files, but *before* logging the checkpoint completion
> record. Operating systems, filesystems, etc. all log the error format
> differently, but for larger installations it'd not be too hard to
> write code that checks their specific configuration.

I looked into using trace event file descriptors for this, btw, but
we'd need CAP_SYS_ADMIN to create one that captured events for other
processes. Plus filtering the events to find only events for the files
/ file systems of interest would be far from trivial. And I don't know
what guarantees we have about when events are delivered.

I'd love to be able to use inotify for this, but again, that'd only be
a new-kernels thing since it'd need an inotify extension to report I/O
errors.

Presumably this check would mostly end up looking at dmesg.

I'm not convinced it'd get widely deployed and widely used, or that
it'd be used correctly when people tried to use it. Look at the
hideous mess that most backup/standby creation scripts,
archive_command scripts, etc are.

> - Use direct IO. Due to architectural performance issues in PG and the
> fact that it'd not be applicable for all installations I don't think
> this is a reasonable fix for the issue presented here. Although it's
> independently something we should work on. It might be worthwhile to
> provide a configuration option that allows forcing DIO to be enabled
> for WAL even if replication is turned on.

Seems like a long-term goal, but you've noted elsewhere that doing it
well would be hard. I suspect we'd need writer threads, we'd need to
know more about the underlying FS/storage layout to make better
decisions about write parallelism, etc. We get away with a lot right
now by letting the kernel and buffered I/O sort that out.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
