Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-06-14 05:30:55
Message-ID: CAEepm=04ZCG_8N3m61kXZP-7Ecr02HUNNG-QsAhwyFLim4su2g@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> [patches]

Hi Andres,

Obviously there is more work to be done here but the basic idea in
your clone-fd-checkpointer branch as of today seems OK to me. I think
Craig and I both had similar ideas (sending file descriptors that have
an old enough f_wb_err) but we thought rlimit management was going to
be hard (shared memory counters + deduplication, bleugh). You made it
look easy. Nice work.

First, let me describe in my own words what's going on, mostly to make
sure I really understand this:

1. We adopt a new fd lifecycle that is carefully tailored to avoid
error loss on Linux, and can't hurt on other OSes. By sending the
file descriptor used for every write to the checkpointer, we guarantee
that (1) the inode stays pinned (the file is "in flight" in the
socket, even if the sender closes it before the checkpointer receives
it) so Linux won't be tempted to throw away the precious information
in i_mapping->wb_err, and (2) the checkpointer finishes up with a file
descriptor that points to the very same "struct file" with the
f_wb_err value that was originally sampled before the write, by the
sender. So we can't miss any write-back errors. Wahey! However,
errors are still reported only once, so we probably need to panic.
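
(For anyone following along at home, the SCM_RIGHTS dance looks roughly
like this at the syscall level.  This is only a minimal sketch;
FsyncRequest and send_fsync_request are made-up names, not the
identifiers the patch actually uses.)

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Stand-in for whatever fixed-size request struct the patch sends. */
typedef struct FsyncRequest { int dummy; } FsyncRequest;

/*
 * Send one request plus the file descriptor that was used for the write.
 * The fd travels as ancillary data (SCM_RIGHTS), so the kernel keeps the
 * struct file -- and therefore the inode and its wb_err -- pinned until
 * the checkpointer receives it, even if the sender closes fd right away.
 */
static int
send_fsync_request(int sock, const FsyncRequest *req, int fd)
{
    struct msghdr msg = {0};
    struct iovec iov;
    union
    {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    struct cmsghdr *cmsg;

    iov.iov_base = (void *) req;
    iov.iov_len = sizeof(*req);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}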

Hmm... Is there any way that the *sender* could finish up in
file_check_and_advance_wb_err() for the same struct file, before the
checkpointer manages to call fsync() on its dup'd fd? I don't
immediately see how (it looks like you have to call one of the various
sync functions to reach that, and your patch removes the fallback
just-call-FileSync()-myself code from register_dirty_segment()). I
guess it would be bad if, say, close() were to do that in some future
Linux release because then we'd have no race-free way to tell the
checkpointer that the file is borked before it runs fsync() and
potentially gets an OK response and reports a successful checkpoint
(even if we panicked, with sufficiently bad timing it might manage to
report a successful checkpoint).

2. In order to make sure that we don't exceed our soft limit on the
number of file descriptors per process, you use a simple backend-local
counter in the checkpointer, on the theory that we don't care about
fds (or, technically, the files they point to) waiting in the socket
buffer; we care only about how many the checkpointer has actually
received but not yet closed. As long as we make sure there is space
for at least one more before we read one message, everything should be
fine. Good and simple.
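
The receive-side invariant could be as simple as something like this
(pseudo-C with invented names, not the patch's actual functions or
variables):

#include <stdbool.h>

/* Illustrative names only. */
static int open_fsync_queue_files = 0;
static int fsync_fd_limit = 768;        /* some slice of RLIMIT_NOFILE */

extern bool ReceiveOneFsyncRequest(int sock);   /* recvmsg() wrapper; bumps counter on fd receipt */
extern void FlushOneFsyncQueueEntry(void);      /* fsync + close one fd; decrements counter */

static void
absorb_fsync_requests(int sock)
{
    for (;;)
    {
        /*
         * Make sure there is room for at least one more received fd
         * before reading a message that might carry one.  Fds still
         * sitting in the socket buffer are "in flight" and don't count
         * against the checkpointer's limit.
         */
        while (open_fsync_queue_files + 1 > fsync_fd_limit)
            FlushOneFsyncQueueEntry();

        if (!ReceiveOneFsyncRequest(sock))
            break;
    }
}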

One reason I thought this would be harder is because I had no model of
how RLIMIT_NOFILE would apply when you start flinging fds around
between processes (considering there can be moments when neither end
has the file open), so I feared the worst and thought we would need to
maintain a counter in shared memory and have senders and receiver
cooperate to respect it. My intuition that this was murky and
required pessimism wasn't too far from the truth actually: apparently
the kernel didn't do a great job at accounting for that until a
patch[1] landed for CVE-2013-4312.

The behaviour in older releases is super lax, so no problem there.
The behaviour from 4.9 onward (or is it 4.4.1?) is that you get a
separate per-user RLIMIT_NOFILE allowance for in-flight fds. So
effectively the sender doesn't have to worry about fds it has
sent but closed and the receiver doesn't have to care about fds it
hasn't received yet, so your counting scheme seems OK. As for
exceeding RLIMIT_NOFILE with in-flight fds, it's at least bounded by
the fact that the socket would block/EWOULDBLOCK if the receiver isn't
draining it fast enough and can only hold a small and finite amount of
data and thus file descriptors, so we can probably just ignore that.
If you did manage to exceed it, I think you'd find out about that with
ETOOMANYREFS at send time (curiously absent from the sendmsg() man
page, but there in black and white in the source for
unix_attach_fds()), and then you'd just raise an error (currently
FATAL in your patch). I have no idea how the rlimit for SCM-ified
files works on other Unixoid systems though.
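
Concretely, I imagine the send side only needs something like this
around the sendmsg() call (a sketch with stand-in error reporting, not
the patch's code):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Retry on EINTR; on ETOOMANYREFS the per-user in-flight fd allowance is
 * exhausted, and all we can really do is bail out loudly, which is what
 * the patch's FATAL amounts to.
 */
static void
send_or_die(int sock, struct msghdr *msg)
{
    ssize_t rc;

    do
    {
        rc = sendmsg(sock, msg, 0);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
    {
        if (errno == ETOOMANYREFS)
            perror("too many in-flight file descriptors");
        else
            perror("could not send fsync request");
        abort();
    }
}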

Some actual code feedback:

+		if (entry->syncfds[forknum][segno] == -1)
+		{
+			char *path = mdpath(entry->rnode, forknum, segno);
+			open_fsync_queue_files++;
+			/* caller must have reserved entry */
+			entry->syncfds[forknum][segno] =
+				FileOpenForFd(fd, path);
+			pfree(path);
+		}
+		else
+		{
+			/*
+			 * File is already open. Have to keep the older fd, errors
+			 * might only be reported to it, thus close the one we just
+			 * got.
+			 *
+			 * XXX: check for errors.
+			 */
+			close(fd);
+		}

Wait... it's not guaranteed to be older in open() time, is it? It's
just older in sendmsg() time. Those might be different:

A: open("base/42/1234")
A: write()
kernel: inode->i_mapping->wb_err++
B: open("base/42/1234")
B: write()
B: sendmsg()
A: sendmsg()
C: recvmsg() /* clone B's fd */
C: recvmsg() /* clone A's fd, throw it away because we already have B's */
C: fsync()

I think that would eat an error under the 4.13 code. I think it might
be OK under the new 4.17 code, because the new "must report it to at
least one caller" thing would save the day. So now I'm wondering if
that's getting backported to the 4.14 long term kernel.

Aha, yes it has been already[2].

So... if you don't have to worry about kernels without that patch, I
suspect the exact ordering doesn't matter anymore, as long as
*someone* held the file open at all times between write and fsync() to
keep the inode around, which this patch achieves. (And of course no
other process sets the ERRSEQ_SEEN flag, for example some other kind
of backend or random other program that opened our data file and
called one of the sync syscalls, or any future syscalls that start
calling file_check_and_advance_wb_err()).

+ /*
+ * FIXME: do DuplicateHandle dance for windows - can that work
+ * trivially?
+ */

I don't know, I guess something like CreatePipe() and then
write_duplicate_handle()? And some magic (maybe our own LWLock) to
allow atomic messages?

A more interesting question is: how will you cap the number of file
handles you send through that pipe? On that OS you call
DuplicateHandle() to fling handles into another process's handle table
directly. Then you send the handle number as plain old data to the
other process via carrier pigeon, smoke signals, a pipe etc. That's
interesting because the handle allocation is asynchronous from the
point of view of the receiver. Unlike the Unix case where the
receiver can count handles and make sure there is space for one more
before it reads a potentially-SCM-containing message, here the
*senders* will somehow need to make sure they don't create too many in
the receiving process. I guess that would involve a shared counter,
and a strategy for what to do when the number is too high (probably
just wait).

Hmm. I wonder if that would be a safer approach on all operating systems.
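
For what it's worth, I imagine the handle-flinging half would look
something like this (untested sketch; hCheckpointerProc, how the handle
value then travels to the checkpointer, and the shared counter are all
hand-waved):

#include <windows.h>

/*
 * Push a duplicate of our file HANDLE directly into the checkpointer's
 * handle table, then ship the returned handle value to it as ordinary
 * data.  The returned value is only meaningful inside the checkpointer
 * process.
 */
static HANDLE
duplicate_into_checkpointer(HANDLE hFile, HANDLE hCheckpointerProc)
{
    HANDLE hRemote = INVALID_HANDLE_VALUE;

    if (!DuplicateHandle(GetCurrentProcess(),   /* source process */
                         hFile,                 /* handle to duplicate */
                         hCheckpointerProc,     /* target process */
                         &hRemote,              /* receives remote handle */
                         0,
                         FALSE,
                         DUPLICATE_SAME_ACCESS))
        return INVALID_HANDLE_VALUE;

    return hRemote;
}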

+ if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)
...
+ size = sendmsg(sock, &msg, 0);

Here you are relying on atomic sending and receiving of whole messages
over a stream socket. For example, in Linux's unix_stream_sendmsg()
it's going to be chopped into buffers of size (sk->sk_sndbuf >> 1) -
64 (I have no clue how big that is) that are appended one at a time to
the receiving socket's queue, with no locking in between, so a
concurrently sent message can be interleaved with yours. How big does a
message have to be for that to happen? There doesn't seem to be an equivalent
of PIPE_BUF for Unix domain sockets. These tiny messages are almost
certainly safe, but I wonder if we should be using SOCK_SEQPACKET
instead of SOCK_STREAM?

This might become a less theoretical problem if we switch to
variable-sized messages containing file paths, as you once contemplated
in an off-list chat.
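
The change itself would be tiny, something like this (sketch, with
stand-in error handling):

#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

/*
 * SOCK_SEQPACKET preserves message boundaries: each recvmsg() returns
 * exactly one whole message, so a concurrently sent request can never be
 * interleaved into the middle of ours, however big the messages get.
 */
static void
create_fsync_socketpair(int fsync_fds[2])
{
    if (socketpair(AF_UNIX, SOCK_SEQPACKET | SOCK_CLOEXEC, 0, fsync_fds) < 0)
    {
        perror("socketpair");
        exit(1);
    }
}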

Presumably for EXEC_BACKEND we'll need to open
PGDATA/something/something/socket or similar.

+	if (returnCode < 0)
+	{
+		/* XXX: decide on policy */
+
+		bms_add_member(entry->requests[forknum], segno);

Obviously this is pseudocode (doesn't even keep the return value), but
just BTW, I think that if we decide not to PANIC unconditionally on
any kind of fsync() failure, we definitely can't use bms_add_member()
here (it might fail to allocate, and then we forget the segment, raise
an error and won't try again). It's got to be PANIC or no-fail code
(like the patch I proposed in another thread).

+SendFsyncRequest(CheckpointerRequest *request, int fd)
...
+	 * Don't think short reads will ever happen in realistic
...
+	ereport(FATAL, (errmsg("could not receive fsync request: %m")));

Short *writes*, could not *send*.

+ * back, as that'd lead to loosing error reporting guarantees on

s/loosing/losing/

[1] https://github.com/torvalds/linux/commit/712f4aad406bb1ed67f3f98d04c044191f0ff593
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/lib/errseq.c?h=linux-4.14.y&id=0799a0ea96e4923f52f85fe315b62e9176a3319c

--
Thomas Munro
http://www.enterprisedb.com
