Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-08-10 12:09:16
Message-ID: CAEepm=2WSPP03-20XHpxohSd2UyG_dvw5zWS1v7Eas8Rd=5e4A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 29, 2018 at 6:14 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> As a way of poking this thread, here are some more thoughts.

I am keen to move this forward, not only because it is something we
need to get fixed, but also because I have some other pending patches
in this area and I want this sorted out first.

Here are some small fix-up patches for Andres's patchset:

1. Use FD_CLOEXEC instead of the non-portable Linuxism SOCK_CLOEXEC.

2. Fix the self-deadlock hazard reported by Dmitry Dolgov. Instead
of the checkpointer trying to send itself a CKPT_REQUEST_SYN message
through the socket (whose buffer may be full), I included the
ckpt_started counter in all messages. When AbsorbAllFsyncRequests()
drains the socket, it stops at messages with the current ckpt_started
value.

3. Handle postmaster death while waiting.

4. I discovered that macOS would occasionally return EMSGSIZE for
sendmsg(), but treating that just like EAGAIN seems to work the next
time around. I couldn't make that happen on FreeBSD (I mention that
because its implementation is related to macOS's). So handle that weird
case on macOS only for now.
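
A minimal sketch of the idea in item 2, not the code from the patch: every
request carries the sender's view of the checkpoint counter, and the
draining side stops once it sees a message stamped with the current value
instead of waiting for a marker it sent to itself. The struct layout, the
function names and the "segment" payload are hypothetical, and the sketch
assumes a non-blocking socket on which each fixed-size message arrives
whole.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical message layout; the real patch's struct is not shown here. */
typedef struct FsyncRequestMsg
{
    uint64_t ckpt_started;  /* sender's view of the checkpoint counter */
    uint32_t segment;       /* stand-in for the real request payload */
} FsyncRequestMsg;

/* Backend side: send one fixed-size request. Returns 0 on success, -1 otherwise. */
static int
send_request(int fd, uint64_t ckpt_started, uint32_t segment)
{
    FsyncRequestMsg msg;

    memset(&msg, 0, sizeof(msg));   /* also zeroes struct padding */
    msg.ckpt_started = ckpt_started;
    msg.segment = segment;
    return send(fd, &msg, sizeof(msg), 0) == (ssize_t) sizeof(msg) ? 0 : -1;
}

/*
 * Checkpointer side: read requests from a non-blocking socket until it is
 * empty, or until a message stamped with the current counter value has been
 * read. Returns the number of messages absorbed, or -1 on error.
 */
static int
absorb_until_current(int fd, uint64_t current_ckpt)
{
    int absorbed = 0;

    assert(fd >= 0);
    for (;;)
    {
        FsyncRequestMsg msg;
        ssize_t n = recv(fd, &msg, sizeof(msg), 0);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return absorbed;    /* socket drained */
            return -1;
        }
        if (n != (ssize_t) sizeof(msg))
            return -1;              /* EOF or torn message */

        absorbed++;                 /* real code would remember the request here */
        if (msg.ckpt_started == current_ckpt)
            return absorbed;        /* everything older has been seen */
    }
}
```

Because the stopping condition depends only on data already in each
message, the draining side never has to write into a socket whose buffer
may be full.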

Testing on other Unixoid systems would be useful. The case that
produced occasional EMSGSIZE on macOS was: shared_buffers=1MB,
max_files_per_process=32, installcheck-parallel. Based on man pages
that seems to imply an error in the client code but I don't see it.

(I also tried to use SOCK_SEQPACKET instead of SOCK_STREAM, but it's
not supported on macOS. I also tried to use SOCK_DGRAM, but that
produced occasional ENOBUFS errors, and retrying didn't immediately
succeed, leading to busy syscall churn. This is all rather
unsatisfying, since SOCK_STREAM is not guaranteed by any standard to
be atomic, and we're writing messages from many backends into the
socket so we're assuming atomicity. I don't have a better idea that
is portable.)

There are a couple of FIXMEs remaining, and I am aware of three more problems:

* Andres mentioned to me off-list that there may be a deadlock risk
where the checkpointer gets stuck waiting for an IO lock. I'm going
to look into that.
* Windows. Patch soon.
* The ordering problem that I mentioned earlier: the patchset wants to
keep the *oldest* fd, but what it actually keeps is the oldest one the
checkpointer has received, which is not necessarily the same thing. An
idea Andres and I discussed is to use a shared atomic counter to
assign a number to all file descriptors just before their first write,
and send that along with it to the checkpointer. Patch soon.
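
The counter idea in the last point could look roughly like this. It is a
sketch under assumptions, not the forthcoming patch: C11 atomics stand in
for whatever shared-memory atomic the real code would use, and the
TrackedFile struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * In a multi-process server this counter would live in shared memory;
 * a plain global is used here to keep the sketch self-contained.
 */
static atomic_uint_fast64_t next_file_seq = 1;

typedef struct TrackedFile
{
    int      fd;
    uint64_t seq;   /* 0 means "not written to yet" */
} TrackedFile;

/* Backend side: assign a sequence number just before the first write. */
static uint64_t
stamp_before_first_write(TrackedFile *file)
{
    assert(file != NULL);
    if (file->seq == 0)
        file->seq = atomic_fetch_add(&next_file_seq, 1);
    return file->seq;
}

/*
 * Checkpointer side: given the fd currently kept for a file and a newly
 * received one, prefer the one with the lower sequence number, regardless
 * of the order in which the two arrived.
 */
static bool
should_replace(const TrackedFile *kept, const TrackedFile *incoming)
{
    return incoming->seq < kept->seq;
}
```

With the number sent along with each descriptor, "oldest" is decided by
when the file was first written rather than by arrival order at the
checkpointer.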

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-Use-portable-close-on-exec-syscalls.patch application/octet-stream 1.7 KB
0002-Fix-deadlock-in-AbsorbAllFsyncRequests.patch application/octet-stream 4.5 KB
0003-Handle-postmaster-death-CFI-improve-error-messages-a.patch application/octet-stream 1.7 KB
0004-Handle-EMSGSIZE-on-macOS.-Fix-misleading-error-messa.patch application/octet-stream 2.4 KB
