Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-07-29 06:14:56
Message-ID: CAEepm=0uAGf6FvmX7YqHc7hqqSHRCWK17BwrXZgE+YYOcyR4Gw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 14, 2018 at 5:30 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, May 23, 2018 at 8:02 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> [patches]
>
> A more interesting question is: how will you cap the number of file
> handles you send through that pipe? On that OS you call
> DuplicateHandle() to fling handles into another process's handle table
> directly. Then you send the handle number as plain old data to the
> other process via carrier pigeon, smoke signals, a pipe etc. That's
> interesting because the handle allocation is asynchronous from the
> point of view of the receiver. Unlike the Unix case where the
> receiver can count handles and make sure there is space for one more
> before it reads a potentially-SCM-containing message, here the
> *senders* will somehow need to make sure they don't create too many in
> the receiving process. I guess that would involve a shared counter,
> and a strategy for what to do when the number is too high (probably
> just wait).
>
> Hmm. I wonder if that would be a safer approach on all operating systems.

As a way of poking this thread, here are some more thoughts. Buffer
stealing currently looks something like this:

Evicting backend:
lseek(fd)
write(fd)
...enqueue-fsync-request via shm...

Checkpointer:
...push into hash table...

With the patch it presumably looks something like this:

Evicting backend:
lseek(fd)
write(fd)
sendmsg(fsync_socket) /* passes fd */

Checkpointer:
recvmsg(fsync_socket) /* gets a copy of fd */
...push into hash table...
close(fd) /* for all but the first one received for the same file */

That takes us from 2 syscalls to 5 per evicted buffer. I suppose it's
possible that on some operating systems that might hurt a bit, given
that it's happening at the granularity of 1GB data files that could
have a lot of backends working in them concurrently. I have no idea
if it's really a problem on any particular OS. Admittedly on Linux
it's probably just a bunch of fast atomic ops and RCU stuff...
probably only the existing write() actually takes the inode lock or
anything that heavy, and that's probably lost in the noise in an
evict-heavy workload. I don't know, I guess it's probably not a
problem, but I thought I'd mention that.

Contention on the new fsync socket doesn't seem to be a new problem
per se since it replaces a contention point we already had:
CheckpointerCommLock. If that was acceptable today then perhaps that
indicates that any in-kernel contention created by the new syscalls is
also OK.

My feeling so far is that I'd probably go for a sender-collapses model
(and it might even be necessary on Windows?) if doing this as a new
feature, but I fully understand your desire to do it in a much simpler
way that could be back-patched more easily. I'm just slightly
concerned about the unintended consequence risk that comes with
exercising an operating system feature that not all operating system
authors probably intended to be used at high frequency. Nothing that
can't be assuaged by testing.

* the queue is full and contains no duplicate entries. In that case, we
* let the backend know by returning false.
*/
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno,
+ File file)

Comment out of date.

--
Thomas Munro
http://www.enterprisedb.com
