Re: Postgres, fsync, and OSs (specifically linux)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Date: 2018-05-19 04:51:40
Message-ID: CAEepm=05_NJXxaC59bTd7vq8w9aCim2_61Au5dWUW39Z6+bYPg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 19, 2018 at 9:03 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I've written a patch series for this. Took me quite a bit longer than I
> had hoped.

Great.

> I plan to switch to working on something else for a day or two next
> week, and then polish this further. I'd greatly appreciate comments till
> then.

Took it for a spin on macOS and FreeBSD. First problem:

+ if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fsync_fds) < 0)

SOCK_CLOEXEC isn't portable (FreeBSD yes since 10, macOS no, others
who knows). Adding FD_CLOEXEC to your later fcntl() calls is probably
the way to do it? I understand from reading the Linux man pages that
there are race conditions with threads but that doesn't apply here.
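
To illustrate the portable alternative, here is a minimal sketch that
creates the socket pair without SOCK_CLOEXEC and then sets FD_CLOEXEC on
each end with fcntl(). The helper names set_cloexec() and
socketpair_cloexec() are hypothetical and not taken from the patch; the
window between socketpair() and fcntl() only matters if another thread
can fork and exec in between, which is assumed not to happen here.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set FD_CLOEXEC on fd while preserving any other descriptor flags.
 * Returns 0 on success, -1 on failure (errno set by fcntl).
 * Hypothetical helper, not part of the patch under discussion.
 */
static int
set_cloexec(int fd)
{
	int			flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/*
 * Create an AF_UNIX stream socket pair with close-on-exec set on both
 * ends, without relying on the non-portable SOCK_CLOEXEC type flag.
 * Returns 0 on success, -1 on failure; on failure no descriptors leak.
 */
static int
socketpair_cloexec(int fds[2])
{
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
		return -1;

	if (set_cloexec(fds[0]) < 0 || set_cloexec(fds[1]) < 0)
	{
		int			save_errno = errno;

		close(fds[0]);
		close(fds[1]);
		errno = save_errno;
		return -1;
	}
	return 0;
}
```

A caller would use socketpair_cloexec(fsync_fds) in place of the
socketpair() call quoted above and report the error if it returns -1.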

Next, make check hangs in initdb on both of my pet OSes when md.c
raises an error (fseek fails) and we raise an error while raising an
error and deadlock against ourselves. Backtrace here:
https://paste.debian.net/1025336/

Apparently the initial error was that mdextend() called _mdnblocks()
which called FileSeek() on vfd 43 == fd 30, pathname "base/1/2704",
but when I check my operating system open file descriptor table I find
that there is no fd 30: there is a 29 and a 31, so it has already been
unexpectedly closed.
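
For anyone wanting to check this from inside the process rather than
from the OS side, a minimal sketch follows. It is not how I checked
above, and the helper name fd_is_open() is hypothetical; it relies only
on fcntl(fd, F_GETFD) failing with EBADF when the descriptor is not
open.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Report whether fd currently refers to an open file descriptor in this
 * process.  fcntl(F_GETFD) fails with EBADF for a closed descriptor; any
 * other outcome means the descriptor exists.
 * Hypothetical debugging helper, not part of PostgreSQL.
 */
static bool
fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

Calling something like this just before the failing seek would show
whether the kernel-level descriptor behind the vfd had already gone
away by then.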

I could dig further and/or provide a shell on a system with dev tools.

> I didn't want to do this now, but I think we should also consider
> removing all awareness of segments from the fsync request queue. Instead
> it should deal with individual files, and the segmentation should be
> handled by md.c. That'll allow us to move all the necessary code to
> smgr.c (or checkpointer?); Thomas said that'd be helpful for further
> work. I personally think it'd be a lot simpler, because having to have
> long bitmaps with only the last bit set for large append only relations
> isn't a particularly sensible approach imo. The only thing that that'd
> make more complicated is that the file/database unlink requests get more
> expensive (as they'd likely need to search the whole table), but that
> seems like a sensible tradeoff. Alternatively using a tree structure
> would be an alternative obviously. Personally I was thinking that we
> should just make the hashtable be over a pathname, that seems most
> generic.

+1

I'll be posting a patch shortly that also needs similar machinery, but
can't easily share with md.c due to technical details. I'd love there
to be just one of those, and for it to be simpler and general.

--
Thomas Munro
http://www.enterprisedb.com
