Re: Some thoughts on NFS

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Craig Ringer <craig(dot)ringer(at)2ndquadrant(dot)com>
Subject: Re: Some thoughts on NFS
Date: 2019-02-19 22:25:22
Message-ID: CA+hUKGJ3J_ZYKpOFM9EF2BOA8y71MfP5_ipLPsSwpB+dTt+GBQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 20, 2019 at 5:52 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > 1. Figure out how to get the ALLOCATE command all the way through the
> > stack from PostgreSQL to the remote NFS server, and know for sure that
> > it really happened. On the Debian buster Linux 4.18 system I checked,
> > fallocate() fails with EOPNOTSUPP, and posix_fallocate()
> > appears to succeed but it doesn't really do anything at all (though I
> > understand that some versions sometimes write zeros to simulate
> > allocation, which in this case would be equally useless as it doesn't
> > reserve anything on an NFS server). We need the server and NFS client
> > and libc to be of the right version and cooperate and tell us that
> > they have really truly reserved space, but there isn't currently a way
> > as far as I can tell. How can we achieve that, without writing our
> > own NFS client?
> >
> > 2. Deal with the resulting performance suckage. Extending 8kb at a
> > time with synchronous network round trips won't fly.
>
> I think I'd just go for fsync();pwrite();fsync(); as the extension
> mechanism, iff we're detecting a tablespace is on NFS. The first fsync()
> to make sure there's no previous errors that we could mistake for
> ENOSPC, the pwrite to extend, the second fsync to make sure there's
> actually space. Then we can detect ENOSPC properly. That possibly does
> leave some errors where we could mistake ENOSPC as something more benign
> than it is, but the cases seem pretty narrow, due to the previous
> fsync() (maybe the other side could be thin provisioned and get an
> ENOSPC there - but in that case we didn't actually lose any data. The
> only dangerous scenario I can come up with is that the remote side is on
> a thinly provisioned CoW system, and a concurrent write to an earlier
> block runs out of space - but seriously, good riddance to you).

This seems to make sense, and has the advantage that it uses
interfaces that exist right now. But it seems we'd have to wait for
the Linux kernel developers to finish building out errseq_t support
for NFS, to avoid various races around the mapping's AS_EIO flag
(A: fsync() -> EIO; B: fsync() -> SUCCESS, logs a checkpoint; A: panics),
and then maybe we'd have to get at least one of { fd-passing, direct
IO, threads } working on our side ...
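
For concreteness, the fsync(); pwrite(); fsync() sequence could be
sketched roughly as below. This is only an illustration, not PostgreSQL
code: extend_one_block() and BLOCK_SIZE are hypothetical names, the
treatment of a zero-byte write as ENOSPC is an assumption, and the
sketch does nothing about the errseq_t race mentioned above.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* for pwrite() and fsync() */
#endif

#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* hypothetical: the 8kb extension unit */

/*
 * Hypothetical helper: extend the file behind fd by one zeroed block at
 * 'offset'.  Returns 0 on success, or -1 with errno set by the failing call.
 */
int
extend_one_block(int fd, off_t offset)
{
    char    zeros[BLOCK_SIZE];
    size_t  done = 0;

    /* 1. Flush first, so an older write-back error is reported here
     *    rather than being mistaken for ENOSPC after our own write. */
    if (fsync(fd) != 0)
        return -1;

    /* 2. Extend the file by writing a block of zeros. */
    memset(zeros, 0, sizeof(zeros));
    while (done < sizeof(zeros))
    {
        ssize_t n = pwrite(fd, zeros + done, sizeof(zeros) - done,
                           offset + (off_t) done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            return -1;
        }
        if (n == 0)
        {
            /* Assumption: no progress and no error is treated as ENOSPC. */
            errno = ENOSPC;
            return -1;
        }
        done += (size_t) n;
    }

    /* 3. Flush again, so the server has to accept the new block now and
     *    any ENOSPC surfaces at this call instead of later. */
    if (fsync(fd) != 0)
        return -1;

    return 0;
}
```

A caller would treat a failure of the second fsync() with errno ==
ENOSPC as "out of space, nothing lost", and other errors according to
whatever fsync-failure policy is in force.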

--
Thomas Munro
https://enterprisedb.com
