Re: Streaming I/O, vectored I/O (WIP)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Re: Streaming I/O, vectored I/O (WIP)
Date: 2023-11-29 23:16:30
Message-ID: f748992c-26f1-4389-9349-06a0ab75046c@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 29/11/2023 21:39, Thomas Munro wrote:
> One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
> callers believe that short writes set errno: they error out with a
> message including %m. We have historically set errno = ENOSPC inside
> FileWrite() if the write size was unexpectedly small AND the kernel
> didn't set errno to a non-zero value (having set it to zero ourselves
> earlier). In FileWriteV(), I didn't want to do that because it is
> expensive to compute the total write size from the vector array and we
> managed to measure an effect due to that in some workloads.
>
> Note that the smgr patch actually handles short writes by continuing,
> instead of raising an error. Short writes do already occur in the
> wild on various systems for various rare technical reasons other than
> ENOSPC I have heard (imagine transient failure to acquire some
> temporary memory that the kernel chooses not to wait for, stuff like
> that, though certainly many people and programs believe they should
> not happen[1]), and it seems like a good idea to actually handle them
> as our write sizes increase and the probability of short writes might
> presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

> With the previous version of the patch, we'd have to change a couple
> of other callers not to believe that short writes are errors and set
> errno (callers are inconsistent on this point). I don't really love
> that we have "fake" system errors but I also want to stay focused
> here, so in this new version V3 I tried a new approach: I realised I
> can just always set errno without needing the total size, so that
> (undocumented) aspect of the interface doesn't change. The point
> being that it doesn't matter if you clobber errno with a bogus value
> when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2023-11-29 23:36:25 Re: Refactoring backend fork+exec code
Previous Message Matthias van de Meent 2023-11-29 22:59:33 Re: Parallel CREATE INDEX for BRIN indexes