Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Date: 2022-08-07 09:49:19
Message-ID: CA+hUKG+cKFKnn7wg_1=NJPw2ekdG+kw9Z4HcYR=XzehkT7n5kw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> FWIW, when it comes to that we have a couple of routines that just use
> '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it
> introduces in fe_utils.c a new wrapper
> (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
> (pg_pwritev_with_retry) for pwrite().

What about pg_write_initial_zeros(fd, size)? How it writes zeros (ie
using vector I/O, and retrying) seems to be an internal detail, no?
"initial" to make it clearer that it's at offset 0, or maybe
pg_pwrite_zeros(fd, size, offset).

> A second thing is that pg_pwritev_with_retry_and_write_zeros() is
> designed to work on WAL segments initialization and it uses
> XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
> in its name that tells us so. This makes me question whether
> file_utils.c is a good location for this second thing. Could a new
> file be a better location? We have a xlogutils.c in the backend, and
> a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange. We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack. I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon). The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

I think this should also handle the remainder after processing whole
blocks, just for completeness. If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2022-08-07 13:18:52 Re: logical decoding and replication of sequences
Previous Message Álvaro Herrera 2022-08-07 08:21:23 Re: pgsql: BRIN: mask BRIN_EVACUATE_PAGE for WAL consistency checking