Re: Don't synchronously wait for already-in-progress IO in read stream

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Don't synchronously wait for already-in-progress IO in read stream
Date: 2026-03-17 17:26:02
Message-ID: u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2@hph6fvdiiria
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> Attached v5 adds some comments to the tests, fixes a few nits in the
> actual code, and adds a commit to fix what I think is an existing
> off-by-one error in TRACE_POSTGRESQL_BUFFER_READ_DONE.

> Subject: [PATCH v5 3/5] Fix off-by-one error in read IO tracing
>
> ---
> src/backend/storage/buffer/bufmgr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 00bc609529a..0723d4f3dd8 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1990,7 +1990,7 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
> * must have started out as a miss in PinBufferForBlock(). The other
> * backend will track this as a 'read'.
> */
> - TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done,
> + TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + operation->nblocks_done - 1,
> operation->smgr->smgr_rlocator.locator.spcOid,
> operation->smgr->smgr_rlocator.locator.dbOid,
> operation->smgr->smgr_rlocator.locator.relNumber,
> --

Ah, the issue is that we already incremented nblocks_done, right? Maybe it'd
be easier to understand if we stashed blocknum + nblocks_done into a local
var, and use it in in both branches of if (!ReadBuffersCanStartIO())?

This probably needs to be backpatched...

> Subject: [PATCH v5 4/5] Make buffer hit helper
>
> Already two places count buffer hits, requiring quite a few lines of
> code since we do accounting in so many places. Future commits will add
> more locations, so refactor into a helper.
>
> Reviewed-by: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
> ---
> src/backend/storage/buffer/bufmgr.c | 111 ++++++++++++++--------------
> 1 file changed, 56 insertions(+), 55 deletions(-)
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index 0723d4f3dd8..399004c2e44 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -648,6 +648,10 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
> bool *foundPtr, IOContext io_context);
> static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
> static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
> +
> +static pg_attribute_always_inline void CountBufferHit(BufferAccessStrategy strategy,
> + Relation rel, char persistence, SMgrRelation smgr,
> + ForkNumber forknum, BlockNumber blocknum);
> static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
> static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
> IOObject io_object, IOContext io_context);
> @@ -1226,8 +1230,6 @@ PinBufferForBlock(Relation rel,
> bool *foundPtr)
> {
> BufferDesc *bufHdr;
> - IOContext io_context;
> - IOObject io_object;
>
> Assert(blockNum != P_NEW);
>
> @@ -1236,17 +1238,6 @@ PinBufferForBlock(Relation rel,
> persistence == RELPERSISTENCE_PERMANENT ||
> persistence == RELPERSISTENCE_UNLOGGED));
>
> - if (persistence == RELPERSISTENCE_TEMP)
> - {
> - io_context = IOCONTEXT_NORMAL;
> - io_object = IOOBJECT_TEMP_RELATION;
> - }
> - else
> - {
> - io_context = IOContextForStrategy(strategy);
> - io_object = IOOBJECT_RELATION;
> - }
> -

I'm mildly worried that this will lead to a bit worse code generation, the
compiler might have a harder time figuring out that io_context/io_object
doesn't change across multiple PinBufferForBlock calls. Although it already
might be unable to do so (we don't mark IOContextForStrategy as
pure [1]).

I kinda wonder if, for StartReadBuffersImpl(), we should go the opposite
direction, and explicitly look up IOContextForStrategy(strategy) *before* the
actual_nblocks loop to make sure the compiler doesn't inject external function
calls (which will in all likelihood require register spilling etc).

I don't think that necessarily has to conflict with the goal of this patch -
most of the the deduplicated stuff isn't io_context, so the helper will be
beneficial even if have to pull out the io_context/io_object determination to
the callsites.

> TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> smgr->smgr_rlocator.locator.spcOid,
> smgr->smgr_rlocator.locator.dbOid,
> @@ -1254,18 +1245,11 @@ PinBufferForBlock(Relation rel,
> smgr->smgr_rlocator.backend);
>
> if (persistence == RELPERSISTENCE_TEMP)
> - {
> bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
> - if (*foundPtr)
> - pgBufferUsage.local_blks_hit++;
> - }
> else
> - {
> bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
> - strategy, foundPtr, io_context);
> - if (*foundPtr)
> - pgBufferUsage.shared_blks_hit++;
> - }
> + strategy, foundPtr, IOContextForStrategy(strategy));
> +
> if (rel)
> {
> /*

And here it might end up adding a separate persistence == RELPERSISTENCE_TEMP
branch in CountBufferHit(), I suspect the compiler may not be able to optimize
it away.

At the very least I'd invert the call to CountBufferHit() and the
pgstat_count_buffer_read(), as the latter will probably prevent most
optimizations (due to the compiler not being able to prove that
(rel)->pgstat_info->counts.blocks_fetched is a different memory location as
*foundPtr).

> +/*
> + * We track various stats related to buffer hits. Because this is done in a
> + * few separate places, this helper exists for convenience.
> + */
> +static pg_attribute_always_inline void
> +CountBufferHit(BufferAccessStrategy strategy,
> + Relation rel, char persistence, SMgrRelation smgr,
> + ForkNumber forknum, BlockNumber blocknum)
> +{
> + IOContext io_context;
> + IOObject io_object;
> +
> + if (persistence == RELPERSISTENCE_TEMP)
> + {
> + io_context = IOCONTEXT_NORMAL;
> + io_object = IOOBJECT_TEMP_RELATION;
> + }
> + else
> + {
> + io_context = IOContextForStrategy(strategy);
> + io_object = IOOBJECT_RELATION;
> + }
> +
> + TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum,
> + blocknum,
> + smgr->smgr_rlocator.locator.spcOid,
> + smgr->smgr_rlocator.locator.dbOid,
> + smgr->smgr_rlocator.locator.relNumber,
> + smgr->smgr_rlocator.backend,
> + true);
> +
> + if (persistence == RELPERSISTENCE_TEMP)
> + pgBufferUsage.local_blks_hit += 1;
> + else
> + pgBufferUsage.shared_blks_hit += 1;
> +
> + if (rel)
> + pgstat_count_buffer_hit(rel);
> +
> + pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
> +
> + if (VacuumCostActive)
> + VacuumCostBalance += VacuumCostPageHit;
> +}

I don't think "Count*" is a great name for something that does tracepoints and
vacuum cost balance accounting, the latter actually changes behavior of the
program due to the sleeps it injects.

The first alternative I have is AccountForBufferHit(), not great, but still
seems a bit better.

> From 4d737fa14f333abc4ee6ade8cb0340530695e887 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Fri, 23 Jan 2026 14:00:31 -0500
> Subject: [PATCH v5 5/5] Don't wait for already in-progress IO
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When a backend attempts to start a read on a buffer and finds that I/O
> is already in progress, it previously waited for that I/O to complete
> before initiating reads for any other buffers. Although the backend must
> still wait for the I/O to finish when later acquiring the buffer, it
> should not need to wait at read start time. Other buffers may be
> available for I/O, and in some workloads this waiting significantly
> reduces concurrency.
>
> For example, index scans may repeatedly request the same heap block. If
> the backend waits each time it encounters an in-progress read, the
> access pattern effectively degenerates into synchronous I/O. By
> introducing the concept of foreign I/O operations, a backend can record
> the buffer’s wait reference and defer waiting until WaitReadBuffers()
> when it actually acquires the buffer.
>
> In rare cases, a backend may still need to wait when starting a read if
> it encounters a buffer after another backend has set BM_IO_IN_PROGRESS
> but before the buffer descriptor’s wait reference has been set. Such
> windows should be brief and uncommon.
>
> Author: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Reviewed-by: Andres Freund <andres(at)anarazel(dot)de>
> Reviewed-by: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
> Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv

> +/*
> + * In AsyncReadBuffers(), when preparing a buffer for reading and setting
> + * BM_IO_IN_PROGRESS, the buffer may already have I/O in progress or may
> + * already contain the desired block. AsyncReadBuffers() must distinguish
> + * between these cases (and the case where it should initiate I/O) so it can
> + * mark an in-progress buffer as foreign I/O rather than waiting on it.
> + */
> +typedef enum PrepareReadBuffer_Status
> +{
> + READ_BUFFER_ALREADY_DONE,
> + READ_BUFFER_IN_PROGRESS,
> + READ_BUFFER_READY_FOR_IO,
> +} PrepareReadBuffer_Status;

I don't personally like mixing underscore and camel case naming within one
name.

I wonder if might be worth splitting this up in a refactoring and a
"behavioural change" commit. Might be too complicated.

Candidates for a split seem to be:
- Moving pgaio_io_acquire_nb() to earlier
- Introduce PrepareNewReadBufferIO/PrepareAdditionalReadBuffer without support
for READ_BUFFER_IN_PROGRESS
- introduce READ_BUFFER_IN_PROGRESS

> /*
> * We track various stats related to buffer hits. Because this is done in a
> * few separate places, this helper exists for convenience.
> @@ -1815,8 +1791,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
> * b) reports some time as waiting, even if we never waited
> *
> * we first check if we already know the IO is complete.
> + *
> + * Note that operation->io_return is uninitialized for foreign IO,
> + * so we cannot count that wait time.
> */

I'm confused - your comment says we can't count wait time with a foreign IO,
but then oes on to count foreign IO time? The lack of io_return just means we
can't do the cheaper pre-check for PGAIO_RS_UNKNOWN, no?

> - if (aio_ret->result.status == PGAIO_RS_UNKNOWN &&
> + if ((operation->foreign_io || aio_ret->result.status == PGAIO_RS_UNKNOWN) &&
> !pgaio_wref_check_done(&operation->io_wref))
> {
> instr_time io_start = pgstat_prepare_io_time(track_io_timing);
> @@ -1835,11 +1814,33 @@ WaitReadBuffers(ReadBuffersOperation *operation)
> Assert(pgaio_wref_check_done(&operation->io_wref));
> }
>
> - /*
> - * We now are sure the IO completed. Check the results. This
> - * includes reporting on errors if there were any.
> - */
> - ProcessReadBuffersResult(operation);
> + if (unlikely(operation->foreign_io))
> + {
> + Buffer buffer = operation->buffers[operation->nblocks_done];
> + BufferDesc *desc = BufferIsLocal(buffer) ?
> + GetLocalBufferDescriptor(-buffer - 1) :
> + GetBufferDescriptor(buffer - 1);
> + uint32 buf_state = pg_atomic_read_u64(&desc->state);
> +
> + if (buf_state & BM_VALID)
> + {
> + operation->nblocks_done += 1;
> + Assert(operation->nblocks_done <= operation->nblocks);
> +
> + CountBufferHit(operation->strategy,
> + operation->rel, operation->persistence,
> + operation->smgr, operation->forknum,
> + operation->blocknum + operation->nblocks_done - 1);

Probably worth including a comment explaining why we count this as a hit. IIRC
earlier versions had such a comment.

> +/*
> + * Local version of PrepareNewReadBufferIO(). Here instead of localbuf.c to
> + * avoid an external function call.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewLocalReadBufferIO(ReadBuffersOperation *operation,
> + Buffer buffer)

Hm, seems the test in 0002 should be extended to cover the the temp table case.

> +{
> + BufferDesc *desc = GetLocalBufferDescriptor(-buffer - 1);
> + uint64 buf_state = pg_atomic_read_u64(&desc->state);
> +
> + /* Already valid, no work to do */
> + if (buf_state & BM_VALID)
> + {
> + pgaio_wref_clear(&operation->io_wref);
> + return READ_BUFFER_ALREADY_DONE;
> + }

Is this reachable for local buffers?

> + pgaio_submit_staged();
> +
> + if (pgaio_wref_valid(&desc->io_wref))
> + {
> + operation->io_wref = desc->io_wref;
> + operation->foreign_io = true;
> + return READ_BUFFER_IN_PROGRESS;
> + }
> +
> + /*
> + * While it is possible for a buffer to have been prepared for IO but not
> + * yet had its wait reference set, there's no way for us to know that for
> + * temporary buffers. Thus, we'll prepare for own IO on this buffer.
> + */
> + return READ_BUFFER_READY_FOR_IO;

Is that actually possible? And would it be ok to just do start IO in that
case?

> +/*
> + * Try to start IO on the first buffer in a new run of blocks. If AIO is in
> + * progress, be it in this backend or another backend, we just associate the
> + * wait reference with the operation and wait in WaitReadBuffers(). This turns
> + * out to be important for performance in two workloads:
> + *
> + * 1) A read stream that has to read the same block multiple times within the
> + * readahead distance. This can happen e.g. for the table accesses of an
> + * index scan.
> + *
> + * 2) Concurrent scans by multiple backends on the same relation.
> + *
> + * If we were to synchronously wait for the in-progress IO, we'd not be able
> + * to keep enough I/O in flight.
> + *
> + * If we do find there is ongoing I/O for the buffer, we set up a 1-block
> + * ReadBuffersOperation that WaitReadBuffers then can wait on.
> + *
> + * It's possible that another backend has started IO on the buffer but not yet
> + * set its wait reference. In this case, we have no choice but to wait for
> + * either the wait reference to be valid or the IO to be done.
> + */
> +static PrepareReadBuffer_Status
> +PrepareNewReadBufferIO(ReadBuffersOperation *operation,
> + Buffer buffer)
> +{

I'm not sure I love "New" here, compared to "Additional". Perhaps "Begin" &
"Continue"? Or "First" & "Additional"? Or ...

> + uint64 buf_state;
> + BufferDesc *desc;
> +
> + if (BufferIsLocal(buffer))
> + return PrepareNewLocalReadBufferIO(operation, buffer);
> +
> + ResourceOwnerEnlarge(CurrentResourceOwner);
> + desc = GetBufferDescriptor(buffer - 1);
> +
> + for (;;)
> + {
> + buf_state = LockBufHdr(desc);

Perhaps worth adding an
Assert(buf_state & BM_TAG_VALID)?

> + /* Already valid, no work to do */
> + if (buf_state & BM_VALID)
> + {
> + UnlockBufHdr(desc);
> + pgaio_wref_clear(&operation->io_wref);

Perhaps we should clear &operation->io_wref once at the start? Because right
now it'll be cleared if BM_VALID and it'll be set if BM_IO_IN_PROGRESS &&
&desc->io_wref, but it won't be touched when in BM_IO_IN_PROGRESS without a
wref set. It seems either we should just touch &operation->io_wref if
BM_IO_IN_PROGRESS && pgaio_wref_valid(&desc->io_wref)
or we should reliably do it.

> + return READ_BUFFER_ALREADY_DONE;
> + }
> +
> + if (buf_state & BM_IO_IN_PROGRESS)
> + {
> + /* Join existing read */
> + if (pgaio_wref_valid(&desc->io_wref))
> + {
> + operation->io_wref = desc->io_wref;
> + operation->foreign_io = true;
> + UnlockBufHdr(desc);
> + return READ_BUFFER_IN_PROGRESS;
> + }
> +
> + /*
> + * If the wait ref is not valid but the IO is in progress, someone
> + * else started IO but hasn't set the wait ref yet. We have no
> + * choice but to wait until the IO completes.
> + */
> + UnlockBufHdr(desc);
> + pgaio_submit_staged();
> + WaitIO(desc);
> + continue;

Before this commit there was an explanation for this submit:

- /*
- * If this backend currently has staged IO, we need to submit the pending
- * IO before waiting for the right to issue IO, to avoid the potential for
- * deadlocks (and, more commonly, unnecessary delays for other backends).
- */

Seems that vanished?

> +/*
> + * When building a new IO from multiple buffers, we won't include buffers
> + * that are already valid or already in progress. This function should only be
> + * used for additional adjacent buffers following the head buffer in a new IO.
> + *
> + * Returns true if the buffer was successfully prepared for IO and false if it
> + * is rejected and the read IO should not include this buffer.
> + */
> +static bool
> +PrepareAdditionalReadBuffer(Buffer buffer)

I think it'd be good to mention that this may never wait for IO or such to
avoid deadlocks.

> + /* Check if we can start IO on the first to-be-read buffer */
> + if ((status = PrepareNewReadBufferIO(operation, buffers[nblocks_done])) <
> + READ_BUFFER_READY_FOR_IO)
> + {

I don't love this < bit. For one there's no mention in
PrepareReadBuffer_Status mentioning that the numerical order is important. Any
reason to not just test != READ_BUFFER_READY_FOR_IO?

The assignment inside the if also looks somewhat awkward. For while() loops
there's often not really a better way to write it, but here you could just as
well do the status assignment in a line before.

> + pgaio_io_release(ioh);
> + *nblocks_progress = 1;
> + if (status == READ_BUFFER_ALREADY_DONE)
> + {
> + /*
> + * Someone else has already completed this block, we're done.
> + *
> + * When IO is necessary, ->nblocks_done is updated in
> + * ProcessReadBuffersResult(), but that is not called if no IO is
> + * necessary. Thus update here.
> + */
> + operation->nblocks_done += 1;
> + Assert(operation->nblocks_done <= operation->nblocks);
> +
> + /*
> + * Report and track this as a 'hit' for this backend, even though
> + * it must have started out as a miss in PinBufferForBlock(). The
> + * other backend will track this as a 'read'.
> + */
> + CountBufferHit(operation->strategy,
> + operation->rel, operation->persistence,
> + operation->smgr, operation->forknum,
> + operation->blocknum + operation->nblocks_done - 1);
> + return false;
> + }
> +
> + /* The IO is already in-progress */
> + Assert(status == READ_BUFFER_IN_PROGRESS);
> + CheckReadBuffersOperation(operation, false);

I was about to suggest that there should be a CheckReadBuffersOperation() for
both returns here, but there already are CheckReadBuffersOperation after calls
to AsyncReadBuffers(), so I think this CheckReadBuffersOperation could just be
removed.

> /*
> - * Check if we can start IO on the first to-be-read buffer.
> - *
> - * If an I/O is already in progress in another backend, we want to wait
> - * for the outcome: either done, or something went wrong and we will
> - * retry.
> + * How many neighboring-on-disk blocks can we scatter-read into other
> + * buffers at the same time? In this case we don't wait if we see an I/O
> + * already in progress. We already set BM_IO_IN_PROGRESS for the head
> + * block, so we should get on with that I/O as soon as possible.
> */
> - if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
> + for (int i = nblocks_done + 1; i < operation->nblocks; i++)
> {
> - /*
> - * Someone else has already completed this block, we're done.
> - *
> - * When IO is necessary, ->nblocks_done is updated in
> - * ProcessReadBuffersResult(), but that is not called if no IO is
> - * necessary. Thus update here.
> - */
> - operation->nblocks_done += 1;
> - *nblocks_progress = 1;
> -
> - pgaio_io_release(ioh);
> - pgaio_wref_clear(&operation->io_wref);
> - did_start_io = false;
> + if (!PrepareAdditionalReadBuffer(buffers[i]))
> + break;
> + /* Must be consecutive block numbers. */
> + Assert(BufferGetBlockNumber(buffers[i - 1]) ==
> + BufferGetBlockNumber(buffers[i]) - 1);

Seems this assert could stand to be before the PrepareAdditionalReadBuffer(),
as it better hold true before we try to BM_IO_IN_PROGRESS?

I realize this is old, but since you're whacking this around...

> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 4017896f951..f85a9acc6ac 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -147,6 +147,8 @@ struct ReadBuffersOperation
> int flags;
> int16 nblocks;
> int16 nblocks_done;
> + /* true if waiting on another backend's IO */
> + bool foreign_io;
> PgAioWaitRef io_wref;
> PgAioReturn io_return;
> };

This adds an alignment-padding hole between nblocks_done and io_wref. Read
stream can allocate quite a few of these, so it's probably worth avoiding?

There's a padding hole between persistence and forknum, but that seems a bit
ugly to utilize. A bit less ugly if we swapped forknum and persistence.

Or we could make 'flags' a uint8/16 (flags should imo always be unsigned, and
there are just four flag bits right now).

But perhaps it's also not worth worrying about right now.

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Attributes.html#index-pure

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2026-03-17 17:31:51 Re: table AM option passing
Previous Message Heikki Linnakangas 2026-03-17 17:15:10 Assertion failure in hash_kill_items()