Re: Don't synchronously wait for already-in-progress IO in read stream

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Don't synchronously wait for already-in-progress IO in read stream
Date: 2026-02-05 16:56:24
Message-ID: CAN55FZ17ScU--nGM8cjjPtwPMOMonKZ9vMehPcc2sOQNLVskvA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thank you for working on this!

On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > I suppose (or perhaps vaguely recall from an off-list discussion?)
> > that you must have considered merging the new
> > is-it-already-in-progress check into ReadBuffersCanStartIO(). I
> > suppose the nowait argument would become a tri-state argument with a
> > value that means "don't wait for an in-progress read, just give me the
> > IO handle so I can 'join' it as a foreign waiter", with a new output
> > argument to receive the handle, or something along those lines, and I
> > guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
> > name. That'd remove the double-check (extra header lock-unlock cycle)
> > and associated race that can cause that rare synchronous wait (which
> > must still happen sometimes in the duelling concurrent scan use
> > case?), at the slight extra cost of having to allocate and free a
> > handle in the case of repeated blocks (eg the index->heap scan use
> > case), but at least that's just backend-local list pushups and doesn't
> > do extra work otherwise. Is there some logical problem with that
> > approach? Is the code just too clumsy?
>
> Attached v3 basically does what you suggested above. Now, we should
> only have to wait if the backend encounters a buffer after another
> backend has set BM_IO_IN_PROGRESS but before that other backend has
> set the buffer descriptor's wait reference.
>
> 0001 and 0002 are Andres' test-related patches. 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).
> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.

I confirm that I am able to produce the regression that Andres
mentioned with the patches excluding 0005, and 0005 fixes the
regression.

> A few things about the patch that I'm not sure about:
>
> - I don't know if pgaio_submit_staged() is in all the right places
> (and not in too many places). I basically do it before we would wait
> when starting read IO on the buffer. In the permanent buffers case,
> that's now only when BM_IO_IN_PROGRESS is set but the wait reference
> isn't valid yet. This can't happen in the temporary buffers case, so
> I'm not sure we need to call pgaio_submit_staged().

I agree with you, I think we don't need to call pgaio_submit_staged()
for the temporary buffers case.

> - StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
> We could refactor it so that it works for AsyncReadBuffers(), but that
> would involve returning something that distinguishes between
> IO_IN_PROGRESS and IO already done. And StartBufferIO()'s comment
> explicitly says it wants to avoid that.
> If we keep my structure, with AsyncReadBuffers() using its own helper
> (PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
> like we need some way to make it clear what StartBufferIO() is for.
> I'm not sure what would collectively describe its current users,
> though. It also now has no non-test callers passing nowait as true.
> However, once we add write combining, it will, so it seems like we
> should leave it the way it is to avoid churn. However, other
> developers might be confused in the interim.

I don't have a comment for this.

> - In the 004_read_stream tests, I wonder if there is a way to test
> that we don't wait for foreign IO until WaitReadBuffers(). We have
> tests for the stream accessing the same block, which in some cases
> will exercise the foreign IO path. But it doesn't distinguish between
> the old behavior -- waiting for the IO to complete when starting read
> IO on it -- and the new behavior -- not waiting until
> WaitReadBuffers(). That may not be possible to test, though.

Won't 'stream accessing the same block test' almost always test the
new behavior (not waiting until WaitReadBuffers())? Having dedicated
tests for both cases would be helpful, though.

My review:

0001:

0001 LGTM.
---------------

0002:

diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+ $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+ $node->start();
+ test_io_method($method, $node);
+ $node->stop();
+}

This seems wrong, we always test io_method=worker. I think we need to
replace 'worker' with the $method. Also, we can add check below to the
test_io_method function in the 004_read_stream.pl:
```
is($node->safe_psql('postgres', 'SHOW io_method'),
$io_method, "$io_method: io_method set correctly");
```

Other than that, 0002 LGTM.
---------------

0003:

> 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).

0003 LGTM.
---------------

> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.

0004:

Nitpick but I prefer something like TrackBufferHit() or
CountBufferHit() as a function name instead of ProcessBufferHit().
ProcessBufferHit() gives the impression that the function is doing a
job more than it currently does. Other than that, 0004 LGTM.
---------------

0005:

0005 LGTM. However, I am still looking into the AIO code. I wanted to
share my review so far.
---------------

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-02-05 16:58:41 Re: Fix pg_stat_get_backend_wait_event() for aux processes
Previous Message Andres Freund 2026-02-05 16:19:46 Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible