| From: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Subject: | Re: AIO / read stream heuristics adjustments for index prefetching |
| Date: | 2026-03-31 20:59:14 |
| Message-ID: | CAAKRu_ZcJnnxgDQaXjuhd37bnc-jKARBU4EDi+LUqgs+ZjmrgQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Mar 31, 2026 at 12:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> 0005+0006: Only increase distance when waiting for IO
In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.
Is there any reason for both the IO size and in-flight IO thresholds to
be 4? If they are meant to be the same, I think it would be better if
this were a macro.
This may not matter, but the old code checked in_flight_before > 5
before incrementing it for the current IO, while the new code counts
the current IO after pushing it onto the submission list. So the new
way is slightly more aggressive.
0006 "(read_stream: Only increase distance when waiting for IO)" looks
good to me from a code perspective. I don't yet have ideas for
handling potential parallel bitmapheapscan regressions.
> Unfortunately with io_uring the situation is more complicated, because
> io_uring performs reads synchronously during submission if the data is in the
> kernel page cache. This can reduce performance substantially compared to
> worker, because it prevents parallelizing the copy from the page cache.
> There is an existing heuristic for that in method_io_uring.c that adds a
> flag to the IO submissions forcing the IO to be processed asynchronously,
> allowing for parallelism. Unfortunately the heuristic is triggered by the
> number of IOs in flight - which will never become big enough to trigger
> after using "needed to wait" to control how far to read ahead.
>
> So 0005 expands the io_uring heuristic to also trigger based on the sizes
> of IOs - but that's decidedly not perfect, we e.g. have some experiments
> showing it regressing some parallel bitmap heap scan cases. It may be
> better to somehow tweak the logic to only trigger for worker.
Which logic would trigger only for worker? Do you mean only increasing
the distance when waiting?
> As is this has another issue, which is that it prevents IO combining in
> situations where it shouldn't, because right now we use the distance to
> control both. See 0008 for an attempt at splitting those concerns.
Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think). At least when I
was experimenting with this, the syscall overhead seemed
non-negligible. It's also true that this meant the memcpys couldn't be
parallelized, but system call overhead also seems to have been a
factor.
Setting aside more complicated prefetching systems, what we seem to be
saying is that for all "miss" cases (not in shared buffers) a distance
above 1 is advantageous (unless we are only doing 1 IO). I wonder if
there is something hacky we can do, like not decaying the distance
below io_combine_limit if there has been a recent miss, or growing it
to at least io_combine_limit if we aren't getting all hits.
> 0007: Make read_stream_reset()/end() not wait for IO
>
> This is a quite experimental, not really correct as-is, patch to avoid
> unnecessarily waiting for in-flight IO when read_stream_reset() is done
> while there's in-flight IO. This is useful for things like nestloop
> antijoins with quals on the inner side (without the qual we'd not trigger
> any readahead, as that's deferred in the index prefetching patch).
>
> As-is this will leave IOs visible in pg_aios for a while, potentially
> until the backends exit. That's not right.
Separating the problems: handle slot exhaustion seems like it could be
solved by having the backend discard IOs when it needs a handle and
none are free. Or is that not work we want to do in a hot path?
The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED". But the buffers staying pinned is different. It seems
like you'll need the backend to process the discarded IOs at some
point. Maybe it should do that before idling waiting for input?
When discarding IOs, I don't really understand why the foreign IO path
doesn't just clear its own wait ref (not the buffer descriptor one)
and move on -- instead you have it wait.
I haven't finished reviewing 0008 yet.
> One thing that's really annoying around this is that we have no infrastructure
> for testing that the heuristics keep working. It's very easy to improve one
> thing while breaking something else, without noticing, because everything
> keeps working.
Agreed that something here would be useful.
- Melanie