| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
| Subject: | Re: AIO / read stream heuristics adjustments for index prefetching |
| Date: | 2026-04-03 23:10:48 |
| Message-ID: | 3gkuvs3lz3u3skuaxfkxnsysfqslf2srigl6546vhesekve6v2@va3r5esummvg |
| Lists: | pgsql-hackers |
Hi,
There are a bunch of heuristics mentioned in the following proposed commit:
On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async. Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.
> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important. In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.
Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.
Previously I'd just manually run fio with different options; I made it a bit
more systematic with the attached (only halfway hand-written) script.
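For the curious, here's a rough sketch of what such a sweep can look like. This is an illustrative reconstruction, not the attached bench_async_uring.py; it assumes fio's io_uring engine and its force_async option (which flags submissions with IOSQE_ASYNC), and that the JSON output fields used below are available in your fio version:

```python
# Illustrative sketch of a fio sweep over IO size / depth / IOSQE_ASYNC.
# Not the attached bench_async_uring.py; flag names and JSON fields are
# my reading of the fio documentation.
import itertools
import json
import subprocess

BLCKSZ = 8192  # nblocks is in multiples of 8KB


def fio_cmd(nblocks, iodepth, force_async, filename="/srv/fio/testfile"):
    """Build the fio command line for one cell of the sweep."""
    cmd = [
        "fio", "--name=bench",
        "--ioengine=io_uring",
        "--rw=read",
        f"--bs={nblocks * BLCKSZ}",
        f"--iodepth={iodepth}",
        f"--filename={filename}",
        "--time_based", "--runtime=5",
        "--direct=0",            # buffered IO, so data comes from the page cache
        "--output-format=json",
    ]
    if force_async:
        # fio's io_uring engine: force_async=N sets IOSQE_ASYNC every N
        # requests; N=1 flags every submission.
        cmd.append("--force_async=1")
    return cmd


def run_sweep():
    """Run the full sweep and print a tsv like the tables below."""
    print("nblocks\tiod\tasync\tbw_gib_s\tlat_usec")
    for nblocks, iodepth, async_ in itertools.product(
            (1, 2, 4, 8), (1, 2, 4, 8, 16, 32), (0, 1)):
        out = subprocess.run(fio_cmd(nblocks, iodepth, async_),
                             capture_output=True, check=True).stdout
        job = json.loads(out)["jobs"][0]["read"]
        bw_gib = job["bw_bytes"] / (1 << 30)        # bytes/s -> GiB/s
        lat_us = job["lat_ns"]["mean"] / 1000       # ns -> usec
        print(f"{nblocks}\t{iodepth}\t{async_}\t{bw_gib:.4f}\t{lat_us:.4f}")
```

Calling run_sweep() on a machine with fio installed produces a table shaped like the ones below.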
I attached two different results, once when allowing access to multiple cores,
and once with a single core (simulating a very busy machine).
(nblocks is in multiples of 8KB)
Multi-core:
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2075 1.5802
1 1 1 1.0462 6.9652
1 2 0 4.1362 3.4533
1 2 1 1.9284 7.6040
1 4 0 4.0030 7.3720
1 4 1 4.2713 6.9086
1 8 0 4.1653 14.4072
1 8 1 4.3301 13.8365
1 16 0 4.1829 28.9216
1 16 1 4.3006 28.1261
1 32 0 4.0735 59.6232
1 32 1 4.3248 56.1614
I.e. at nblocks=1, there's pretty much no gain from async; latency increases
markedly at the low end and only just about catches up at the high end.
Around iodepth 4 the loss from async is nonexistent or minimal.
2 1 0 5.7289 2.4261
2 1 1 1.8708 7.7466
2 2 0 5.7964 5.0144
2 2 1 3.3749 8.7417
2 4 0 5.8434 10.2023
2 4 1 7.9783 7.3977
2 8 0 5.8166 20.7226
2 8 1 8.2545 14.5431
2 16 0 5.8215 41.6613
2 16 1 8.2354 29.3879
2 32 0 5.6530 86.0286
2 32 1 8.3218 58.3826
With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited. Latency already starts to be better at iodepth 4.
4 1 0 7.4131 3.8807
4 1 1 3.2133 9.1827
4 2 0 7.3150 8.0854
4 2 1 5.4983 10.8039
4 4 0 7.2784 16.5097
4 4 1 11.2717 10.5699
4 8 0 7.2873 33.2331
4 8 1 16.6299 14.4164
4 16 0 7.1606 67.8777
4 16 1 16.9794 28.4981
4 32 0 6.2954 154.6834
4 32 1 16.3686 59.3610
With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.
8 1 0 8.0403 7.3503
8 1 1 4.6038 12.7202
8 2 0 8.0052 14.9161
8 2 1 8.5176 13.9987
8 4 0 8.1519 29.6698
8 4 1 14.8211 16.1640
8 8 0 7.8525 61.8612
8 8 1 27.5860 17.4434
8 16 0 6.8887 141.3268
8 16 1 34.1307 28.3463
8 32 0 6.9031 282.2350
8 32 1 38.2430 50.7700
With nblocks=8, async is faster already at iodepth 2.
64 1 0 9.1983 52.6768
64 1 1 8.1505 59.5486
128 1 0 7.5442 128.8704
128 1 1 7.3481 132.2355
Somewhere between nblocks=64 and 128, we reach the point where there's
basically no loss at iodepth 1.
This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.
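Expressed as a predicate, those thresholds read roughly like this. This is an illustrative Python sketch, not the actual C code from the patch; the names are made up, and deep_depth stands in for the pre-existing depth-only trigger, whose actual value isn't quoted here:

```python
# Illustrative sketch of the IOSQE_ASYNC heuristic discussed above; the
# real logic lives in C in the patch and may differ in detail. All names
# and the deep_depth value are hypothetical.
BLCKSZ = 8192


def want_iosqe_async(io_bytes, queue_depth,
                     large_nblocks=4, large_depth=4, deep_depth=8):
    """Return True if this IO should be flagged IOSQE_ASYNC.

    large_nblocks / large_depth follow the numbers validated above;
    deep_depth is a placeholder for the pre-existing depth-only
    heuristic.
    """
    nblocks = io_bytes // BLCKSZ

    # Large IOs benefit from background processing already at moderate
    # queue depths: parallelizing the page-cache copy dominates.
    if nblocks >= large_nblocks and queue_depth > large_depth:
        return True

    # Small IOs only benefit once the queue is deep enough that the
    # context-switch overhead is amortized.
    return queue_depth > deep_depth
```

E.g. a 4-block (32KB) IO at queue depth 8 would be flagged async, while a single-block IO at depth 2 would not.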
Single-core:

Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.
At iodepth 1 (worst case, a context switch for every IO):
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2324 1.5692
1 1 1 1.7883 3.9574
2.36x bw regression
2 1 0 5.7914 2.4004
2 1 1 2.9585 4.8417
1.96x bw regression
4 1 0 7.3171 3.9242
4 1 1 4.2450 6.8171
1.7x bw regression
8 1 0 8.1162 7.2674
8 1 1 5.7536 10.2948
1.4x bw regression
16 1 0 8.8559 13.5212
16 1 1 7.1163 16.8277
1.6x bw regression
But the IO depth would not stay at 1 in the case of postgres with the proposed
changes; it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.
Therefore, here's the same comparison at a deeper IO depth:
nblocks iod async bw_gib_s lat_usec
1 16 0 4.1094 29.4339
1 16 1 3.3922 35.7044
1.21x bw regression
2 16 0 5.8381 41.5402
2 16 1 4.8104 50.4571
1.21x bw regression
4 16 0 7.1204 68.2612
4 16 1 5.6479 86.0973
1.26x bw regression
8 16 0 7.0780 137.5520
8 16 1 6.1687 157.8805
1.14x bw regression
16 16 0 7.4523 261.4281
16 16 1 6.7192 290.0837
1.10x bw regression
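For reference, the regression factors quoted above are simply sync bandwidth divided by async bandwidth, rounded down to two digits. Recomputed from the iodepth=16 rows above:

```python
# Recompute the bandwidth regression factors quoted above from the
# iodepth=16 single-core rows: sync bw / async bw, rounded down to two
# decimal places (matching how the figures above were derived).
rows = {
    # nblocks: (bw_sync_gib_s, bw_async_gib_s)
    1: (4.1094, 3.3922),
    2: (5.8381, 4.8104),
    4: (7.1204, 5.6479),
    8: (7.0780, 6.1687),
    16: (7.4523, 6.7192),
}

for nblocks, (sync_bw, async_bw) in rows.items():
    factor = int(sync_bw / async_bw * 100) / 100
    print(f"nblocks={nblocks}: {factor:.2f}x bw regression")
```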
This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.
I don't think there are very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single-block IOs, and we
aren't. With the defaults of io_combine_limit=16 and effective_io_concurrency=16,
we'd end up with a 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.
Seems ok to me.
I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).
Zen 4 Laptop:
nblocks iod async bw_gib_s lat_usec
1 1 0 6.0989 1.1408
1 1 1 1.4477 5.1246
1 2 0 6.9600 2.0827
1 2 1 2.8750 5.1711
1 4 0 7.0283 4.2307
1 4 1 8.9174 3.3169
Surprisingly, there's a bigger difference between sync/async at iod=1, but
it's again similar around iod=4.
4 1 0 14.5638 1.9616
4 1 1 5.1245 5.8016
4 2 0 14.8867 3.9607
4 2 1 12.1841 4.8662
4 4 0 14.8678 8.0764
4 4 1 21.5077 5.5417
Similar.
16 1 0 21.0754 5.5891
16 1 1 12.6180 9.4753
16 2 0 20.2770 11.8353
16 2 1 24.3277 9.8172
At nblocks=16, iod=2 already starts to be faster.
Greetings,
Andres Freund
| Attachment | Content-Type | Size |
|---|---|---|
| bench_async_uring.py | text/x-python | 3.0 KB |
| results_manycore.tsv | text/tab-separated-values | 2.1 KB |
| results_onecore.tsv | text/tab-separated-values | 2.1 KB |