From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-27 14:36:53
Message-ID: 293a4735-79a4-499c-9a36-870ee9286281@vondra.me
Lists: pgsql-hackers

On 8/26/25 17:06, Tomas Vondra wrote:
>
>
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>> Thanks. Based on the testing so far, the patch seems to be a substantial
>>> improvement. What's needed to make this prototype committable?
>>
>> Mainly some testing infrastructure that can trigger this kind of stream. The
>> logic is too finnicky for me to commit it without that.
>>
>
> So, what would that look like? The "naive" approach to testing is to
> simply generate a table/index, producing the right sequence of blocks.
> That shouldn't be too hard, it'd be enough to have an index that
>
> - has ~2-3 rows per value, on different heap pages
> - the values "overlap", e.g. like this (value,page)
>
> (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...
>
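For the record, a rough sketch of the kind of generator I have in mind -
the table/column names and the row width are just placeholders. The idea
is that each 8kB heap page gets three wide rows with values p, p+1, p+2,
so every value spans three consecutive heap pages:

CREATE TABLE t_overlap (a int, filler text);
-- keep the filler uncompressed, so only ~3 wide rows fit per 8kB page
ALTER TABLE t_overlap ALTER COLUMN filler SET STORAGE PLAIN;
INSERT INTO t_overlap
SELECT p + k, repeat('x', 2500)
  FROM generate_series(1, 10000) p, generate_series(0, 2) k
 ORDER BY p, k;
CREATE INDEX t_overlap_a_idx ON t_overlap (a);

An index scan on a range of "a" then produces overlapping runs of heap
blocks (0,1,2), (1,2,3), (2,3,4), ... which is the pattern above. Scale
the series up as needed.
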
> Another approach would be to test this at C level, sidestepping the
> query execution entirely. We'd have a "stream generator" that just
> generates a sequence of blocks of our own choosing (could be hard-coded,
> some pattern, read from a file ...), and feed it into a read stream.
>
> But how would we measure success for these tests? I don't think we want
> to look at query duration, that's very volatile.
>
>>
>>> I assume this is PG19+ improvement, right? It probably affects PG18 too,
>>> but it's harder to hit / the impact is not as bad as on PG19.
>>
>> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
>> where it's a real issue. I can repro a slowdown when using many parallel
>> seqscans with debug_io_direct=data - but that's even slower in 17...
>>
>
> Makes sense.
>
>>
>>> On a related note, my test that generates random datasets / queries, and
>>> compares index prefetching with different io_method values found a
>>> pretty massive difference between worker and io_uring. I wonder if this
>>> might be some issue in io_method=worker.
>>
>>> while with index prefetching (with the aio prototype patch), it looks
>>> like this:
>>>
>>> QUERY PLAN
>>> ----------------------------------------------------------------------
>>> Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>> Index Cond: ((a >= 16150) AND (a <= 4540437))
>>> Index Searches: 1
>>> Prefetch Distance: 2.032
>>> Prefetch Count: 868165
>>> Prefetch Stalls: 2140228
>>> Prefetch Skips: 6039906
>>> Prefetch Resets: 0
>>> Stream Ungets: 0
>>> Stream Forwarded: 4
>>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>> Buffers: shared hit=2577599 read=455610
>>> Planning:
>>> Buffers: shared hit=78 read=26 dirtied=1
>>> Planning Time: 1.032 ms
>>> Execution Time: 3150.578 ms
>>> (16 rows)
>>>
>>> So it's about 2x slower. The prefetch distance collapses, because
>>> there's a lot of cache hits (about 50% of requests seem to be hits of
>>> already visited blocks). I think that's a problem with how we adjust the
>>> distance, but I'll post about that separately.
>>>
>>> Let's try to simply set io_method=io_uring:
>>>
>>> QUERY PLAN
>>> ----------------------------------------------------------------------
>>> Index Scan using idx on t (actual rows=9048576.00 loops=1)
>>> Index Cond: ((a >= 16150) AND (a <= 4540437))
>>> Index Searches: 1
>>> Prefetch Distance: 2.032
>>> Prefetch Count: 868165
>>> Prefetch Stalls: 2140228
>>> Prefetch Skips: 6039906
>>> Prefetch Resets: 0
>>> Stream Ungets: 0
>>> Stream Forwarded: 4
>>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>>> Buffers: shared hit=2577599 read=455610
>>> Planning:
>>> Buffers: shared hit=78 read=26
>>> Planning Time: 2.212 ms
>>> Execution Time: 1837.615 ms
>>> (16 rows)
>>>
>>> That's much closer to master (and the difference could be mostly noise).
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced if the
>> workers are idle (and the CPU they get scheduled on is clocked down). The
>> latency impact of that is small, but if you never actually get to do much
>> readahead it can be visible.
>>
>
> Yeah, that's quite possible. If I understand the mechanics of this, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
>
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted process to get picked up by a worker. And % of time a worker
> spent handling I/O.
>
After investigating this a bit more, I'm not sure it's due to the workers
going idle / CPUs getting clocked down, etc. I did an experiment, booting
with idle=poll, which AFAICS should prevent the cores from idling, and it
made pretty much no difference - the timings didn't change. It can still
be about IPC, but it does not seem to be about clocked-down cores or
anything like that. Maybe.

I ran a more extensive set of tests, varying additional parameters:

- io_method: io_uring / worker (3 or 12 workers)
- shared_buffers: 512MB / 16GB (the table is ~3GB)
- checksums: on / off
- eic: 16 / 100
- different SSD devices

and comparing master vs. builds with different variants of the patches:

- master
- patched (index prefetching)
- no-explain (EXPLAIN ANALYZE reverted)
- munro / vondra (WIP patches preventing distance collapse)
- munro-no-explain / vondra-no-explain (should be obvious)

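For reference, a single cell of that matrix boils down to a handful of
GUCs (checksums are an initdb-time choice, and io_method / shared_buffers
only take effect after a restart), roughly:

ALTER SYSTEM SET io_method = 'worker';           -- or 'io_uring'
ALTER SYSTEM SET io_workers = 3;                 -- or 12
ALTER SYSTEM SET shared_buffers = '512MB';       -- or '16GB'
ALTER SYSTEM SET effective_io_concurrency = 16;  -- or 100
-- restart, then run the query below on each build
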
Peter and I have been speculating that maybe the extra read_stream stats
add a lot of overhead, hence the "no-explain" builds to test that. All of
this is with the recent "aio" patch eliminating I/O waits.

Attached are results from my "ryzen" machine (the xeon is very similar),
sliced/colored to show the patterns. It's for this query:

SELECT * FROM (
    SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437
    ORDER BY a ASC
) OFFSET 1000000000;

This is the same query as before, except that it's not run under EXPLAIN
ANALYZE, and it has the OFFSET so that it does not send any data back.
It's a bit of an adversarial query - it doesn't seem to benefit from
prefetching.

There are some very clear patterns in the results.
In the "cold" (uncached) runs:
* io_uring does much better, with limited regressions (not negligible,
but limited compared to io_method=worker). A hint this may really be
about IPC?
* With worker, there's a massive regression with the basic prefetching
patch (when the distance collapses to 2.0). But then it mostly recovers
with the increased distance, and even does a bit better than master (or
on part with io_uring)
In the "warm" runs (with everything cached in page cache, possibly even
in shared buffers):
* With 16GB shared buffers, the regressions are about the same as for
cold runs, both for io_uring and worker. Roughly ~5%, give or take. The
extra read_stream stats seem to add ~3%.
* With 512MB it's much more complicated. io_uring regresses much more
(relative to master), for some reason. For cold runs it was ~30%, now
it's ~50%. Seems weird, but I guess there's fixed overhead and it's more
visible with data in cache.
* For worker (with buffers=512MB), the basic patch clearly causes a
massive regression, it's about 2x slower. I don't really understand why
- the assumption was this is because of idling, but is it, if it happens
with idle=poll?
In top, I see the backend takes ~60% and the io worker ~40% (so they
clearly ping-pong the work). 40% utilization does not seem particularly
low (and with idle=poll it should not idle anyway).

I realize there's IPC with worker, and it's going to be more visible for
cases that end up doing no prefetching. But isn't a 2x regression a bit
too high? I wouldn't have expected that. Is there a good way to measure
how expensive the IPC is? (One crude idea - wait-event sampling - is
sketched below.)

* With the increased prefetch distance, the regression drops to ~25%
(for worker). And in top I see the backend takes ~100% and the single
worker uses ~60%. But the 25% is without checksums. With checksums, the
regression is roughly 5%.

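As for measuring the IPC cost, the crudest thing I can think of is simply
sampling wait events in a loop (e.g. with \watch in psql). This assumes
the I/O workers show up in pg_stat_activity with backend_type 'io worker':

SELECT pid, backend_type, state, wait_event_type, wait_event
  FROM pg_stat_activity
 WHERE backend_type IN ('client backend', 'io worker');

That only shows where the backend and the workers wait, though, not how
long a submitted I/O sits in the queue before a worker picks it up.
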
I'm not sure what to think about this.
--
Tomas Vondra
Attachments:
  ryzen-cold.pdf    (application/pdf, 78.6 KB)
  ryzen-warm.pdf    (application/pdf, 77.1 KB)
  distance-test.sh  (application/x-shellscript, 1.8 KB)