From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-26 15:06:11
Message-ID: 0dd33755-cab8-49c8-b1ed-698732577fbb@vondra.me
Lists: pgsql-hackers
On 8/26/25 01:48, Andres Freund wrote:
> Hi,
>
> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>> Thanks. Based on the testing so far, the patch seems to be a substantial
>> improvement. What's needed to make this prototype committable?
>
> Mainly some testing infrastructure that can trigger this kind of stream. The
> logic is too finicky for me to commit it without that.
>
So, what would that look like? The "naive" approach to testing is to
simply generate a table/index producing the right sequence of blocks.
That shouldn't be too hard; it'd be enough to have an index where
- each value has ~2-3 rows, on different heap pages
- the values "overlap", e.g. like this (value, page):
  (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ...
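For instance, something like this might do (just a sketch - the names
and sizes are made up, and the actual page layout would need to be
verified, e.g. by looking at ctid):

  -- ~3 rows per 8kB heap page, 3 copies of each value; the ORDER BY
  -- interleaves the copies, so value v lands on roughly pages p, p+1,
  -- p+2 while value v+1 lands on pages p+1, p+2, p+3
  CREATE TABLE t (a int, filler text);

  INSERT INTO t
  SELECT v, repeat('x', 2500)
    FROM generate_series(1, 10000) AS v,
         generate_series(0, 2) AS o
   ORDER BY v + o, o;

  CREATE INDEX idx ON t (a);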
Another approach would be to test this at the C level, sidestepping query
execution entirely. We'd have a "stream generator" that just produces a
sequence of blocks of our own choosing (hard-coded, some pattern, read
from a file, ...) and feeds it into a read stream.
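Something like this, perhaps (a sketch against the current read_stream.h
API; the callback/state names are made up):

  #include "postgres.h"

  #include "storage/bufmgr.h"
  #include "storage/read_stream.h"

  /* replay a canned sequence of block numbers into a read stream */
  typedef struct TestStreamState
  {
      BlockNumber *blocks;    /* e.g. 1,2,3, 2,3,4, ... */
      int          nblocks;
      int          pos;
  } TestStreamState;

  static BlockNumber
  test_stream_cb(ReadStream *stream, void *callback_private_data,
                 void *per_buffer_data)
  {
      TestStreamState *state = (TestStreamState *) callback_private_data;

      if (state->pos >= state->nblocks)
          return InvalidBlockNumber;  /* terminates the stream */
      return state->blocks[state->pos++];
  }

  static void
  test_stream_run(Relation rel, TestStreamState *state)
  {
      ReadStream *stream;
      Buffer      buf;

      stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                          rel, MAIN_FORKNUM,
                                          test_stream_cb, state, 0);

      /* consume the whole stream, releasing each buffer right away */
      while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
          ReleaseBuffer(buf);

      read_stream_end(stream);
  }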
But how would we measure success for these tests? I don't think we want
to look at query duration; that's far too volatile.
>
>> I assume this is a PG19+ improvement, right? It probably affects PG18 too,
>> but it's harder to hit / the impact is not as bad as on PG19.
>
> Yea. It does apply to 18 too, but I can't come up with realistic scenarios
> where it's a real issue. I can repro a slowdown when using many parallel
> seqscans with debug_io_direct=data - but that's even slower in 17...
>
Makes sense.
>
>> On a related note, my test that generates random datasets / queries and
>> compares index prefetching with different io_method values found a
>> pretty massive difference between worker and io_uring. I wonder if this
>> might be some issue in io_method=worker.
>
>> while with index prefetching (with the aio prototype patch), it looks
>> like this:
>>
>> QUERY PLAN
>> ----------------------------------------------------------------------
>> Index Scan using idx on t (actual rows=9048576.00 loops=1)
>> Index Cond: ((a >= 16150) AND (a <= 4540437))
>> Index Searches: 1
>> Prefetch Distance: 2.032
>> Prefetch Count: 868165
>> Prefetch Stalls: 2140228
>> Prefetch Skips: 6039906
>> Prefetch Resets: 0
>> Stream Ungets: 0
>> Stream Forwarded: 4
>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>> Buffers: shared hit=2577599 read=455610
>> Planning:
>> Buffers: shared hit=78 read=26 dirtied=1
>> Planning Time: 1.032 ms
>> Execution Time: 3150.578 ms
>> (16 rows)
>>
>> So it's about 2x slower. The prefetch distance collapses, because
>> there's a lot of cache hits (about 50% of requests seem to be hits of
>> already visited blocks). I think that's a problem with how we adjust the
>> distance, but I'll post about that separately.
>>
>> Let's try to simply set io_method=io_uring:
>>
>> QUERY PLAN
>> ----------------------------------------------------------------------
>> Index Scan using idx on t (actual rows=9048576.00 loops=1)
>> Index Cond: ((a >= 16150) AND (a <= 4540437))
>> Index Searches: 1
>> Prefetch Distance: 2.032
>> Prefetch Count: 868165
>> Prefetch Stalls: 2140228
>> Prefetch Skips: 6039906
>> Prefetch Resets: 0
>> Stream Ungets: 0
>> Stream Forwarded: 4
>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412
>> Buffers: shared hit=2577599 read=455610
>> Planning:
>> Buffers: shared hit=78 read=26
>> Planning Time: 2.212 ms
>> Execution Time: 1837.615 ms
>> (16 rows)
>>
>> That's much closer to master (and the difference could be mostly noise).
>>
>> I'm not sure what's causing this, but almost all regressions my script
>> is finding look like this - always io_method=worker, with distance close
>> to 2.0. Is this some inherent io_method=worker overhead?
>
> I think what you're observing might be the inherent IPC / latency
> overhead of the worker based approach. This is particularly pronounced if the
> workers are idle (and the CPU they get scheduled on is clocked down). The
> latency impact of that is small, but if you never actually get to do much
> readahead it can be visible.
>
Yeah, that's quite possible. If I understand the mechanics correctly, this
can behave in a rather unexpected way - lowering the load (i.e. issuing
fewer I/O requests) makes the workers "more idle", and therefore more
likely to get suspended / clocked down ...
Is there a good way to measure whether this is what's happening, and how
big the impact is? For example, it'd be interesting to know how long it
takes for a submitted I/O to get picked up by a worker, and what % of
time each worker spends actually handling I/O.
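Maybe something as crude as this (purely a sketch - none of these hook
points exist, and the timestamp would really have to live in the shared
IO handle to be visible across processes):

  #include "postgres.h"

  #include "portability/instr_time.h"

  /* stamp the IO at submission ... */
  static instr_time submit_time;   /* really: a field in the shared handle */

  static void
  on_submit(void)
  {
      INSTR_TIME_SET_CURRENT(submit_time);
  }

  /* ... and report the queue delay when a worker picks it up */
  static void
  on_worker_dequeue(void)
  {
      instr_time now;

      INSTR_TIME_SET_CURRENT(now);
      INSTR_TIME_SUBTRACT(now, submit_time);
      elog(DEBUG1, "io picked up after %.1f us",
           INSTR_TIME_GET_DOUBLE(now) * 1e6);
  }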
regards
--
Tomas Vondra