From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Subject: | Re: index prefetching |
Date: | 2025-08-13 23:11:07 |
Message-ID: | dfb34cd5-9e99-41aa-b76f-15d449fbd3d2@vondra.me |
Lists: | pgsql-hackers |
On 8/13/25 23:57, Peter Geoghegan wrote:
> On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> It's also not very surprising this happens with backwards scans more.
>> The I/O is apparently much slower (due to missing OS prefetch), so we're
>> much more likely to hit the I/O limits (max_ios and various other limits
>> in read_stream_start_pending_read).
>
> But there's no OS prefetch with direct I/O. At most, there might be
> some kind of readahead implemented in the SSD's firmware.
>
Good point, I keep forgetting direct I/O means no OS read-ahead. Not
sure if there's a good way to determine whether the SSD can do something
like that (and how well). I wonder if there's a way to do backward
sequential scans in fio.
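If fio can't easily be made to do this, a crude stand-in is a small script that reads a large file in 8 kB blocks, front-to-back or back-to-front, one synchronous read at a time. This is only a hypothetical sketch: `block_offsets` and `read_blocks` are made-up names, the 8192-byte block size is an assumption matching PostgreSQL's default BLCKSZ, and O_DIRECT is Linux-only. At queue depth 1 it says nothing about what aggressive prefetching could achieve, but a big gap between the two directions would hint at readahead in the drive firmware.

```python
import mmap
import os
import time

BLCKSZ = 8192  # assumption: PostgreSQL's default block size


def block_offsets(file_size: int, blcksz: int = BLCKSZ, backward: bool = True) -> list[int]:
    """Offsets of every full block in the file, in scan order."""
    nblocks = file_size // blcksz  # a trailing partial block is ignored
    order = range(nblocks - 1, -1, -1) if backward else range(nblocks)
    return [i * blcksz for i in order]


def read_blocks(path: str, blcksz: int = BLCKSZ, backward: bool = True,
                direct: bool = True) -> tuple[int, float]:
    """Read every full block of `path` synchronously, one at a time.

    Returns (blocks_read, elapsed_seconds). With direct=True the file is
    opened with O_DIRECT (Linux only), bypassing the page cache and kernel
    readahead, so only readahead inside the device itself can help.
    """
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # AttributeError where O_DIRECT is unavailable
    fd = os.open(path, flags)
    try:
        size = os.fstat(fd).st_size
        # Anonymous mmap memory is page-aligned, which meets O_DIRECT's
        # buffer alignment requirement on typical Linux filesystems.
        buf = mmap.mmap(-1, blcksz)
        try:
            count = 0
            start = time.perf_counter()
            for off in block_offsets(size, blcksz, backward):
                if os.preadv(fd, [buf], off) != blcksz:
                    raise OSError(f"short read at offset {off}")
                count += 1
            return count, time.perf_counter() - start
        finally:
            buf.close()
    finally:
        os.close(fd)
```

The experiment would be to run `read_blocks(path, backward=False)` and `read_blocks(path, backward=True)` on the same file and compare the elapsed times; the file needs to be much larger than any cache on the device for the comparison to mean anything.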
> Even assuming that the SSD issue is relevant, I can't help but suspect
> that something is off here. To recap from yesterday, the forwards scan
> showed "I/O Timings: shared read=45.313" and "Execution Time: 330.379
> ms" on my system, while the equivalent backwards scan showed "I/O
> Timings: shared read=194.774" and "Execution Time: 1236.655 ms". Does
> that kind of disparity *really* make sense with a modern NVME SSD such
> as this (I use a Samsung 980 pro), in the context of a scan that can
> use aggressive prefetching? Are we really, truly operating at the
> limits of what is possible with this hardware, for this backwards
> scan?
>
Hard to say. Would be interesting to get some numbers using fio. I'll
try to do that for my devices.
On my Ryzen machine (which has a RAID0 of 4 Samsung 990 Pro drives), I
see these stats:
1) Q1 ASC
Buffers: shared hit=4545 read=52801
I/O Timings: shared read=127.700
Execution Time: 432.266 ms
2) Q1 DESC
Buffers: shared hit=7406 read=52801
I/O Timings: shared read=306.676
Execution Time: 769.246 ms
3) Q2 ASC
Buffers: shared hit=32605 read=52801
I/O Timings: shared read=127.610
Execution Time: 1047.333 ms
4) Q2 DESC
Buffers: shared hit=36105 read=52801
I/O Timings: shared read=157.667
Execution Time: 1140.286 ms
Those timings are much better (more stable) than the numbers I shared
yesterday (those were from my laptop).
All of this is with direct I/O and 12 workers.
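To put the ASC/DESC gap into numbers, the figures above can be turned into DESC/ASC ratios and into an average I/O wait per block read. This is plain arithmetic on the posted EXPLAIN output; the dictionary layout and helper names are only for illustration. The per-block figure is the reported "I/O Timings" divided by the blocks read, i.e. an average, not a device latency.

```python
# EXPLAIN (ANALYZE, BUFFERS) figures from the Ryzen machine above
# (RAID0 of 4x Samsung 990 Pro, direct I/O, 12 workers).
STATS = {
    "Q1 ASC":  {"read": 52801, "io_ms": 127.700, "exec_ms": 432.266},
    "Q1 DESC": {"read": 52801, "io_ms": 306.676, "exec_ms": 769.246},
    "Q2 ASC":  {"read": 52801, "io_ms": 127.610, "exec_ms": 1047.333},
    "Q2 DESC": {"read": 52801, "io_ms": 157.667, "exec_ms": 1140.286},
}


def io_us_per_block(run: str) -> float:
    """Reported I/O wait divided by blocks read, in microseconds (an average)."""
    s = STATS[run]
    return s["io_ms"] * 1000.0 / s["read"]


def desc_over_asc(query: str, metric: str) -> float:
    """How many times larger `metric` is for the DESC run than the ASC run."""
    return STATS[f"{query} DESC"][metric] / STATS[f"{query} ASC"][metric]


if __name__ == "__main__":
    for q in ("Q1", "Q2"):
        print(f"{q}: I/O x{desc_over_asc(q, 'io_ms'):.2f}, "
              f"execution x{desc_over_asc(q, 'exec_ms'):.2f}, "
              f"{io_us_per_block(q + ' ASC'):.2f} vs "
              f"{io_us_per_block(q + ' DESC'):.2f} us/block")
```

For Q1 the DESC run spends roughly 2.40x as much time in reported I/O (about 2.42 vs 5.81 us per block read) and takes about 1.78x as long overall; for Q2 the gap shrinks to about 1.24x and 1.09x, even though both directions read the same 52801 blocks.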
> What if I use a ramdisk for this? That'll be much faster, no matter
> the scan order. Should I expect this step to make the effect with
> duplicates being produced by read_stream_look_ahead to just go away,
> regardless of the scan direction in use?
>
How's that different from just running with buffered I/O and not
dropping the page cache?
regards
--
Tomas Vondra