| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-03-01 15:03:36 |
| Message-ID: | c96ba898-02fb-4756-a1c7-0ddb08159804@vondra.me |
| Lists: | pgsql-hackers |
Hi,
I've decided to run a couple tests, trying to reproduce some of the
behaviors described in your (Felipe's) messages.
I'm not trying to redo the tests exactly, because (a) I don't have an M1
machine, and (b) there's not enough detail about the hardware and
configuration to actually redo it properly.
I've focused on quantifying the impact of a couple things mentioned in
the previous message:
1) the distance limit
2) the profiling instrumentation
3) concurrency (multiple backends doing I/O)
I wrote a couple scripts to run two benchmarks, one focusing on (1) and
(2), and the second one focusing on (3).
Both were run on four builds:
1) master
2) patched (index prefetch v11)
3) patched-limit (patched + distance limit)
4) patched-limit-instrument (patched-limit + instrumentation)
The scripts initialize an instance, vary a couple important parameters
(shared buffers, io_method, direct I/O, ...) and run index scans on a
table with either sequential or random data.
I'm attaching the full scripts, raw results, and PDFs with a nicer
version of the results.
single-client test (single-client.tgz)
--------------------------------------
The test varies the following parameters:
* buffered or direct I/O
* io_method = (worker | io_uring)
* shared_buffers = (128MB | 8GB)
* enable_indexscan_prefetch = (on | off)
* indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
* sequential / random data (1M rows, 550MB, ~15 rows per page)
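The parameter grid above is just a cross product of the listed values. A minimal sketch of enumerating it (the parameter names follow the list above; how each combination gets applied to the instance is omitted):

```python
from itertools import product

# Parameter grid from the single-client benchmark above.
io_modes = ["buffered", "direct"]
io_methods = ["worker", "io_uring"]
shared_buffers = ["128MB", "8GB"]
prefetch = ["on", "off"]
distances = [0, 1, 4, 16, 64, 128]
datasets = ["sequential", "random"]

combos = list(product(io_modes, io_methods, shared_buffers,
                      prefetch, distances, datasets))
print(len(combos))  # 2 * 2 * 2 * 2 * 6 * 2 = 192 combinations
```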
This was done on an old Xeon machine from ~2016, with a WD Ultrastar DC
SN640 960GB NVMe SSD.
The single-client.pdf shows the timings for different combinations of
parameters, branches and distance limit values. There's also a table
with timing relative to master (100% means the same as master, green =
good, red = bad).
There are literally only 4 cases where prefetching does worse than
master, and those are for random data with distance limit 1. I claim
this is irrelevant, because limit=1 effectively disables prefetching
while still paying the full cost (all 4 are for io_method=worker, where
the signal overhead can be high, so it's not a surprise).
We ramp up the distance exactly for this reason; that's the solution to
this overhead problem. I refuse to consider these regressions with
limit=1 a problem. It's a bit like buying a race horse, breaking its
leg, and then complaining it's not running very fast.
The overhead of the instrumentation seems relatively small, probably
within 5% or so. That's a bit less than I expected, but I still don't
understand what this is meant to tell us. It's measuring wall time, and
it's no surprise that in an I/O-bound workload most of the time is spent
in functions doing (and waiting for) I/O, like read_stream_next_buffer.
But it does not give any indication *why*.
multi-client test (multi-client.tgz)
------------------------------------
The test varies the following parameters:
* buffered or direct I/O
* io_method = (worker | io_uring)
* io_workers = (12 | 32)
* shared_buffers = (128MB | 8GB)
* enable_indexscan_prefetch = (on | off)
* indexscan_prefetch_distance = (0, 1, 4, 16, 64, 128)
* sequential / random data (1M rows, 550MB, ~15 rows per page)
* number of parallel workers (1, 2, 4, 8)
This was done on a Ryzen 9 machine from ~2023, with 4x Samsung 990 PRO
1TB drives in RAID0.
The test prepares a separate table for each worker, and then runs the
index scans concurrently (and "syncs" the workers to start at the same
time). It measures the duration, and we can compare it to the timing
from master (without prefetching).
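The "sync" of the workers can be done with a simple barrier, so all connections start their scans at the same moment and the measured durations overlap. A rough sketch using Python threads (the actual scripts may do this differently; the connection and scan steps are left as comments):

```python
import threading
import time

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
durations = {}

def worker(i):
    # ... each worker would open its own connection and use its own table ...
    barrier.wait()              # block until all workers are ready
    start = time.monotonic()
    # ... run the index scan for this worker here ...
    durations[i] = time.monotonic() - start

threads = [threading.Thread(target=worker, args=(i,))
           for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(durations))  # 4
```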
The multi-client-full.pdf has detailed results for all parameters, but
as I said I don't think the distance limit (particularly for limit 1) is
interesting.
The multi-client-simple.pdf shows only results for limit=0 (i.e. without
limit), and is hopefully easier to understand. The first table shows
timings for each combination, the second table shows timing relative to
master (for the same number of workers etc.).
The results are pretty positive. For random data (which is about the
worst case for I/O), it's consistently faster than master. Yes, the
gains with 8 workers are not as significant as with 1 worker. For
example, it may look like this:
| | master | prefetch | relative |
|---|---|---|---|
| 1 worker | 2960 | 1898 | 64% |
| 8 workers | 5585 | 5361 | 96% |
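The relative column is simply the prefetch timing divided by the master timing, same as in the PDF tables:

```python
# Relative timing = prefetch / master, using the numbers above.
print(round(1898 / 2960 * 100))  # 64 (% of master, 1 worker)
print(round(5361 / 5585 * 100))  # 96 (% of master, 8 workers)
```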
But that's not a huge surprise. The storage has a limited throughput,
and at some point it gets saturated. Whether it's by prefetching, or by
having multiple workers does not matter.
For sequential data (which is what you did in your examples) it's much
simpler. For buffered I/O there's not much benefit, because page cache
read-ahead has mostly the same effect, while for direct I/O there's a
nice consistent speedup.
This all seems perfectly fine to me. The bad behavior would be if the
prefetching gets slower than master, because that would be a regression
affecting users. But that happens only in 4 cells in the table. My guess
is it hits some limit on the number of signals the system can process.
The random data set is not great for this, it's worse with more workers,
and the 128MB shared buffers make that even worse. This is a bit of a
perfect storm, and it's already there - bitmap scans can hit that too,
AFAICS.
(But I'm speculating, I haven't investigated this in detail yet.)
Moreover, io_uring does not have this issue. Which is another indication
it's something about the signal overhead.
In any case, these results clearly show prefetching can be a huge
improvement even in environments with concurrent activity, etc.
If you see something different on the Mac, you need to investigate why.
It could be something in the OS, or maybe it's a hardware-specific thing
(consumer SSDs can choke on too many requests). Hard to say. I don't
even know what kind of M1 machine you have, what SSD, etc.
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| single-client.pdf | application/pdf | 77.5 KB |
| multi-client-simple.pdf | application/pdf | 68.8 KB |
| multi-client-full.pdf | application/pdf | 92.2 KB |
| single-client.tgz | application/x-compressed-tar | 1.6 KB |
| multi-client.tgz | application/x-compressed-tar | 1.9 KB |
| single-client.csv.gz | application/gzip | 14.4 KB |
| multi-client.csv.gz | application/gzip | 151.6 KB |