EXPLAIN: showing ReadStream / prefetch stats

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-03-15 19:49:17
Message-ID: a177a6dd-240b-455a-8f25-aca0b1c08c6e@vondra.me
Lists: pgsql-hackers

Hi,

Here's a patch adding prefetch/read-ahead info about IO to EXPLAIN. This
was initially developed as part of the index prefetch patches, but it
became clear it can be useful for other nodes using ReadStream. That
includes seqscan and bitmap heap scans, and (hopefully) will include
index scan in PG19.

The 0001 patch is a prerequisite, adjusting WaitReadBuffers to indicate
whether the buffer read had to wait for the I/O. 0002 is the part this
thread is really about.

In EXPLAIN, this new info is enabled by a new option "IO", which adds
two new lines to the output:

--------------------
EXPLAIN (ANALYZE, COSTS off, TIMING off, IO)
SELECT * FROM t WHERE a < 100000;

                   QUERY PLAN
-------------------------------------------------
 Seq Scan on t (actual rows=999996.00 loops=1)
   Filter: (a < 100000)
   Rows Removed by Filter: 4
   Prefetch: avg=262.629 max=271 capacity=272
   I/O: stalls=653 size=15.983 inprogress=15.934
   Buffers: shared read=55556
 Planning:
   Buffers: shared hit=50 read=23
 Planning Time: 7.358 ms
 Execution Time: 358.214 ms
(10 rows)
--------------------

The first line "Prefetch" tracks the look-ahead distance, i.e. how many
blocks ahead the ReadStream is requesting.

The second line "I/O" is about the I/O requests actually issued: how
many times we had to wait for a block when we got around to processing
it (stalls), the average size of a request (in BLCKSZ blocks), and the
average number of requests already in progress when a new one was
issued.

The above example shows we've been reading ~262 blocks ahead on average,
with a theoretical maximum of 272 (and at some point we actually read
271 blocks ahead). The average is pretty close to the maximum, so we've
been reading ~2MB ahead, which seems pretty good.

The I/O stats say we've been reading in ~16-block requests, i.e. 128kB
chunks (which aligns with io_combine_limit=16). And when issuing a new
I/O, there were ~16 requests in progress already (which aligns with the
maximum distance, because 16*17=272).

We could track much more information. The WIP patches tracked histograms
of various metrics (distances, sizes, ...), which was quite handy during
development. For regular EXPLAIN we chose only the most useful subset,
to keep it meaningful and cheap to collect, and also generic enough to
work for other table AM implementations.

Regarding the basic design, here are a couple of topics for review:

1) The information is collected by ReadStream, unconditionally.

I experimented with a flag to "request" collecting these stats (so that
it could be done only for EXPLAIN ANALYZE), but that turned out to be
pretty useless. The stats are super cheap, so the extra flag made no
meaningful difference, and passing the flag down to the scan/stream was
way too invasive. So all ReadStreams track it.

There's a ReadStreamInstrumentation defined in instrument_node.h, but
I'm not convinced that's the right place. Maybe it should be defined in
read_stream.h instead.

2) This is inherently a TAM implementation detail.

The ReadStream is "inside" the TAM, and some TAMs may not even use a
ReadStream to do I/O. So EXPLAIN can't just inspect the stream directly;
that would violate the layering. It needs to go through the TAM API.

So this introduces a new TAM callback, "scan_stats", which gets a scan
descriptor and extracts the stats from the stream (if any). Index
prefetching will need a second callback for IndexScanDesc.

It also introduces a "generic" TableScanStatsData struct, so that
EXPLAIN does not work with ReadStreamInstrumentation directly (because
the TAM might not use a ReadStream at all). Of course, for the heap AM
the mapping is 1:1.

For the heap AM the callback is pretty simple: it just collects the read
stream stats and "translates" them into a TableScanStats instance.

3) Adds shared instrumentation to SeqScan workers.

The patch also had to extend parallel SeqScan to collect per-worker
instrumentation, similarly to BitmapHeapScan. Until now this was not
needed, but it is required for tracking per-worker prefetch stats.

Here are a couple of questions I'm asking myself about this patch:

- Is the selected information really the minimal meaningful set of stats
we could track? Is it sufficiently clear what we're tracking, and/or
should we track something else?

- Is the TableScanStatsData generic enough? Can other TAMs (not using
ReadStream) use this to show meaningful stats?

- Is the separation between the TAM and the low-level instrumentation
clear enough? Or is the ReadStreamInstrumentation "leaking" somewhere?
For example, is it OK that it's in SeqScanInstrumentation?

But I'm sure there are other questions I haven't thought of.

regards

--
Tomas Vondra

Attachment Content-Type Size
v1-0001-bufmgr-Return-whether-WaitReadBuffers-needed-to-w.patch text/x-patch 2.8 KB
v1-0002-explain-show-prefetch-stats-in-EXPLAIN-ANALYZE-VE.patch text/x-patch 34.7 KB
