Re: EXPLAIN: showing ReadStream / prefetch stats

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-03-16 07:07:54
Message-ID: CAP53Pkx_x2WUtM40r6+GZMVCW5d2PE8KKYQc1=ArODn0-WYq+A@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,

On Sun, Mar 15, 2026 at 12:49 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Here's a patch adding prefetch/read-ahead info about IO to EXPLAIN. This
> was initially developed as part of the index prefetch patches, but it
> became clear it can be useful for other nodes using ReadStream. That
> includes seqscan and bitmap heap scans, and (hopefully) will include
> index scan in PG19.

I think surfacing this to users makes a lot of sense - read streams
are hard to understand in detail, and this will make it easier to tie
a query's I/O performance back to what happened in terms of
prefetching and stalls.

Big +1 on the concept of making this visible.

> In EXPLAIN, this new info is enabled by a new option "IO", which adds
> two new lines to the output:

I'm 50/50 on whether hiding this behind a new option really makes
sense - if it's cheap enough to always capture, why not always show it?

e.g. we could consider doing the same thing we did with BUFFERS
recently, which is to enable it by default. If someone finds that too
visually busy, they could still say IO OFF.

This also makes me wonder a bit what we should do about I/O
timings. Conceptually they'd now belong closer to IO than to BUFFERS.

>
> --------------------
> EXPLAIN (ANALYZE, COSTS off, TIMING off, IO)
> SELECT * FROM t WHERE a < 100000;
>
> QUERY PLAN
> -------------------------------------------------
> Seq Scan on t (actual rows=999996.00 loops=1)
> Filter: (a < 100000)
> Rows Removed by Filter: 4
> Prefetch: avg=262.629 max=271 capacity=272
> I/O: stalls=653 size=15.983 inprogress=15.934
> Buffers: shared read=55556
> Planning:
> Buffers: shared hit=50 read=23
> Planning Time: 7.358 ms
> Execution Time: 358.214 ms
> (10 rows)
> --------------------
>
> The first line "Prefetch" tracks the look-ahead distance, i.e. how many
> blocks ahead the ReadStream is requesting.
>
> The second line "I/O" is about the I/O requests actually issued - how
> many times we had to wait for the block (when we get to process it),
> average size of a request (in BLCKSZ blocks), and average number of
> in-progress requests.

I wonder if we could somehow consolidate this into one line for the
text format? (specifically, moving the prefetch information into "I/O"
at the end)
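
To make that concrete, a merged line could look something like this
(purely a mock-up using the numbers from your example, not output the
patch produces):

  I/O: stalls=653 size=15.983 inprogress=15.934 prefetch=262.629/272

i.e. folding the average look-ahead distance and the capacity into one
field at the end.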

I'm also not sure whether "max" is really that useful compared to
"capacity"?

> 1) The information is collected by ReadStream, unconditionally.
>
> I experimented with having a flag to "request" collecting these stats
> (so that it can be done only for EXPLAIN ANALYZE), but that turned out
> pretty useless. The stats are super cheap, so the extra flag did not
> make a meaningful difference. And passing the flag to the scan/stream
> was way too invasive. So all ReadStreams track it.

Makes sense. I did some testing with your patch against master and
couldn't measure any meaningful difference - it's plausible that this
is cheap enough to always do.

> 2) This is inherently a TAM implementation detail.
>
> ...
>
> 3) Adds shared instrumentation to SeqScan workers.
>
> The patch also had to improve a parallel SeqScan to collect per-worker
> instrumentation, similarly to a BitmapHeapScan. Until now this was not
> needed, but it's needed for tracking per-worker prefetch stats.

I feel like something is off about the complexity of having each node
type ferry this information back. e.g. when you implement support for
index prefetching, it'll require a bunch more changes. In my mind,
there is a reason we solved the related buffer-accounting problem with
the global pgBufferUsage counters instead of dealing with it on a
per-node basis. I really feel we should have a more generic way of
handling this.

I'm not saying that as a completely unbiased party, because I think
this would be a great fit for the stack-based instrumentation I've
been discussing with Andres and Zsolt over at [0]. Andres at least
expressed some potential interest in getting that into 19, though it's
definitely not a trivial patch set.

If we were to get stack-based instrumentation in, we could easily add
a new "IOUsage" to the Instrumentation struct, avoid any modification
to the TAM interface, and align it with how we treat BufferUsage and
WALUsage.

I've attached a prototype of what that could look like (apply the
other patch set first, v8, see commitfest entry [1] - also attached is
a preparatory refactoring that uses "Instrumentation" for parallel
query reporting, which avoids having individual structs there).
Performance is actually better for "EXPLAIN (ANALYZE, IO, BUFFERS OFF,
TIMING OFF)" with that patch set on a quick COUNT(*) test, but that's
mostly due to the overhead of ExecProcNode dispatching that I fixed in
the other patch set (v8/0006). Otherwise I think this would perform
similarly, thanks to the stack avoiding any "accum diff" logic.

It's also worth noting that this would make it trivial to output this
information for utility commands that have read stream support, or to
show aggregate statistics in pg_stat_statements etc.

Let me know if that's of interest at all - I'd be happy to pair up on
this to make it work.

Thanks,
Lukas

[0]: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg%40mail.gmail.com
[1]: https://commitfest.postgresql.org/patch/6023/

--
Lukas Fittl

Attachment Content-Type Size
nocfbot-0002-Use-Instrumentation-struct-for-parallel-wor.patch text/x-patch 29.2 KB
nocfbot-0001-bufmgr-Return-whether-WaitReadBuffers-neede.patch text/x-patch 2.8 KB
nocfbot-0003-instrumentation-Track-I-O-prefetch-info-and.patch text/x-patch 22.3 KB
