Re: EXPLAIN: showing ReadStream / prefetch stats

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Lukas Fittl <lukas(at)fittl(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-03-19 20:57:32
Message-ID: 2551c884-af06-4663-83d2-cd5688c3ae12@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Lukas,

Thanks for the review. I've been working on cleaning up the patch,
reordering the changes a bit. I ran into most of the issues you
mentioned, and the attached v4 should address most of them.

I got rid of the "old" approach entirely, and just do everything through
the scan descriptor. I reorganized the patch series to make more sense -
it's split into three pieces (0001 is a prerequisite patch, not really a
part of the changes):

* 0002 - Basic changes, adding the info to a single scan node that
already uses shared instrumentation (=BitmapHeapScan). It adds the
counting to read_stream, and printing to explain. It also adjusts all
the failing tests checking query plans (usually by disabling I/O).

* 0003 - Adds support for SeqScan (has to setup the instrumentation for
parallel workers, ...).

* 0004 - Adds support for TidRangeScan (has to setup the instrumentation
for parallel workers, ...).

I think it'd make sense to commit these parts separately.

The 0002 part could be made smaller by introducing a new read_stream
helper, so that we don't need to add a parameter to all the places
calling read_stream_begin_relation(). But that's a detail.

More comments inline.

On 3/19/26 19:11, Lukas Fittl wrote:
> On Wed, Mar 18, 2026 at 3:41 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> The 0003 also changes the EXPLAIN to enable IO by default, just like we
>> do for BUFFERS. It seems like a reasonable precedent to me.
>
> One side effect of that is that the tests now fail for me locally,
> because the specific values are system-dependent. Attached a patch
> (nocfbot-0002) that fixed that for me.
>

Right. I haven't addressed that in v3, but with v4 the tests should be
passing (after each patch).

> There is one detail maybe calling out specifically on JSON output:
> Currently Postgres always emits all fields in JSON output, even if
> they are zero. The code that you have in v3 skips the "I/O" group when
> the value is zero, which doesn't work well with how current regression
> tests are written. I'm definitely not a fan of the unnecessary
> verbosity of JSON EXPLAIN output, but I'd suggest we don't break with
> the tradition here, and instead always output the "I/O" group in
> non-text formats. Also attached a patch for that (nocfbot-0001).
>

Yeah, I noticed that too, but I wasn't sure what to do about it. The I/O
stats are conditional on (io_count > 0) because it calculates average
I/O size etc. If there are no I/Os, we could probably print 0.0.

Another thing I noticed in the non-text formats is that this is the only
case with explicit groups, which adds nesting. It looks weird, so I plan
to remove that, and will add some "prefix" to the items.

> Overall I think the abstraction here seems reasonable if we're
> primarily focused on getting the per-node instrumentation taken care
> of.
>

OK, good. I think passing the stats through a scan descriptor works OK.
I still have a strange feeling about it - we have the TAM interface and
yet we choose to pass data in other ways. It's better than global
variables, of course. But it's not the only piece of data returned
through the scan descriptor, so at least there's a precedent.

> That said, two thoughts on an example EXPLAIN output I just ran:
>
> 1) I do wonder if its a bit confusing that we propagate I/O timings up
> the EXPLAIN tree, but not the "I/O" information - I realize fixing
> that would be a bit involved though, e.g. we'd have to invent
> accumulation logic in explain.c. It'd also maybe make people thing
> this covers things like temporary file reads/etc.
>

I'm not sure it makes sense to combine this information from multiple
nodes. If you have multiple read streams with very different average I/O
size and/or prefetch distances, what does the "global average" mean?

I also don't think it's very useful - the information about distance
and/or I/O size is most useful when analyzing individual nodes. What
would it tell me for multiple nodes combined?

So I don't think it's something we need, and we can add it later.

> 2) The ordering of "I/O Timings" in relation to "I/O" feels off to me
> (since they're not next to each other) - maybe we should re-order I/O
> Timings to come before Buffers in show_buffer_usage to address that?
>

Yeah, maybe. I don't think we have any explicit order, but why not.

Thanks!

--
Tomas Vondra

Attachment Content-Type Size
v4-0001-bufmgr-Return-whether-WaitReadBuffers-needed-to-w.patch text/x-patch 2.8 KB
v4-0002-explain-show-prefetch-stats-in-EXPLAIN-ANALYZE.patch text/x-patch 69.7 KB
v4-0003-show-prefetch-stats-for-SeqScan.patch text/x-patch 37.6 KB
v4-0004-show-prefetch-stats-for-TidRangeScan.patch text/x-patch 9.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2026-03-19 21:30:06 Re: pg_get__*_ddl consolidation
Previous Message Zsolt Parragi 2026-03-19 20:48:27 Re: pg_waldump: support decoding of WAL inside tarfile