Re: EXPLAIN: showing ReadStream / prefetch stats

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Lukas Fittl <lukas(at)fittl(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-04-07 09:32:28
Message-ID: 6ce690b9-0ec4-4d9e-ae9b-95a5ca3f79b2@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4/7/26 10:00, Lukas Fittl wrote:
> On Mon, Apr 6, 2026 at 6:14 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> On 4/7/26 02:19, Melanie Plageman wrote:
>>> On Mon, Apr 6, 2026 at 4:39 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>>
>>>> On 4/6/26 18:50, Melanie Plageman wrote:
>>>> If you think you can get this pushed, I'll do my best to finalize the
>>>> instrumentation and SeqScan/TidRangeScan parts.
>>>
>>> I've pushed the first patch for index/index-only scans. Attached is
>>> the BHS fix that uses the new pattern. It needs at least one review
>>> before pushing. While I was polishing it, I realized I neglected to
>>> use add_size()/mul_size() in the index-only/index scan patches. So,
>>> 0002 is just a fix commit to do that. Feel free to push these if you
>>> think they're ready. Otherwise, I'll do so pending your review in my
>>> morning.
>>>
>>
>> Thanks!
>>
>> I'll take a look in my morning, and will consider pushing the changes so
>> that I can start pushing the parts adding the new instrumentation.
>
> FWIW, 0001 and 0002 in the v12 patch set also make sense to me from a
> code review just now.
>
>>
>> I've spent a fair amount of time looking at those parts today. Attached
>> is v12 (including the two parts you posted as v11)
>
> 0003 looks good, but commit message needs improvement / justification
> of the change (I assume you had already planned that).
>

Yes, the commit messages need improvements.

> 0004:
>
>> diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
>> index 5b8b521802e..a854c41e963 100644
>> --- a/doc/src/sgml/ref/explain.sgml
>> +++ b/doc/src/sgml/ref/explain.sgml
>> @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
>> TIMING [ <replaceable class="parameter">boolean</replaceable> ]
>> SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
>> MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
>> + IO [ <replaceable class="parameter">boolean</replaceable> ]
>> FORMAT { TEXT | XML | JSON | YAML }
>> </synopsis>
>> </refsynopsisdiv>
>> @@ -298,6 +299,17 @@ ROLLBACK;
>> </listitem>
>> </varlistentry>
>> + <varlistentry>
>> + <term><literal>IO</literal></term>
>> + <listitem>
>> + <para>
>> + Include information on I/O performed by each node.
>> + This parameter may only be used when <literal>ANALYZE</literal> is also
>> + enabled. It defaults to <literal>FALSE</literal>.
>> + </para>
>> + </listitem>
>> + </varlistentry>
> +
>
> Assuming my understanding of where this works is correct, this needs
> to be qualified I think in terms of which nodes get I/O activity
> shown. Otherwise I as the user will expect e.g. an Index Scan to show
> I/O statistics. Additionally I think this should clearly say "read
> I/O", since writes I/O information wouldn't show for any nodes.
>

Yeah, I was wondering about this too. The "each node" seems to suggest
all nodes get these stats, but then only some scan nodes actually do.
I'm a bit concerned about listing which exact nodes get the stats,
though. AFAICS the explain docs don't do that for any other option.

>> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
>> index f151f21f9b3..863a9dd0f0d 100644
>> --- a/src/backend/commands/explain.c
>> +++ b/src/backend/commands/explain.c
>> @@ -3984,6 +3990,112 @@ show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
>> }
>> }
>> ...
>> + /* prefetch I/O info (only if there were actual I/Os) */
>> + if (stats->io_count > 0)
>> + {
>> + ExplainIndentText(es);
>> + appendStringInfo(es->str, "I/O: count=%" PRIu64 " waits=%" PRIu64
>> + " size=%.2f inprogress=%.2f\n",
>> + stats->io_count, stats->wait_count,
>> + (stats->io_nblocks * 1.0 / stats->io_count),
>> + (stats->io_in_progress * 1.0 / stats->io_count));
>> + }
>
> I've briefly pondered again whether "I/O:" is a bit too broad here in
> EXPLAIN text output, i.e. whether it should qualified it somehow, e.g.
> "I/O Stream", since its just the read stream I/O activity.
>

Hmmm, I don't think we want to mention "Stream" here, that seems more
like an implementation detail to me. Maybe "I/O Reads" or "Read I/O"
would work, though. But OTOH there were discussions about maybe using
the "I/O" both for reads and writes ...

> Right now it looks like this:
>
> Seq Scan on xxx (cost=0.00..7.14 rows=114 width=483) (actual
> time=0.140..0.222 rows=114.00 loops=1)
> Prefetch: avg=1.67 max=3 capacity=94
> I/O: count=3 waits=3 size=2.00 inprogress=1.00
> Buffers: shared read=6
> I/O Timings: shared read=0.158
>
> What if we did this instead:
>
> Seq Scan on xxx (cost=0.00..7.14 rows=114 width=483) (actual
> time=0.140..0.222 rows=114.00 loops=1)
> Prefetch: avg=1.67 max=3 capacity=94
> I/O Stream: count=3 waits=3 size=2.00 inprogress=1.00
> Buffers: shared read=6
> I/O Timings: shared read=0.158
>
> But I'm 50/50 whether that's a big enough problem to debate right now.
> Especially with index prefetching out of the picture this is actually
> less of a problem, since this should only show on scans where the I/O
> activity is mainly (fully?) relying on read streams. If we had index
> prefetching, my understanding is we would have had "I/O" show the read
> stream activity (i.e. table access), but not the index access,
> correct?
>

I think it's OK the way it is. It's possible someone will come up with a
better scheme later, in which case we can adjust this.

> Btw, in that same test, when I re-run I get Prefetch but not I/O -- why is that?
>
> Seq Scan on organizations (cost=0.00..7.42 rows=1 width=483) (actual
> time=0.034..0.042 rows=1.00 loops=1)
> Filter: (slug = 'pganalyze'::text)
> Rows Removed by Filter: 113
> Prefetch: avg=1.00 max=1 capacity=94
> Buffers: shared hit=6
>
> (if its all buffers hit, why are there any prefetches at all?)
>
>> with two or three
>> bigger changes compared to earlier versions:
>>

In the text output this is expected / intentional. The prefetch is based
on number of buffers pulled from the stream - and we use the stream even
with everything in shared buffers. We want to show the prefetch distance
etc. even in this case. But we'll do no I/O in that case, so we don't
show that line.

In the non-text formats we include the fields always.

>> 1) default OFF
>>
>> After thinking about it a bit more, I decided to change the default to
>> OFF. On the one hand I agree it's somewhat similar to BUFFERS, and that
>> option is ON by default now. But on the other hand, we must not clutter
>> EXPLAIN output with too much information, and I'm not convinced IO is
>> worth it. So I changed to OFF by default. That also makes the patches
>> smaller, due to not having to adjust that many tests.
>
> Seems fair in terms of what to do now.
>

OK

>>
>> 2) auto_explain
>>
>> It also occurred to me we should add a matching option to auto_explain,
>> similarly to log_buffers. 0005 does that.
>
> Agreed.
>

OK

>>
>> 3) INSTRUMENT_IO
>>
>> The auto_explain bit also implies we need a new INSTRUMENT_IO constant,
>> to handle this just like BUFFERS. Which also means we can actually
>> collect the stats only when the IO option is specified (instead of for
>> all ANALYZE runs). Which is nice.
>
> FWIW, it'd be the only kind that's not populated by instrument.c (and
> isn't a part of Instrumentation), but I don't see any immediate
> practical issues with having the extra flag.
>

I believe INSTRUMENT_ROWS is not populated there either, so it's not the
only flag not populated there. I've considered adding a need_io field,
but that would be useless / unused.

> It does mean that INSTRUMENT_ALL will turn on INSTRUMENT_IO though for
> the Instrumentation, which will always be a no-op. In practice the
> only users of that are the InstrAlloc calls in
> pg_stat_statements/auto_explain, which I have a patch for to refactor
> anyway (over on the stack-based instrumentation thread, which to be
> clear is PG20 material at this point).
>
> We could probably just drop INSTRUMENT_ALL in the next cycle to avoid
> confusion. For context, that use of INSTRUMENT_ALL won't affect
> es_instrument, just how queryDesc->totaltime is initialized, and what
> it captures at a summary level - and it won't know how to capture I/O
> stats since they are not part of Instrumentation.
>

It is a bit confusing, yes.

> I would suggest noting the fact that its a no-op for Instrumentation
> itself somewhere in code at least, e.g. where INSTRUMENT_IO is defined
> on InstrumentOption.

I'll consider doing that, but I'm not sure which place would be best.

>
> 0005 looks good.
>
> I've skimmed 0006 and 0007 and it makes sense to me why they exist
> (and we should add them), but I haven't reviewed the code for
> correctness.
>

Thanks!

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Antonin Houska 2026-04-07 09:35:50 Re: Adding REPACK [concurrently]
Previous Message Alberto Piai 2026-04-07 09:30:02 Fix ALTER COLUMN ... DROP EXPRESSSION with subpartitions