Quick Links

Re: EXPLAIN: showing ReadStream / prefetch stats

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Lukas Fittl <lukas(at)fittl(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: EXPLAIN: showing ReadStream / prefetch stats
Date:	2026-03-30 18:21:29
Message-ID:	456e7294-8e05-4b44-a7ce-f639f62af69c@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 3/30/26 19:28, Andres Freund wrote:
>
> Hi,
>
>> Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
>> wait
>>
>> In a subsequent commit read_stream.c will use this as an input to the read
>> ahead distance.
>
> Note that this actually depends on a prior commit to work for io_uring...
>

Good point.

>
>> From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas(at)vondra(dot)me>
>> Date: Thu, 19 Mar 2026 22:25:09 +0100
>> Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)
>>
>> This adds details about AIO / prefetch for executor nodes using a
>> ReadStream. Right now this applies only to BitmapHeapScan, because
>> that's the only scan node using a ReadStream and collecting
>> instrumentation from workers.
>
> I don't understand why that means it should be done as part of this commit,
> whereas seqscans shouldn't?
>

Are you suggesting the commit adds support for all those scans (BHS,
SeqScan and TidRangeScan) or none of them? To me it seems better to have
at least some scan because of testing. But SeqScan/TidRangeScan don't
have the instrumentation infrastructure for parallel queries, and I
don't want to do that in the main patch - it seems rather unrelated. And
I also don't want to add it before the main patch.

>
>> The stats are collected by the ReadStream unconditionally, i.e. without
>> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
>> integer counters), it's not worth gating this by a flag.
>
> I think this is not true anymore.
>

Yes, that's a stale comment. Will fix.

>
>> --- a/doc/src/sgml/ref/explain.sgml
>> +++ b/doc/src/sgml/ref/explain.sgml
>> @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
>> TIMING [ <replaceable class="parameter">boolean</replaceable> ]
>> SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
>> MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
>> + IO [ <replaceable class="parameter">boolean</replaceable> ]
>> FORMAT { TEXT | XML | JSON | YAML }
>> </synopsis>
>> </refsynopsisdiv>
>> @@ -298,6 +299,17 @@ ROLLBACK;
>> </listitem>
>> </varlistentry>
>>
>> + <varlistentry>
>> + <term><literal>IO</literal></term>
>> + <listitem>
>> + <para>
>> + Include information on I/O performed by each node.
>> + This parameter may only be used when <literal>ANALYZE</literal> is also
>> + enabled. It defaults to <literal>TRUE</literal>.
>> + </para>
>> + </listitem>
>> + </varlistentry>
>> +
>> <varlistentry>
>> <term><literal>FORMAT</literal></term>
>> <listitem>
>
> I wonder if it's sane to default this to true. The amount of adjustments in
> the regression tests required in the later commits makes me a bit wary. But I
> guess just going around and adding a few dozen OPTION OFF is what we've lately
> been doing, so perhaps it's ok.
>

I'm not sure either. I initially added it as default OFF but it's true
it's similar to "buffers".

>
>
>> @@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>> scan->rs_base.rs_nkeys = nkeys;
>> scan->rs_base.rs_flags = flags;
>> scan->rs_base.rs_parallel = parallel_scan;
>> + scan->rs_base.rs_instrument = NULL;
>> scan->rs_strategy = NULL; /* set in initscan */
>> scan->rs_cbuf = InvalidBuffer;
>>
>> @@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>>
>> scan->rs_read_stream = NULL;
>>
>> + /* allocate instrumentation */
>> + if (flags & SO_SCAN_INSTRUMENT)
>> + scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
>> + else
>> + scan->rs_base.rs_instrument = NULL;
>> +
>> /*
>> * Set up a read stream for sequential scans and TID range scans. This
>> * should be done after initscan() because initscan() allocates the
>> @@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>> MAIN_FORKNUM,
>> cb,
>> scan,
>> - 0);
>> + 0,
>> + &scan->rs_base.rs_instrument->io);
>> }
>> else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
>> {
>> @@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>> MAIN_FORKNUM,
>> bitmapheap_stream_read_next,
>> scan,
>> - sizeof(TBMIterateResult));
>> + sizeof(TBMIterateResult),
>> + &scan->rs_base.rs_instrument->io);
>> }
>>
>> scan->rs_vmbuffer = InvalidBuffer;
>
> It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
> if that works because you then just compute the address of that. I think
> you'll end up passing the offset of io in TableScanInstrumentation this way.
>

Yes, I have already noticed / fixed this in my development branch.

>
>
>> +static void
>> +print_io_usage(ExplainState *es, IOStats *stats)
>> +{
>> + /* don't print stats if there's nothing to report */
>> + if (stats->prefetch_count > 0)
>> + {
>> + if (es->format == EXPLAIN_FORMAT_TEXT)
>> + {
>> + /* prefetch distance info */
>> + ExplainIndentText(es);
>> + appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
>> + (stats->distance_sum * 1.0 / stats->prefetch_count),
>> + stats->distance_max,
>> + stats->distance_capacity);
>> + appendStringInfoChar(es->str, '\n');
>> +
>> + /* prefetch I/O info (only if there were actual I/Os) */
>> + if (stats->io_count > 0)
>> + {
>> + ExplainIndentText(es);
>> + appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
>> + stats->stall_count);
>> +
>> + appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
>> + (stats->io_nblocks * 1.0 / stats->io_count),
>> + (stats->io_in_progress * 1.0 / stats->io_count));
>> +
>> + appendStringInfoChar(es->str, '\n');
>> + }
>
> Why have these separate appendStringInfoChar's if you just could just include
> the \n in the prior appendStringInfo?
>

Not sure, IIRC it went through multiple versions and at some point the
lines were far too long (esp. with histograms etc.).

> Separately, is %.3f really the right thing? When is that degree of precision
> useful here?
>

No idea. I think %.1f would be enough. We use %.2f in a couple places,
but that's not a huge difference.

>
>> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
>> index 7e2c1b7467b..d181fd7b8ff 100644
>> --- a/src/backend/executor/nodeBitmapHeapscan.c
>> +++ b/src/backend/executor/nodeBitmapHeapscan.c
>> @@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
>> TBMIterator tbmiterator = {0};
>> ParallelBitmapHeapState *pstate = node->pstate;
>> dsa_area *dsa = node->ss.ps.state->es_query_dsa;
>> + EState *estate = node->ss.ps.state;
>>
>> if (!pstate)
>> {
>> @@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
>> {
>> node->ss.ss_currentScanDesc =
>> table_beginscan_bm(node->ss.ss_currentRelation,
>> - 0,
>> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
>> node->ss.ps.state->es_snapshot,
>> 0,
>> NULL);
>> @@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
>> */
>> si->exact_pages += node->stats.exact_pages;
>> si->lossy_pages += node->stats.lossy_pages;
>> +
>> + /* collect prefetch info for this process from the read_stream */
>> + ACCUMULATE_IO_STATS(&si->stats.io, &node->ss.ss_currentScanDesc->rs_instrument->io);
>> }
>>
>> /*
>
> I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
> inline?
>

Both a macro and static inline would work for me.

>
>> +/*
>> + * Update stream stats with current pinned buffer depth.
>> + *
>> + * Called once per buffer returned to the consumer in read_stream_next_buffer().
>> + * Records the number of pinned buffers at that moment, so we can compute the
>> + * average look-ahead depth.
>> + */
>> +static inline void
>> +read_stream_update_stats_prefetch(ReadStream *stream)
>> +{
>> + IOStats *stats = stream->stats;
>> +
>> + if (stats == NULL)
>> + return;
>> +
>> + stats->prefetch_count++;
>> + stats->distance_sum += stream->pinned_buffers;
>> + if (stream->pinned_buffers > stats->distance_max)
>> + stats->distance_max = stream->pinned_buffers;
>> +}
>> +
>> +/*
>> + * Update stream stats about size of I/O requests.
>> + *
>> + * We count the number of I/O requests, size of requests (counted in blocks)
>> + * and number of in-progress I/Os.
>> + */
>> +static inline void
>> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
>> +{
>> + IOStats *stats = stream->stats;
>> +
>> + if (stats == NULL)
>> + return;
>> +
>> + stats->io_count++;
>> + stats->io_nblocks += nblocks;
>> + stats->io_in_progress += in_progress;
>> +}
>> +
>> +static inline void
>> +read_stream_update_stats_stall(ReadStream *stream)
>> +{
>
> Wonder if we should rename "stall" to "wait", seems like it might be easier to
> understand for a more general audience.
>

I agree "wait" sounds clearer.

>
>> /* ----------------------------------------------------------------
>> @@ -404,9 +471,50 @@ void
>> ExecSeqScanInitializeWorker(SeqScanState *node,
>> ParallelWorkerContext *pwcxt)
>> {
>> + EState *estate = node->ss.ps.state;
>> ParallelTableScanDesc pscan;
>> + char *ptr;
>> + Size size;
>>
>> pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
>> node->ss.ss_currentScanDesc =
>> - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
>> + table_beginscan_parallel(node->ss.ss_currentRelation,
>> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
>> + pscan);
>> +
>> + /*
>> + * Workers don't get the pscan_len value in scan descriptor, so use the
>> + * TAM callback again. The result has to match the earlier result in
>> + * ExecSeqScanEstimate.
>> + */
>> + size = table_parallelscan_estimate(node->ss.ss_currentRelation,
>> + estate->es_snapshot);
>> +
>
> That seems quite grotty...
>

Yes. But what's a better solution?

>
>> + ptr = (char *) pscan;
>> + ptr += MAXALIGN(size);
>> +
>> + if (node->ss.ps.instrument)
>> + node->sinstrument = (SharedSeqScanInstrumentation *) ptr;
>
> Seems like the appropriate instrumentation pointers should be computed by the
> leader, not in the workers.
>

This is how BitmapHeapScan does it. And how would the leader do that,
when the node is private in the worker?

>
>
>> diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
>> index 97cb7206676..3fa6d838545 100644
>> --- a/src/test/regress/expected/explain.out
>> +++ b/src/test/regress/expected/explain.out
>> @@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io off, format text) select *
>> Execution Time: N.N ms
>> (3 rows)
>>
>> -select explain_filter('explain (analyze, buffers, io off, format xml) select * from int8_tbl i8');
>> - explain_filter
>> ---------------------------------------------------------
>> - <explain xmlns="http://www.postgresql.org/N/explain"> +
> ...
>> + <explain xmlns="http://www.postgresql.org/N/explain"> +
> ...
>
> Seems these tests should really use unaligned output, to avoid this stupid
> "every line changes despite not really anything changing" type of diff.
>

Probably. I can adjust that in an initial patch.

regards

--
Tomas Vondra

In response to

Re: EXPLAIN: showing ReadStream / prefetch stats at 2026-03-30 17:28:41 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Roberto Mello	2026-03-30 18:32:17	Re: pg_publication_tables: return NULL attnames when no column list is specified
Previous Message	Sami Imseih	2026-03-30 18:16:07	Re: Add pg_stat_autovacuum_priority