Re: EXPLAIN: showing ReadStream / prefetch stats

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Lukas Fittl <lukas(at)fittl(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-03-30 17:28:41
Message-ID: 3tcdkacocr72wx6qscztvtkfm4fuivs24vbwrocbewwtnlq5jx@dr7vgaur7ugw
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi,

> Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
> wait
>
> In a subsequent commit read_stream.c will use this as an input to the read
> ahead distance.

Note that this actually depends on a prior commit to work for io_uring...

> From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas(at)vondra(dot)me>
> Date: Thu, 19 Mar 2026 22:25:09 +0100
> Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)
>
> This adds details about AIO / prefetch for executor nodes using a
> ReadStream. Right now this applies only to BitmapHeapScan, because
> that's the only scan node using a ReadStream and collecting
> instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?

> The stats are collected by the ReadStream unconditionally, i.e. without
> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
> integer counters), it's not worth gating this by a flag.

I think this is not true anymore.

> --- a/doc/src/sgml/ref/explain.sgml
> +++ b/doc/src/sgml/ref/explain.sgml
> @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
> TIMING [ <replaceable class="parameter">boolean</replaceable> ]
> SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
> MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
> + IO [ <replaceable class="parameter">boolean</replaceable> ]
> FORMAT { TEXT | XML | JSON | YAML }
> </synopsis>
> </refsynopsisdiv>
> @@ -298,6 +299,17 @@ ROLLBACK;
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><literal>IO</literal></term>
> + <listitem>
> + <para>
> + Include information on I/O performed by each node.
> + This parameter may only be used when <literal>ANALYZE</literal> is also
> + enabled. It defaults to <literal>TRUE</literal>.
> + </para>
> + </listitem>
> + </varlistentry>
> +
> <varlistentry>
> <term><literal>FORMAT</literal></term>
> <listitem>

I wonder if it's sane to default this to true. The amount of adjustments in
the regression tests required in the later commits makes me a bit wary. But I
guess just going around and adding a few dozen OPTION OFF is what we've lately
been doing, so perhaps it's ok.

> @@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> scan->rs_base.rs_nkeys = nkeys;
> scan->rs_base.rs_flags = flags;
> scan->rs_base.rs_parallel = parallel_scan;
> + scan->rs_base.rs_instrument = NULL;
> scan->rs_strategy = NULL; /* set in initscan */
> scan->rs_cbuf = InvalidBuffer;
>
> @@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>
> scan->rs_read_stream = NULL;
>
> + /* allocate instrumentation */
> + if (flags & SO_SCAN_INSTRUMENT)
> + scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
> + else
> + scan->rs_base.rs_instrument = NULL;
> +
> /*
> * Set up a read stream for sequential scans and TID range scans. This
> * should be done after initscan() because initscan() allocates the
> @@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> MAIN_FORKNUM,
> cb,
> scan,
> - 0);
> + 0,
> + &scan->rs_base.rs_instrument->io);
> }
> else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
> {
> @@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> MAIN_FORKNUM,
> bitmapheap_stream_read_next,
> scan,
> - sizeof(TBMIterateResult));
> + sizeof(TBMIterateResult),
> + &scan->rs_base.rs_instrument->io);
> }
>
> scan->rs_vmbuffer = InvalidBuffer;

It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
if that works because you then just compute the address of that. I think
you'll end up passing the offset of io in TableScanInstrumentation this way.

> +static void
> +print_io_usage(ExplainState *es, IOStats *stats)
> +{
> + /* don't print stats if there's nothing to report */
> + if (stats->prefetch_count > 0)
> + {
> + if (es->format == EXPLAIN_FORMAT_TEXT)
> + {
> + /* prefetch distance info */
> + ExplainIndentText(es);
> + appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
> + (stats->distance_sum * 1.0 / stats->prefetch_count),
> + stats->distance_max,
> + stats->distance_capacity);
> + appendStringInfoChar(es->str, '\n');
> +
> + /* prefetch I/O info (only if there were actual I/Os) */
> + if (stats->io_count > 0)
> + {
> + ExplainIndentText(es);
> + appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
> + stats->stall_count);
> +
> + appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
> + (stats->io_nblocks * 1.0 / stats->io_count),
> + (stats->io_in_progress * 1.0 / stats->io_count));
> +
> + appendStringInfoChar(es->str, '\n');
> + }

Why have these separate appendStringInfoChar's if you just could just include
the \n in the prior appendStringInfo?

Separately, is %.3f really the right thing? When is that degree of precision
useful here?

> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 7e2c1b7467b..d181fd7b8ff 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
> TBMIterator tbmiterator = {0};
> ParallelBitmapHeapState *pstate = node->pstate;
> dsa_area *dsa = node->ss.ps.state->es_query_dsa;
> + EState *estate = node->ss.ps.state;
>
> if (!pstate)
> {
> @@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
> {
> node->ss.ss_currentScanDesc =
> table_beginscan_bm(node->ss.ss_currentRelation,
> - 0,
> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
> node->ss.ps.state->es_snapshot,
> 0,
> NULL);
> @@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
> */
> si->exact_pages += node->stats.exact_pages;
> si->lossy_pages += node->stats.lossy_pages;
> +
> + /* collect prefetch info for this process from the read_stream */
> + ACCUMULATE_IO_STATS(&si->stats.io, &node->ss.ss_currentScanDesc->rs_instrument->io);
> }
>
> /*

I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
inline?

> +/*
> + * Update stream stats with current pinned buffer depth.
> + *
> + * Called once per buffer returned to the consumer in read_stream_next_buffer().
> + * Records the number of pinned buffers at that moment, so we can compute the
> + * average look-ahead depth.
> + */
> +static inline void
> +read_stream_update_stats_prefetch(ReadStream *stream)
> +{
> + IOStats *stats = stream->stats;
> +
> + if (stats == NULL)
> + return;
> +
> + stats->prefetch_count++;
> + stats->distance_sum += stream->pinned_buffers;
> + if (stream->pinned_buffers > stats->distance_max)
> + stats->distance_max = stream->pinned_buffers;
> +}
> +
> +/*
> + * Update stream stats about size of I/O requests.
> + *
> + * We count the number of I/O requests, size of requests (counted in blocks)
> + * and number of in-progress I/Os.
> + */
> +static inline void
> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
> +{
> + IOStats *stats = stream->stats;
> +
> + if (stats == NULL)
> + return;
> +
> + stats->io_count++;
> + stats->io_nblocks += nblocks;
> + stats->io_in_progress += in_progress;
> +}
> +
> +static inline void
> +read_stream_update_stats_stall(ReadStream *stream)
> +{

Wonder if we should rename "stall" to "wait", seems like it might be easier to
understand for a more general audience.

> /* ----------------------------------------------------------------
> @@ -404,9 +471,50 @@ void
> ExecSeqScanInitializeWorker(SeqScanState *node,
> ParallelWorkerContext *pwcxt)
> {
> + EState *estate = node->ss.ps.state;
> ParallelTableScanDesc pscan;
> + char *ptr;
> + Size size;
>
> pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
> node->ss.ss_currentScanDesc =
> - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
> + table_beginscan_parallel(node->ss.ss_currentRelation,
> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
> + pscan);
> +
> + /*
> + * Workers don't get the pscan_len value in scan descriptor, so use the
> + * TAM callback again. The result has to match the earlier result in
> + * ExecSeqScanEstimate.
> + */
> + size = table_parallelscan_estimate(node->ss.ss_currentRelation,
> + estate->es_snapshot);
> +

That seems quite grotty...

> + ptr = (char *) pscan;
> + ptr += MAXALIGN(size);
> +
> + if (node->ss.ps.instrument)
> + node->sinstrument = (SharedSeqScanInstrumentation *) ptr;

Seems like the appropriate instrumentation pointers should be computed by the
leader, not in the workers.

> diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
> index 97cb7206676..3fa6d838545 100644
> --- a/src/test/regress/expected/explain.out
> +++ b/src/test/regress/expected/explain.out
> @@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io off, format text) select *
> Execution Time: N.N ms
> (3 rows)
>
> -select explain_filter('explain (analyze, buffers, io off, format xml) select * from int8_tbl i8');
> - explain_filter
> ---------------------------------------------------------
> - <explain xmlns="http://www.postgresql.org/N/explain"> +
...
> + <explain xmlns="http://www.postgresql.org/N/explain"> +
...

Seems these tests should really use unaligned output, to avoid this stupid
"every line changes despite not really anything changing" type of diff.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Silitskiy 2026-03-30 17:31:04 Re: Exit walsender before confirming remote flush in logical replication
Previous Message Greg Burd 2026-03-30 17:17:38 Re: clang bug affecting greenfly