Quick Links

Re: EXPLAIN: showing ReadStream / prefetch stats

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Lukas Fittl <lukas(at)fittl(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: EXPLAIN: showing ReadStream / prefetch stats
Date:	2026-03-30 17:28:41
Message-ID:	3tcdkacocr72wx6qscztvtkfm4fuivs24vbwrocbewwtnlq5jx@dr7vgaur7ugw
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

> Subject: [PATCH v5 1/6] bufmgr: Return whether WaitReadBuffers() needed to
> wait
>
> In a subsequent commit read_stream.c will use this as an input to the read
> ahead distance.

Note that this actually depends on a prior commit to work for io_uring...

> From 410eaaebe7b814ac9f44c080e153f4ec1d6d6b86 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas(at)vondra(dot)me>
> Date: Thu, 19 Mar 2026 22:25:09 +0100
> Subject: [PATCH v5 3/6] explain: show prefetch stats in EXPLAIN (ANALYZE)
>
> This adds details about AIO / prefetch for executor nodes using a
> ReadStream. Right now this applies only to BitmapHeapScan, because
> that's the only scan node using a ReadStream and collecting
> instrumentation from workers.

I don't understand why that means it should be done as part of this commit,
whereas seqscans shouldn't?

> The stats are collected by the ReadStream unconditionally, i.e. without
> EXPLAIN ANALYZE etc. The amount of statistics is trivial (a handful of
> integer counters), it's not worth gating this by a flag.

I think this is not true anymore.

> --- a/doc/src/sgml/ref/explain.sgml
> +++ b/doc/src/sgml/ref/explain.sgml
> @@ -46,6 +46,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
> TIMING [ <replaceable class="parameter">boolean</replaceable> ]
> SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
> MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
> + IO [ <replaceable class="parameter">boolean</replaceable> ]
> FORMAT { TEXT | XML | JSON | YAML }
> </synopsis>
> </refsynopsisdiv>
> @@ -298,6 +299,17 @@ ROLLBACK;
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><literal>IO</literal></term>
> + <listitem>
> + <para>
> + Include information on I/O performed by each node.
> + This parameter may only be used when <literal>ANALYZE</literal> is also
> + enabled. It defaults to <literal>TRUE</literal>.
> + </para>
> + </listitem>
> + </varlistentry>
> +
> <varlistentry>
> <term><literal>FORMAT</literal></term>
> <listitem>

I wonder if it's sane to default this to true. The amount of adjustments in
the regression tests required in the later commits makes me a bit wary. But I
guess just going around and adding a few dozen OPTION OFF is what we've lately
been doing, so perhaps it's ok.

> @@ -1198,6 +1198,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> scan->rs_base.rs_nkeys = nkeys;
> scan->rs_base.rs_flags = flags;
> scan->rs_base.rs_parallel = parallel_scan;
> + scan->rs_base.rs_instrument = NULL;
> scan->rs_strategy = NULL; /* set in initscan */
> scan->rs_cbuf = InvalidBuffer;
>
> @@ -1267,6 +1268,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
>
> scan->rs_read_stream = NULL;
>
> + /* allocate instrumentation */
> + if (flags & SO_SCAN_INSTRUMENT)
> + scan->rs_base.rs_instrument = palloc0_object(TableScanInstrumentation);
> + else
> + scan->rs_base.rs_instrument = NULL;
> +
> /*
> * Set up a read stream for sequential scans and TID range scans. This
> * should be done after initscan() because initscan() allocates the
> @@ -1296,7 +1303,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> MAIN_FORKNUM,
> cb,
> scan,
> - 0);
> + 0,
> + &scan->rs_base.rs_instrument->io);
> }
> else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
> {
> @@ -1307,7 +1315,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
> MAIN_FORKNUM,
> bitmapheap_stream_read_next,
> scan,
> - sizeof(TBMIterateResult));
> + sizeof(TBMIterateResult),
> + &scan->rs_base.rs_instrument->io);
> }
>
> scan->rs_vmbuffer = InvalidBuffer;

It doesn't seem great to dereference rs_base.rs_instrument if it's NULL, even
if that works because you then just compute the address of that. I think
you'll end up passing the offset of io in TableScanInstrumentation this way.

> +static void
> +print_io_usage(ExplainState *es, IOStats *stats)
> +{
> + /* don't print stats if there's nothing to report */
> + if (stats->prefetch_count > 0)
> + {
> + if (es->format == EXPLAIN_FORMAT_TEXT)
> + {
> + /* prefetch distance info */
> + ExplainIndentText(es);
> + appendStringInfo(es->str, "Prefetch: avg=%.3f max=%d capacity=%d",
> + (stats->distance_sum * 1.0 / stats->prefetch_count),
> + stats->distance_max,
> + stats->distance_capacity);
> + appendStringInfoChar(es->str, '\n');
> +
> + /* prefetch I/O info (only if there were actual I/Os) */
> + if (stats->io_count > 0)
> + {
> + ExplainIndentText(es);
> + appendStringInfo(es->str, "I/O: stalls=%" PRIu64,
> + stats->stall_count);
> +
> + appendStringInfo(es->str, " size=%.3f inprogress=%.3f",
> + (stats->io_nblocks * 1.0 / stats->io_count),
> + (stats->io_in_progress * 1.0 / stats->io_count));
> +
> + appendStringInfoChar(es->str, '\n');
> + }

Why have these separate appendStringInfoChar's if you just could just include
the \n in the prior appendStringInfo?

Separately, is %.3f really the right thing? When is that degree of precision
useful here?

> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 7e2c1b7467b..d181fd7b8ff 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -104,6 +104,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
> TBMIterator tbmiterator = {0};
> ParallelBitmapHeapState *pstate = node->pstate;
> dsa_area *dsa = node->ss.ps.state->es_query_dsa;
> + EState *estate = node->ss.ps.state;
>
> if (!pstate)
> {
> @@ -146,7 +147,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node)
> {
> node->ss.ss_currentScanDesc =
> table_beginscan_bm(node->ss.ss_currentRelation,
> - 0,
> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
> node->ss.ps.state->es_snapshot,
> 0,
> NULL);
> @@ -329,6 +330,9 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
> */
> si->exact_pages += node->stats.exact_pages;
> si->lossy_pages += node->stats.lossy_pages;
> +
> + /* collect prefetch info for this process from the read_stream */
> + ACCUMULATE_IO_STATS(&si->stats.io, &node->ss.ss_currentScanDesc->rs_instrument->io);
> }
>
> /*

I see no point in having ACCUMULATE_IO_STATS be a macro instead of a static
inline?

> +/*
> + * Update stream stats with current pinned buffer depth.
> + *
> + * Called once per buffer returned to the consumer in read_stream_next_buffer().
> + * Records the number of pinned buffers at that moment, so we can compute the
> + * average look-ahead depth.
> + */
> +static inline void
> +read_stream_update_stats_prefetch(ReadStream *stream)
> +{
> + IOStats *stats = stream->stats;
> +
> + if (stats == NULL)
> + return;
> +
> + stats->prefetch_count++;
> + stats->distance_sum += stream->pinned_buffers;
> + if (stream->pinned_buffers > stats->distance_max)
> + stats->distance_max = stream->pinned_buffers;
> +}
> +
> +/*
> + * Update stream stats about size of I/O requests.
> + *
> + * We count the number of I/O requests, size of requests (counted in blocks)
> + * and number of in-progress I/Os.
> + */
> +static inline void
> +read_stream_update_stats_io(ReadStream *stream, int nblocks, int in_progress)
> +{
> + IOStats *stats = stream->stats;
> +
> + if (stats == NULL)
> + return;
> +
> + stats->io_count++;
> + stats->io_nblocks += nblocks;
> + stats->io_in_progress += in_progress;
> +}
> +
> +static inline void
> +read_stream_update_stats_stall(ReadStream *stream)
> +{

Wonder if we should rename "stall" to "wait", seems like it might be easier to
understand for a more general audience.

> /* ----------------------------------------------------------------
> @@ -404,9 +471,50 @@ void
> ExecSeqScanInitializeWorker(SeqScanState *node,
> ParallelWorkerContext *pwcxt)
> {
> + EState *estate = node->ss.ps.state;
> ParallelTableScanDesc pscan;
> + char *ptr;
> + Size size;
>
> pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
> node->ss.ss_currentScanDesc =
> - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan);
> + table_beginscan_parallel(node->ss.ss_currentRelation,
> + (estate->es_instrument) ? SO_SCAN_INSTRUMENT : 0,
> + pscan);
> +
> + /*
> + * Workers don't get the pscan_len value in scan descriptor, so use the
> + * TAM callback again. The result has to match the earlier result in
> + * ExecSeqScanEstimate.
> + */
> + size = table_parallelscan_estimate(node->ss.ss_currentRelation,
> + estate->es_snapshot);
> +

That seems quite grotty...

> + ptr = (char *) pscan;
> + ptr += MAXALIGN(size);
> +
> + if (node->ss.ps.instrument)
> + node->sinstrument = (SharedSeqScanInstrumentation *) ptr;

Seems like the appropriate instrumentation pointers should be computed by the
leader, not in the workers.

> diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
> index 97cb7206676..3fa6d838545 100644
> --- a/src/test/regress/expected/explain.out
> +++ b/src/test/regress/expected/explain.out
> @@ -93,112 +93,126 @@ select explain_filter('explain (analyze, buffers, io off, format text) select *
> Execution Time: N.N ms
> (3 rows)
>
> -select explain_filter('explain (analyze, buffers, io off, format xml) select * from int8_tbl i8');
> - explain_filter
> ---------------------------------------------------------
> - <explain xmlns="http://www.postgresql.org/N/explain"> +
...
> + <explain xmlns="http://www.postgresql.org/N/explain"> +
...

Seems these tests should really use unaligned output, to avoid this stupid
"every line changes despite not really anything changing" type of diff.

Greetings,

Andres Freund

In response to

Re: EXPLAIN: showing ReadStream / prefetch stats at 2026-03-26 00:35:01 from Tomas Vondra

Responses

Re: EXPLAIN: showing ReadStream / prefetch stats at 2026-03-30 18:21:29 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrey Silitskiy	2026-03-30 17:31:04	Re: Exit walsender before confirming remote flush in logical replication
Previous Message	Greg Burd	2026-03-30 17:17:38	Re: clang bug affecting greenfly