Re: EXPLAIN: showing ReadStream / prefetch stats

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Lukas Fittl <lukas(at)fittl(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: EXPLAIN: showing ReadStream / prefetch stats
Date: 2026-03-31 18:57:17
Message-ID: 945a7a88-061b-4158-8362-9233c1565ee6@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/31/26 20:10, Andres Freund wrote:
> Hi,
>
> On 2026-03-31 15:16:57 +0200, Tomas Vondra wrote:
>> From 1bd80f66019aea773c2b6f46212172de592efb60 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas(at)vondra(dot)me>
>> Date: Tue, 31 Mar 2026 13:44:23 +0200
>> Subject: [PATCH v6 2/5] switch explain to unaligned for json/xml/yaml
>
> LGTM.
>
>
>> From cd5437f5f49e84b8ae3f8cb8f0df4e605c1b5592 Mon Sep 17 00:00:00 2001
>> From: Tomas Vondra <tomas(at)vondra(dot)me>
>> Date: Tue, 31 Mar 2026 13:47:04 +0200
>> Subject: [PATCH v6 3/5] explain: show prefetch stats in EXPLAIN (ANALYZE)
>>
>> This adds details about AIO / prefetch for executor nodes using a
>> ReadStream. Right now this applies only to BitmapHeapScan, because
>> that's the only scan node using a ReadStream and collecting
>> instrumentation from workers.
>
> s/Right now/As of this commit ... support for ... will be added in subsequent commits/
>

OK

>
> Not for now, but I wonder if we ought to introduce a REPRODUCIBLE option for
> EXPLAIN [ANALYZE] that sets BUFFERS, COSTS, IO, ... the right way for the
> regression tests, instead having to go through and change a gazillion tests
> every time.
>

Yes, having an alias for options that we know are stable sounds useful.

>
>
>> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
>> index 20ff58aa782..5cfe8e24615 100644
>> --- a/contrib/amcheck/verify_heapam.c
>> +++ b/contrib/amcheck/verify_heapam.c
>> @@ -477,7 +477,8 @@ verify_heapam(PG_FUNCTION_ARGS)
>> MAIN_FORKNUM,
>> stream_cb,
>> stream_data,
>> - 0);
>> + 0,
>> + NULL);
>>
>> while ((ctx.buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
>> {
>
> Kinda wondering if we, instead of adding a NULL to a lot of places, should
> instead add a
> read_stream_enable_stats(stream, stats)
> or such.
>

I thought about that too, and I'll probably do it that way. I didn't do
that because having too many helpers is not great, but it seems
reasonable in this case.

>
>> --- a/src/backend/commands/explain.c
>> +++ b/src/backend/commands/explain.c
>> @@ -13,6 +13,8 @@
>> */
>> #include "postgres.h"
>>
>> +#include "gaccess/genam.h"
>
> Why genam.h?
>
>
>> +#include "access/tableam.h"
>
> I'd probably include access/relscan.h instead.
>

Ooops, this is a residue of the initial approach of doing this through
the table AM callbacks. As you say, relscan.h seems enough.

>
>
>> +/*
>> + * show_scan_io_usage
>> + * show info about prefetching for a seq/bitmap scan
>> + *
>> + * Shows summary of stats for leader and workers (if any).
>> + */
>> +static void
>> +show_scan_io_usage(ScanState *planstate, ExplainState *es)
>> +{
>> + Plan *plan = planstate->ps.plan;
>> + IOStats stats;
>> +
>> + if (!es->io)
>> + return;
>> +
>> + /* scan not started or no prefetch stats */
>> + if (!(planstate &&
>> + planstate->ss_currentScanDesc &&
>> + planstate->ss_currentScanDesc->rs_instrument))
>> + return;
>> +
>> + /* Initialize counters with stats from the local process first */
>> + switch (nodeTag(plan))
>> + {
>
> That "local node first" comment looks a bit odd here, given that at that level
> it's for a block that does both?
>

Hmmm, yeah. That comment is a bit misplaced / incomplete with the
current code. Will reword.

>
>> + case T_BitmapHeapScan:
>> + {
>> + SharedBitmapHeapInstrumentation *sinstrument
>> + = ((BitmapHeapScanState *) planstate)->sinstrument;
>> +
>> + /* collect prefetch statistics from the read stream */
>> + stats = planstate->ss_currentScanDesc->rs_instrument->io;
>> +
>> + /*
>> + * get the sum of the counters set within each and every
>> + * process
>> + */
>> + if (sinstrument)
>> + {
>> + for (int i = 0; i < sinstrument->num_workers; ++i)
>> + {
>> + BitmapHeapScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
>> +
>> + AccumulateIOStats(&stats, &winstrument->stats.io);
>> + }
>
> I'm not entirely sure how useful it is to accumulate the stats from worker
> into the leader's stats. Doesn't that mean that the leader's stats can't ever
> be viewed in isolation? I'm also not really clear that it's very useful to
> just smush the stats of the different streams together.
>

This is modeled after show_indexscan_info, which does it like that.
Maybe it's not what we should do for read streams, ofc.

You're right it means the leader stats can't be viewed in isolation, but
we also don't show the worker stats without VERBOSE, so if we didn't
accumulate the stats like this we'd not get the worker I/O stats at all.
I think merging the stats like this is needed for the same reason.

Maybe we should accumulate only without VERBOSE? Then we'd show the
complete data for EXPLAIN (ANALYZE), and per-process data (both for the
leader and workers) with EXPLAIN (ANALYZE, VERBOSE)?

>> +/* ---------------------
>> + * Instrumentation information about read streams
>> + * ---------------------
>> + */
>> +typedef struct IOStats
>> +{
>> + /* number of buffers returned to consumer (for averaging distance) */
>> + uint64 prefetch_count;
>> +
>> + /* sum of pinned_buffers sampled at each buffer return */
>> + uint64 distance_sum;
>> +
>> + /* maximum actual pinned_buffers observed during the scan */
>> + int16 distance_max;
>> +
>> + /* maximum possible look-ahead distance (max_pinned_buffers) */
>> + int16 distance_capacity;
>> +
>> + /* number of waits for a read (for the I/O) */
>> + uint64 wait_count;
>> +
>> + /* I/O stats */
>> + uint64 io_count; /* number of I/Os */
>> + uint64 io_nblocks; /* sum of blocks for all I/Os */
>> + uint64 io_in_progress; /* sum of in-progress I/Os */
>
> IO stats in a struct named IO stats ;)
>

Old MacDonald had a farm, IO IO IO ...

>
>
>> @@ -4068,6 +4068,27 @@ show_scan_io_usage(ScanState *planstate, ExplainState *es)
>> /* Initialize counters with stats from the local process first */
>> switch (nodeTag(plan))
>> {
>> + case T_SeqScan:
>> + {
>> + SharedSeqScanInstrumentation *sinstrument
>> + = ((SeqScanState *) planstate)->sinstrument;
>> +
>> + /* collect prefetch statistics from the read stream */
>> + stats = planstate->ss_currentScanDesc->rs_instrument->io;
>> +
>> + /* get the sum of the counters set within each and every process */
>> + if (sinstrument)
>> + {
>> + for (int i = 0; i < sinstrument->num_workers; ++i)
>> + {
>> + SeqScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
>> +
>> + AccumulateIOStats(&stats, &winstrument->stats.io);
>> + }
>> + }
>> +
>> + break;
>> + }
>> case T_BitmapHeapScan:
>> {
>> SharedBitmapHeapInstrumentation *sinstrument
>
> It's a bit sad how much repetition there is between the per-node support for
> each node type. Not sure there's a great solution though.
>
> Perhaps it'd be nicer if we instead sum up the stats in
> ExecSeqScanRetrieveInstrumentation() etc?
>

The duplication is not great, but wouldn't doing it in
ExecSeqScanRetrieveInstrumentation just move the duplicated code
elsewhere? I don't see how that'd be an improvement.

>> @@ -119,6 +119,13 @@ explain_filter
>> <Actual-Rows>N.N</Actual-Rows>
>> <Actual-Loops>N</Actual-Loops>
>> <Disabled>false</Disabled>
>> + <Average-Prefetch-Distance>N.N</Average-Prefetch-Distance>
>> + <Max-Prefetch-Distance>N</Max-Prefetch-Distance>
>> + <Prefetch-Capacity>N</Prefetch-Capacity>
>> + <I-O-Count>N</I-O-Count>
>> + <I-O-Waits>N</I-O-Waits>
>> + <Average-I-O-Size>N.N</Average-I-O-Size>
>> + <Average-I-Os-In-Progress>N.N</Average-I-Os-In-Progress>
>> <Shared-Hit-Blocks>N</Shared-Hit-Blocks>
>> <Shared-Read-Blocks>N</Shared-Read-Blocks>
>> <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>
>
> The I-O looks a bit ridiculous. But since people here seem to like I/O much
> better than IO (which I don't personally get), it's perhaps the right choice
> :(. I guess we also already have that precedent due to track_io_timing.
>

Uh, I didn't realize the XML transforms it like this. I'm not all that
attached to using "I/O" (even though it's the correct spelling!), but as
you say, track_io_timing already has the same issue.

regards
--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2026-03-31 18:57:48 Re: Improve pgindent's formatting named fields in struct literals and varidic functions
Previous Message Jacob Champion 2026-03-31 18:51:41 Re: Improve OAuth discovery logging