| From: | Andres Freund <andres(at)anarazel(dot)de> |
|---|---|
| To: | Lukas Fittl <lukas(at)fittl(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Peter Smith <smithpb2250(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: Stack-based tracking of per-node WAL/buffer usage |
| Date: | 2026-04-04 19:39:16 |
| Message-ID: | 57biou6l65r7gr4nunoe6lignz2x6m3w45gihoypaez4pc46di@txj3bakhj66l |
| Lists: | pgsql-hackers |
Hi,
On 2026-04-04 02:43:50 -0700, Lukas Fittl wrote:
> Attached v12, rebased, otherwise no changes.
>
> I realize time to freeze is getting close, and whilst I'd love to see
> this go in, I'm also realistic - so I'll just do my best to support
> review in the off chance we can make it for this release.
I unfortunately think there are enough nontrivial design decisions - that I
don't have a sufficiently confident assessment of - that I would be pretty
hesitant to commit this at this stage of the cycle without some architectural
review by senior folks. If Heikki did another round or two of review, it'd be a
different story.
I think the high level design is a huge improvement and goes in the right
direction, but some of the lower level stuff I'm far less confident about.
> On that note, I think 0001 and 0002 are independently useful
> refactorings to split the different kinds of instrumentation that
> should be ready to go, and I don't think should conflict much with
> other patches in this commitfest.
Yea, I'll see that those get committed.
I could also see 0004 as potentially worth getting committed separately,
although I'm a bit worried about test stability.
I recently looked at a coverage report, in the context of the index
prefetching patch, and was a bit surprised that prominent parallel executor
nodes have no coverage for EXPLAIN ANALYZE.
parallel BHS is not covered:
https://coverage.postgresql.org/src/backend/executor/nodeBitmapHeapscan.c.gcov.html#L536
parallel IOS is not covered:
https://coverage.postgresql.org/src/backend/executor/nodeIndexonlyscan.c.gcov.html#L430
> From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
> stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code
> section, we utilized continuously incrementing global counters that get
> updated when the actual activity (e.g. shared block read) occurred, and
> then calculated a diff when the code section ended. This resulted in a
> bottleneck for executor node instrumentation specifically, with the
> function BufferUsageAccumDiff showing up in profiles and in some cases
> adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity
> writes into the current stack entry. In the case of executor nodes, this
> means that each node gets its own stack entry that is pushed at
> InstrStartNode, and popped at InstrEndNode. Stack entries are zero
> initialized (avoiding the diff mechanism) and get added to their parent
> entry when they are finalized, i.e. no more modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks
> must involve either a top-level QueryInstrumentation struct, and its
> associated InstrQueryStart/InstrQueryStop helpers (which use resource
> owners to handle aborts), or the Instrumentation struct itself with
> dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> consistent state after an abort.
>
> This also drops the global pgBufferUsage, any callers interested in
> measuring buffer activity should instead utilize InstrStart/InstrStop.
>
> The related global pgWalUsage is kept for now due to its use in pgstat
> to track aggregate WAL activity and heap_page_prune_and_freeze for
> measuring FPIs.
Probably worth stating what the performance overhead of WAL and BUFFERS is
after this patch?
> @@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
> */
> if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
> {
> - /*
> - * Set up to track total elapsed time in ExecutorRun. Make sure the
> - * space is allocated in the per-query context so it will go away at
> - * ExecutorEnd.
> - */
> + /* Set up to track total elapsed time in ExecutorRun. */
> if (queryDesc->totaltime == NULL)
> - {
> - MemoryContext oldcxt;
> -
> - oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
> - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
> - MemoryContextSwitchTo(oldcxt);
> - }
> + queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
> }
> }
Not at all the fault of this patch, but it does seem somewhat odd to me that
we handle pgss/auto_explain wanting instrumentation by them updating the
QueryDesc->totaltime, rather than having extensions add an eflag to ask
standard_ExecutorStart to do so.
> @@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
> * and PARALLEL_KEY_BUFFER_USAGE.
> *
> * If there are no extensions loaded that care, we could skip this. We
> - * have no way of knowing whether anyone's looking at pgWalUsage or
> - * pgBufferUsage, so do it unconditionally.
> + * have no way of knowing whether anyone's looking at instrumentation, so
> + * do it unconditionally.
> */
> shm_toc_estimate_chunk(&pcxt->estimator,
> mul_size(sizeof(WalUsage), pcxt->nworkers));
> @@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> Relation indexRel;
> LOCKMODE heapLockmode;
> LOCKMODE indexLockmode;
> + QueryInstrumentation *instr;
> WalUsage *walusage;
> BufferUsage *bufferusage;
> int sortmem;
> @@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> tuplesort_attach_shared(sharedsort, seg);
>
> /* Prepare to track buffer usage during parallel execution */
> - InstrStartParallelQuery();
> + instr = InstrStartParallelQuery();
>
> /*
> * Might as well use reliable figure when doling out maintenance_work_mem
> @@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> /* Report WAL/buffer usage during parallel execution */
> bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
> walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
> - InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + InstrEndParallelQuery(instr,
> + &bufferusage[ParallelWorkerNumber],
> &walusage[ParallelWorkerNumber]);
>
> index_close(indexRel, indexLockmode);
Again not your fault, but it feels like the parallel index build
infrastructure is all wrong. Reimplementing this stuff for every index type
makes no sense.
> @@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> QueryEnvironment *queryEnv)
> {
> PlannedStmt *plan;
> - instr_time planstart,
> - planduration;
> - BufferUsage bufusage_start,
> - bufusage;
> + QueryInstrumentation *instr = NULL;
> MemoryContextCounters mem_counters;
> MemoryContext planner_ctx = NULL;
> MemoryContext saved_ctx = NULL;
>
> + if (es->buffers)
> + instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> + else
> + instr = InstrQueryAlloc(INSTRUMENT_TIMER);
I was momentarily confused why this only checks es->buffers, not es->wal, but
I now see this is just for the planning, and we haven't displayed WAL there so
far, even if it's possible that we would emit some WAL. I'd probably name
it plan_instr or such, and go for one InstrQueryAlloc() with a
flags variable that's set depending on es->buffers, because it seems likely
we'll add more stuff to track over time...
> if (es->memory)
> {
> /*
> @@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> saved_ctx = MemoryContextSwitchTo(planner_ctx);
> }
>
> - if (es->buffers)
> - bufusage_start = pgBufferUsage;
> - INSTR_TIME_SET_CURRENT(planstart);
> + InstrQueryStart(instr);
>
> /* plan the query */
> plan = pg_plan_query(query, queryString, cursorOptions, params, es);
>
> - INSTR_TIME_SET_CURRENT(planduration);
> - INSTR_TIME_SUBTRACT(planduration, planstart);
> + InstrQueryStopFinalize(instr);
>
> if (es->memory)
> {
> @@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> MemoryContextMemConsumed(planner_ctx, &mem_counters);
> }
>
> - /* calc differences of buffer counters. */
> - if (es->buffers)
> - {
> - memset(&bufusage, 0, sizeof(BufferUsage));
> - BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
> - }
> -
> /* run it (if needed) and produce output */
> ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
> - &planduration, (es->buffers ? &bufusage : NULL),
> + &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
> es->memory ? &mem_counters : NULL);
> }
Kinda wonder if some of this could be moved into a preliminary patch, without
depending on the stack infrastructure. Even with the current instrumentation
buffers and timing could be handled that way.
> @@ -590,7 +582,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
>
> /* grab serialization metrics before we destroy the DestReceiver */
> if (es->serialize != EXPLAIN_SERIALIZE_NONE)
> - serializeMetrics = GetSerializationMetrics(dest);
> + {
> + SerializeMetrics *metrics = GetSerializationMetrics(dest);
> +
> + if (metrics)
> + memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
> + }
The current code just returning made-up zeroed metrics in that case is
somewhat weird...
> +++ b/src/backend/executor/README.instrument
> @@ -0,0 +1,227 @@
> +src/backend/executor/README.instrument
> +
> +Instrumentation
> +===============
> +
> +The instrumentation subsystem measures time, buffer usage and WAL activity
> +during query execution and other similar activities. It is used by
> +EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
> +activity and/or timing metrics over a section of code.
> +
> +The design has two central goals:
> +
> +* Make it cheap to measure activity in a section of code, even when
> + that section is called many times and the aggregate is what is used
> + (as is the case with per-node instrumentation in the executor)
> +
> +* Ensure nested instrumentation accurately measures activity/timing,
> + and counter updates from activity get written to the currently
> + active instrumentation and accumulated upward to parent nodes when
> + finalized, considering aborts due to errors.
The second paragraph assumes implementation choices that haven't been
introduced yet. Without knowing that it's something stack based it's very
unclear what "currently active" and "accumulate upward" mean.
I'd probably just say that nested instrumentation needs to work including in
the case of errors getting thrown.
> +Instrumentation Options
> +-----------------------
> +
> +Callers specify what to measure with a bitmask of InstrumentOption flags:
> +
> + INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
> + INSTRUMENT_TIMER -- wall-clock timing and row counts
> + INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
> + INSTRUMENT_WAL -- WAL records, FPI, bytes
> +
> +INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
> +(described below) for efficient handling of counter values.
Not for now, but it seems like es->memory really should also be implemented
this way, rather than be implemented ad-hoc.
> +Struct Hierarchy
> +----------------
> +
> +There are four instrumentation structs, each specialized for a different
> +scope:
I'd just say "the following", otherwise the "four" will inevitably not get
updated when another type of instrumentation is introduced :)
> +Stack-based instrumentation
> +===========================
> +
> +For tracking WAL or buffer usage counters, the specialized stack-based
> +instrumentation is used.
I think I'd add a sentence explaining why a stack is used, i.e. that the
alternative approach of taking a snapshot of all stats at the start of the
instrumented section and diffing it at the end & accumulating the differences,
is quite expensive.
And then explain that the solution, i.e. to point to a struct that should
receive stats, has the problem of not supporting nesting, if done naively.
Without those it's a bit hard to understand the motivation for all this.
> +Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
> +INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
> +writes directly into instr_stack.current. Each instrumentation node starts
> +zeroed, so the values it accumulates while on top of the stack represent
> +exactly the activity that occurred during that time.
> +
> +Every Instrumentation node has a target, or parent, it will be accumulated
> +into, which is typically the Instrumentation that was the current stack
> +entry when it was created.
*(other than instr_top)?
> +For example, when Seq Scan A gets finalized in regular execution via ExecutorFinish,
> +its instrumentation data gets added to the immediate parent in
> +the execution tree, the NestLoop, which will then get added to Query A's
> +QueryInstrumentation, which then accumulates to the parent.
> +
> +While we can typically think of this as a tree, the NodeInstrumentation
> +underneath a particular QueryInstrumentation could behave differently --
> +for example, it could propagate directly to the QueryInstrumentation, in
> +order to not show cumulative numbers in EXPLAIN ANALYZE.
Hm. This seems like a somewhat random example, why would one want this?
> +Note these relationships are partially implicit, especially when it comes
> +to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
> +unfinalized child nodes. The parent of a QueryInstrumentation itself is
> +determined by the stack (see below): when a query is finalized or cleaned
> +up on abort, its counters are accumulated to whatever entry is then current
> +on the stack, which is typically instr_top.
> +
> +
> +Finalization and Abort Safety
> +=============================
> +
> +Finalization is the process of rolling up a node's buffer/WAL counters to
> +its parent. In normal execution, nodes are pushed onto the stack when they
> +start and popped when they stop; at finalization time their accumulated
> +counters are added to the parent.
> +Due to the use of longjmp for error handling, functions can exit abruptly
> +without executing their normal cleanup code. On abort, two things need
> +to happen:
> +
> +1. Reset the stack to the appropriate level. This ensures that we don't
s/Reset the stack to/The stack is reset to/
And maybe something like:
s/the appropriate level/the level saved at the start of the aborting (sub-)transaction/
> + later try to update counters on a freed stack entry. We also need to
> + ensure that the stack entry that was current before a particular
> + Instrumentation started, is current again after it stops.
> +
> +2. Finalize all affected Instrumentation nodes, rolling up their counters
> + to the highest surviving Instrumentation, so that data is not lost.
I was about to say that to me it's the lowest surviving one, but I guess
that's just a question of whether you think of stacks as growing up or down...
Perhaps it should be something like "innermost surviving"?
> +For example, if Seq Scan B aborts while the stack is:
> +
> + instr_top (implicit bottom)
> + 0: Query A
> + 1: NestLoop
> + 2: Seq Scan B
> +
> +The abort handler for Query A accumulates all unfinalized children (Seq
> +Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
> +unwinds the stack and accumulates Query A's counters to instr_top.
s/stack/instrumentation stack/, otherwise it might be confused with the C
callstack.
> +Note that on abort the children do not accumulate through each other (Seq
> +Scan B -> NestLoop -> Query A); they all accumulate directly to their
> +parent QueryInstrumentation. This means the order in which children are
> +released does not matter -- important because ResourceOwner cleanup does
> +not guarantee a particular release order.
s/important/this is important/
> +The per-node breakdown is lost, but the query-level total is what survives
> +the abort.
Is that actually the typical scenario? Most of the time - including afaict in
your example above - there won't be a query-level instrumentation that
survives, but the stats are accumulated into instr_top.
> +If multiple QueryInstrumentations are active on the stack (e.g. nested
> +portals), each one's abort handler uses InstrStopFinalize to unwind to
> +whichever entry is higher up, so they compose correctly regardless of
> +release order.
Maybe "the abort handler of each uses InstrStopFinalize() to accumulate the
statistics to its parent entry"?
> +Resource Owner (QueryInstrumentation)
> +-------------------------------------
> +
> +QueryInstrumentation registers with the current ResourceOwner at start.
> +On transaction abort, the resource owner system calls the release callback,
> +which walks unfinalized child entries, accumulates their data, unwinds the
> +stack, and destroys the dedicated memory context (freeing the
> +QueryInstrumentation and all child allocations as a unit).
> +
> +This is the recommended approach when the instrumented code already has an
> +appropriate resource owner (e.g. it runs inside a portal). The query
> +executor uses this path.
> +
> +PG_FINALLY (base Instrumentation)
> +----------------------------------
> +
> +When no suitable resource owner exists, or when the caller wants to inspect
> +the instrumentation data even after an error, the base Instrumentation can
> +be used with a PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
> +
> +Both mechanisms add overhead, so neither is suitable for high-frequency
I guess the "both" here refers to "Resource Owner" and "PG_FINALLY"? Given
this paragraph looks to be in the PG_FINALLY section, that's not quite
obvious.
Might just need another newline or such to make it clearer that this is not in
the PG_FINALLY section. Or maybe the two options should be in a bulleted list
or such.
> +instrumentation like per-node measurements in the executor. Instead,
> +plan node and trigger children rely on their parent QueryInstrumentation
> +for abort safety: they are allocated in the parent's memory context and
> +registered in its unfinalized-entries list, so the parent's abort handler
> +recovers their data automatically. In normal execution, children are
> +finalized explicitly by the caller.
> +Parallel Query
> +--------------
> +
> +Parallel workers get their own QueryInstrumentation so they can measure
> +buffer and WAL activity independently, then copy the totals into shared
> +memory at shutdown. The leader accumulates these into its own stack.
s/shared/dynamic shared/
Maybe s/shutdown/worker shutdown/?
> +When per-node instrumentation is active, parallel workers skip per-node
> +finalization at shutdown to avoid double-counting; the per-node data is
> +aggregated separately through InstrAggNode().
That's a bit gnarly, but I don't really see a better option.
> +Memory Handling
> +===============
> +
> +Instrumentation objects that use the stack must survive until finalization
> +runs, including the abort case. To ensure this, QueryInstrumentation
> +creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
> +of TopMemoryContext. All child instrumentation (nodes, triggers) should be
> +allocated in this context.
> +On successful completion, instr_cxt is reparented to CurrentMemoryContext
> +so its lifetime is tied to the caller's context. On abort, the
> +ResourceOwner cleanup frees it after accumulating the instrumentation data
> +to the current stack entry after resetting the stack.
Makes sense.
I mildly wonder if we should create one minimally sized "Instrumentations"
node under TopMemoryContext, below which the "Instrumentation" contexts are
created, instead of doing so directly under TopMemoryContext. But that's
something that can easily be evolved later.
> @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> estate->es_top_eflags = eflags;
> - estate->es_instrument = queryDesc->instrument_options;
> estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
>
> + /*
> + * Set up query-level instrumentation if needed. We do this before
> + * InitPlan so that node and trigger instrumentation can be allocated
> + * within the query's dedicated instrumentation memory context.
> + */
> + if (!queryDesc->totaltime && queryDesc->instrument_options)
> + {
> + queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> + estate->es_instrument = queryDesc->totaltime;
> + }
> +
> /*
> * Set up an AFTER-trigger statement context, unless told not to, or
> * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
It seems pretty weird to still have queryDesc->totaltime *sometimes* created
by pgss etc, but also create it in standard_ExecutorStart if not already
created. What if the explain options aren't compatible? Sure
pgss/auto_explain use ALL, but that's not a given.
> + /* Start up instrumentation for this execution run */
> if (queryDesc->totaltime)
> - InstrStart(queryDesc->totaltime);
> + {
> + InstrQueryStart(queryDesc->totaltime);
> +
> + /*
> + * Remember all node entries for abort recovery. We do this once here
> + * after InstrQueryStart has pushed the parent stack entry.
> + */
> + if (estate->es_instrument &&
> + estate->es_instrument->instr.need_stack &&
> + !queryDesc->already_executed)
> + ExecRememberNodeInstrumentation(queryDesc->planstate,
> + queryDesc->totaltime);
> + }
Hm. Was briefly worried about the overhead of
ExecRememberNodeInstrumentation() in the context of cursors. But I see it's
only done once.
But why do we not just associate the NodeInstrumentation's with the
QueryInstrumentation during the creation of the NodeInstrumentation?
> + /*
> + * Accumulate per-node and trigger statistics to their respective parent
> + * instrumentation stacks.
>
> + * We skip this in parallel workers because their per-node stats are
> + * reported individually via ExecParallelReportInstrumentation, and the
> + * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
> + * we accumulated here, the leader would double-count: worker parent nodes
> + * would already include their children's stats, and then the leader's
> + * accumulation would add the children again.
> + */
Haven't looked into how this all works in sufficient detail, so I'm just
asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE)
showing per-worker "subtrees"?
> + if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
> + {
> + ExecFinalizeNodeInstrumentation(queryDesc->planstate);
> +
> + ExecFinalizeTriggerInstrumentation(estate);
> + }
> +
> if (queryDesc->totaltime)
> - InstrStop(queryDesc->totaltime);
> + InstrQueryStopFinalize(queryDesc->totaltime);
I'd probably move the estate->es_instrument && !IsParallelWorker() check into
the if (queryDesc->totaltime).
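I.e. something like this (untested sketch against the patch's names):

```
	if (queryDesc->totaltime)
	{
		if (estate->es_instrument && !IsParallelWorker())
		{
			ExecFinalizeNodeInstrumentation(queryDesc->planstate);
			ExecFinalizeTriggerInstrumentation(estate);
		}

		InstrQueryStopFinalize(queryDesc->totaltime);
	}
```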
> @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> palloc0_array(FmgrInfo, n);
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0_array(ExprState *, n);
> - if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> + if (qinstr)
> + resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
Hm. Why do we not need to pass down the instrument_options anymore? I guess
the assumption is that we always are going to use the flags from qinstr?
Is that right? Because right now pgss/auto_explain use _ALL, even when an
EXPLAIN ANALYZE doesn't.
> +static void
> +ExecFinalizeTriggerInstrumentation(EState *estate)
> +{
> + List *rels = NIL;
> +
> + rels = list_concat(rels, estate->es_tuple_routing_result_relations);
> + rels = list_concat(rels, estate->es_opened_result_relations);
> + rels = list_concat(rels, estate->es_trig_target_relations);
Maybe ExecGetTriggerResultRel() needs a comment about needing to update
ExecFinalizeTriggerInstrumentation() if trigger stuff were to be stored in yet
another place?
> @@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
> instrument = GetInstrumentationArray(instrumentation);
> instrument += i * instrumentation->num_workers;
> for (n = 0; n < instrumentation->num_workers; ++n)
> + {
> InstrAggNode(planstate->instrument, &instrument[n]);
>
> + /*
> + * Also add worker WAL usage to the global pgWalUsage counter.
> + *
> + * When per-node instrumentation is active, parallel workers skip
> + * ExecFinalizeNodeInstrumentation (to avoid double-counting in
> + * EXPLAIN), so per-node WAL activity is not rolled up into the
> + * query-level stats that InstrAccumParallelQuery receives. Without
> + * this, pgWalUsage would under-report WAL generated by parallel
> + * workers when instrumentation is active.
> + */
> + WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
> + }
I'm not sure I understand why this doesn't also lead to double counting, given
that InstrAccumParallelQuery() does also add the worker's usage to pgWalUsage?
> +static bool
> +ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> + Instrumentation *parent = (Instrumentation *) context;
> +
> + Assert(parent != NULL);
> +
> + if (node == NULL)
> + return false;
> +
> + /*
> + * Recurse into children first (bottom-up accumulation), passing our
> + * instrumentation as the parent context. This ensures children can
> + * accumulate to us even if they were never executed by the leader (e.g.
> + * nodes beneath Gather that only workers ran).
> + */
> + planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
> + node->instrument ? &node->instrument->instr : parent);
I don't think I understand that comment. What changes if the leader's node
was never executed?
> @@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
> estate->es_partition_directory = NULL;
> }
>
> + /*
> + * Make sure the instrumentation context gets freed. This usually gets
> + * re-parented under the per-query context in InstrQueryStopFinalize, but
> + * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
> + * gets called, so we would otherwise leak it in TopMemoryContext.
> + */
> + if (estate->es_instrument && estate->es_instrument->instr.need_stack)
> + MemoryContextDelete(estate->es_instrument->instr_cxt);
> +
Ugh.
> diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
> index bc551f95a08..6892706a83a 100644
> --- a/src/backend/executor/instrument.c
> +++ b/src/backend/executor/instrument.c
> @@ -16,30 +16,46 @@
> #include <unistd.h>
>
> #include "executor/instrument.h"
> +#include "utils/memutils.h"
> +#include "utils/resowner.h"
>
> -BufferUsage pgBufferUsage;
> -static BufferUsage save_pgBufferUsage;
> WalUsage pgWalUsage;
Why do we still need pgWalUsage if we have the same data in instr_stack?
> -static WalUsage save_pgWalUsage;
> +Instrumentation instr_top;
> +InstrStackState instr_stack = {0, 0, NULL, &instr_top};
I'd use designated initializers to make this easier to read.
> -static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
> -static void WalUsageAdd(WalUsage *dst, WalUsage *add);
> +void
> +InstrStackGrow(void)
> +{
> + int space = instr_stack.stack_space;
> +
> + if (instr_stack.entries == NULL)
> + {
> + space = 10; /* Allocate sufficient initial space for
> + * typical activity */
> + instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
> + sizeof(Instrumentation *) * space);
> + }
> + else
> + {
> + space *= 2;
> + instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
> + }
>
> + /* Update stack space after allocation succeeded to protect against OOMs */
> + instr_stack.stack_space = space;
> +}
Perhaps worth adding an assert to check that we actually needed space?
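I.e. at the top of InstrStackGrow(), something like (untested, and assuming
grow is only ever called when the stack is full):

```
	Assert(instr_stack.stack_size == instr_stack.stack_space);
```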
> +/*
> + * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
> + *
> + * Note that this intentionally allows passing a stack that is not the current
> + * top, as can happen with PG_FINALLY, or resource owners, which don't have a
> + * guaranteed cleanup order.
> + *
> + * We are careful here to achieve two goals:
> + *
> + * 1) Reset the stack to the parent of whichever of the released stack entries
> + * has the lowest index
> + * 2) Accumulate all instrumentation to the currently active instrumentation,
> + * so that callers get a complete picture of activity, even after an abort
> + */
> +void
> +InstrStopFinalize(Instrumentation *instr)
> +{
> + int idx = -1;
> +
> + for (int i = instr_stack.stack_size - 1; i >= 0; i--)
> + {
> + if (instr_stack.entries[i] == instr)
> + {
> + idx = i;
> + break;
> + }
> + }
So this may not find a stack entry, because a prior call to
InstrStopFinalize() already removed it from the stack, right?
Makes it a bit more error prone. Maybe we should store whether the element is
still on the stack in the Instrumentation, that way we a) can error out if we
don't find it on the stack b) avoid searching the stack if already removed.
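Roughly what I'm imagining (untested sketch; on_stack and find_stack_index()
are invented here, the latter being the existing search loop factored out):

```
void
InstrStopFinalize(Instrumentation *instr)
{
	/* on_stack would be set when the entry is pushed in InstrStart() */
	if (instr->on_stack)
	{
		int		idx = find_stack_index(instr);

		if (idx < 0)
			elog(ERROR, "instrumentation entry not found on stack");

		/* unwind to idx's parent, as the current code does */
		instr->on_stack = false;
	}
	/* else: already removed by an earlier cleanup, skip the search */
}
```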
> if (instr->need_timer)
> + InstrStopTimer(instr);
> +
> + InstrAccumStack(instr_stack.current, instr);
> +}
Not that it's a huge issue, but seems like it'd be neater if the need_timer
thing weren't duplicated, but implemented by calling InstrStop()?
> +void
> +InstrQueryStart(QueryInstrumentation *qinstr)
> +{
> + InstrStart(&qinstr->instr);
> +
> + if (qinstr->instr.need_stack)
> + {
> + Assert(CurrentResourceOwner != NULL);
> + qinstr->owner = CurrentResourceOwner;
> +
> + ResourceOwnerEnlarge(qinstr->owner);
> + ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
> + }
> +}
> +
> +void
> +InstrQueryStop(QueryInstrumentation *qinstr)
> +{
> + InstrStop(&qinstr->instr);
> +
> + if (qinstr->instr.need_stack)
> + {
> + Assert(qinstr->owner != NULL);
> + ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
> + qinstr->owner = NULL;
> + }
> +}
> +
> +void
> +InstrQueryStopFinalize(QueryInstrumentation *qinstr)
> +{
> + InstrStopFinalize(&qinstr->instr);
Why are these Instr[Query]StopFinalize() rather than just
Instr[Query]Finalize()?
> + if (!qinstr->instr.need_stack)
> + return;
Perhaps worth asserting that qinstr->{instr_cxt,owner} are NULL in this case?
> +/* start instrumentation during parallel executor startup */
> +QueryInstrumentation *
> +InstrStartParallelQuery(void)
> +{
> + QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> + InstrQueryStart(qinstr);
> + return qinstr;
> +}
Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?
> From 80dbf65f79deca08f5e10872cac226d0d8edca0e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sun, 15 Mar 2026 21:44:58 -0700
> Subject: [PATCH v12 6/9] instrumentation: Use Instrumentation struct for
> parallel workers
>
> This simplifies the DSM allocations a bit since we don't need to
> separately allocate WAL and buffer usage, and allows the easier future
> addition of a third stack-based struct being discussed.
Does look a bit nicer.
> From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sat, 7 Mar 2026 17:52:24 -0800
> Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> instructions by inlining
>
> For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> ExecProcNodeInstr when starting/stopping instrumentation for that node.
>
> Previously each ExecProcNodeInstr would check which instrumentation
> options are active in the InstrStartNode/InstrStopNode calls, and do the
> corresponding work (timers, instrumentation stack, etc.). These
> conditionals being checked for each tuple being emitted add up, and cause
> non-optimal set of instructions to be generated by the compiler.
>
> Because we already have an existing mechanism to specify a function
> pointer when instrumentation is enabled, we can instead create specialized
> functions that are tailored to the instrumentation options enabled, and
> avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
> the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
> test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
> top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
> from ~ 20% to ~ 10% on top of actual runtime.
I assume this is to a significant degree due to allowing for inlining. Have
you checked how much of the benefit you get by just putting ExecProcNodeInstr()
into instrument.c?
> +/*
> + * Specialized handling of instrumented ExecProcNode
> + *
> + * These functions are equivalent to running ExecProcNodeReal wrapped in
> + * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
> + * by checking the instrumentation options when the ExecProcNode pointer gets
> + * first set, and then using a special-purpose function for each. This results
> + * in a more optimized set of compiled instructions.
> + */
> +
> +#include "executor/tuptable.h"
> +#include "nodes/execnodes.h"
> +
> +/* Simplified pop: restore saved state instead of re-deriving from array */
> +static inline void
> +InstrPopStackTo(Instrumentation *prev)
> +{
> + Assert(instr_stack.stack_size > 0);
> + Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
> + instr_stack.stack_size--;
> + instr_stack.current = prev;
> +}
> +
> +static inline TupleTableSlot *
> +ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
This might need pg_attribute_always_inline to be reliable; otherwise the
compiler might decide not to actually inline the function...
> @@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
> then rejected by a recheck of the index condition. This happens because a
> GiST index is <quote>lossy</quote> for polygon containment tests: it actually
> returns the rows with polygons that overlap the target, and then we have
> - to do the exact containment test on those rows.
> + to do the exact containment test on those rows. The <literal>Table Buffers</literal>
> + counts indicate how many operations were performed on the table instead of
> + the index. This number is included in the <literal>Buffers</literal> counts.
> </para>
>
> <para>
I wonder if listing "Index Buffers" separately, instead of "Table Buffers",
would make more sense, because normally the number of index accesses is much
smaller and therefore a bit easier to relate to "Buffers".
> @@ -418,6 +418,29 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
> result->instrument = InstrAllocNode(estate->es_instrument,
> result->async_capable);
>
> + /*
> + * IndexScan / IndexOnlyScan track table and index access separately.
> + *
> + * We intentionally don't collect timing for them (even if enabled), since
> + * we don't need it, and executor nodes call InstrPushStack /
> + * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
> + */
> + if (estate->es_instrument && (estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
> + {
> + if (IsA(result, IndexScanState))
> + {
> + IndexScanState *iss = castNode(IndexScanState, result);
> +
> + InstrInitOptions(&iss->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
> + }
> + else if (IsA(result, IndexOnlyScanState))
> + {
> + IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
> +
> + InstrInitOptions(&ioss->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
> + }
> + }
> +
> return result;
> }
Why do this in ExecInitNode(), rather than in ExecInitIndexScan() /
ExecInitIndexOnlyScan()?
> @@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
> ItemPointerGetBlockNumber(tid),
> &node->ioss_VMBuffer))
> {
> + bool found;
> +
> /*
> * Rats, we have to visit the heap to check visibility.
> */
> InstrCountTuples2(node, 1);
> - if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
> +
> + if (table_instr)
> + InstrPushStack(table_instr);
> +
> + found = index_fetch_heap(scandesc, node->ioss_TableSlot);
> +
> + if (table_instr)
> + InstrPopStack(table_instr);
> +
> + if (!found)
> continue; /* no visible tuple, try next index entry */
>
> ExecClearTuple(node->ioss_TableSlot);
As-is this will unfortunately conflict rather badly with the way the index
prefetching patch restructures things, as afterwards neither index nor
index-only scan does the equivalent of index_fetch_heap() anymore. That all
goes through a tableam interface, which in turn calls into the index to get
the tids (to allow for tableam-specific prefetching logic, obviously).
I think this would require putting this into the IndexScanDesc via the
IndexScanInstrumentation etc.
Might be good for you to look at how that stuff works after the index
prefetching patch and comment if you see a problem.
Greetings,
Andres Freund