Quick Links

Re: Stack-based tracking of per-node WAL/buffer usage

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Lukas Fittl <lukas(at)fittl(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Stack-based tracking of per-node WAL/buffer usage
Date:	2025-10-22 12:59:02
Message-ID:	mflexfbmbkmql5snr322ldu72dfnjr2dntxx3ippaz4tcso7vl@zmxxgiws3qux
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:
> On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas(at)fittl(dot)com> wrote:
>
> > Attached an updated patch set that addresses the feedback, and also adds
> > the complete removal of the global pgBufferUsage variable in later patches
> > (0005-0007), to avoid counting both the stack and the variable.
> >
>
> See attached the same patch set rebased on latest master.

> From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
> require caller to convert
>
> Previously the Instrumentation logic always converted to seconds, only for many
> of the callers to do unnecessary division to get to milliseconds. Since an upcoming
> refactoring will split the Instrumentation struct, utilize instrtime always to
> keep things simpler.

LGTM, think we should apply this regardless of the rest of the patches.

> @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
> dst->running = true;
> dst->firsttuple = add->firsttuple;
> }
> - else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
> + else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
> dst->firsttuple = add->firsttuple;

This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...

> diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
> index f71a851b18d..646934020d1 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -184,6 +184,8 @@ GetTimerFrequency(void)
> #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
> ((x).ticks += (y).ticks - (z).ticks)
>
> +#define INSTR_TIME_CMP_LT(x,y) \
> + ((x).ticks > (y).ticks)
>
> #define INSTR_TIME_GET_DOUBLE(t) \
> ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
> --
> 2.47.1

Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.

> From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
> Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This dual use of the struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time,
> WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
> outside of per-node instrumentation. Introduce the NodeInstrumentation
> struct to carry forward the per-node instrumentation information.

> @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
> */
> oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
>
> - /*
> - * Make sure stats accumulation is done. (Note: it's okay if several
> - * levels of hook all do this.)
> - */
> - InstrEndLoop(queryDesc->totaltime);
> -
> /* Log plan if duration is exceeded. */
> msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
> if (msec >= auto_explain_log_min_duration)

Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.

> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
> typedef struct Instrumentation
> +{
> + /* Parameters set at creation: */
> + bool need_timer; /* true if we need timer data */
> + bool need_bufusage; /* true if we need buffer usage data */
> + bool need_walusage; /* true if we need WAL usage data */
> + /* Internal state keeping: */
> + instr_time starttime; /* start time of last InstrStart */
> + BufferUsage bufusage_start; /* buffer usage at start */
> + WalUsage walusage_start; /* WAL usage at start */
> + /* Accumulated statistics: */
> + instr_time total; /* total runtime */
> + double ntuples; /* total tuples counted in InstrStop */
> + BufferUsage bufusage; /* total buffer usage */
> + WalUsage walusage; /* total WAL usage */
> +} Instrumentation;

Maybe add a comment explaining why ntuples is in here?

> From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage

Could use a commit message :)

> ---
> .../pg_stat_statements/pg_stat_statements.c | 4 +-
> src/backend/commands/explain.c | 8 +-
> src/backend/commands/trigger.c | 4 +-
> src/backend/executor/execMain.c | 25 ++-
> src/backend/executor/execProcnode.c | 29 +++
> src/backend/executor/instrument.c | 199 ++++++++++++++----
> src/include/executor/executor.h | 1 +
> src/include/executor/instrument.h | 53 ++++-
> 8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> PGSS_EXEC,
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> queryDesc->estate->es_total_processed,
> - &queryDesc->totaltime->bufusage,
> - &queryDesc->totaltime->walusage,
> + &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> + &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?

> @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0(n * sizeof(ExprState *));
> if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
> + {
> + if ((instrument_options & INSTRUMENT_TIMER) != 0)
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
> + else
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
> + }
> }
> else
> {

I'd not duplicate the InstrAlloc(), but compute the flags separately.

> /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
> return false;
> }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.

> +static bool
> +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> + if (node == NULL)
> + return false;
> +
> + check_stack_depth();
> +
> + planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);

There already is a check_stack_depth() in planstate_tree_walker().

> + if (node->instrument && node->instrument->stack.previous)
> + InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> + return false;
> +}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and adding a
comment.

> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> + .name = "instrumentation",
> + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .release_priority = RELEASE_PRIO_FIRST,
> + .ReleaseResource = ResOwnerReleaseInstrumentation,
> + .DebugPrint = NULL, /* default message is fine */
> +};

Is there a reason to do the release here before the lock release? And why
_FIRST?

> +static void
> +ResOwnerReleaseInstrumentation(Datum res)
> +{
> + Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
> +
> + /*
> + * Because registered resources are *not* called in reverse order, we'll
> + * get what was first registered first at shutdown. Thus, on any later
> + * resources we need to not change the stack, which was already set to the
> + * correct previous entry.
> + */

FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.

> + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> + pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right? It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?

> + /*
> + * Always accumulate all collected stats before the abort, even if we
> + * already walked up the stack with an earlier resource.
> + */
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...

> /* General purpose instrumentation handling */
> Instrumentation *
> InstrAlloc(int n, int instrument_options)
> {
> - Instrumentation *instr;
> + Instrumentation *instr = NULL;
> + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> + int i;
> +
> + /*
> + * If resource owner will be used, we must allocate in the transaction
> + * context (not the calling context, usually a lower context), because the
> + * memory might otherwise be freed too early in an abort situation.
> + */
> + if (need_buffers || need_wal)
> + instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> + else
> + instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?

> - /* initialize all fields to zeroes, then modify as needed */
> - instr = palloc0(n * sizeof(Instrumentation));
> - if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
> + for (i = 0; i < n; i++)
> {
> - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> - bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> - int i;
> -
> - for (i = 0; i < n; i++)
> - {
> - instr[i].need_bufusage = need_buffers;
> - instr[i].need_walusage = need_wal;
> - instr[i].need_timer = need_timer;
> - }
> + instr[i].need_bufusage = need_buffers;
> + instr[i].need_walusage = need_wal;
> + instr[i].need_timer = need_timer;
> }
>
> return instr;
> }
> +
> void
> InstrStart(Instrumentation *instr)
> {
> + Assert(!instr->finalized);
> +
> if (instr->need_timer &&
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStart called twice in a row");
>
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> -
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPushStackResource(instr);
> }
> +
> void
> -InstrStop(Instrumentation *instr, double nTuples)
> +InstrStop(Instrumentation *instr, double nTuples, bool finalize)
> {
> instr_time endtime;
>
> @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPopStackResource(instr);
>
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (finalize)
> + {
> + instr->finalized = true;
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);
> + }
> }
>
> /* Allocate new node instrumentation structure(s) */
> @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStartNode called twice in a row");
>
> - /* save buffer usage totals at node entry, if needed */
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(pgInstrStack != NULL);
>
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + instr->stack.previous = pgInstrStack;
> + pgInstrStack = &instr->stack;
> + }
> }
>
> /* Exit from a plan node */
> @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> -
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(instr->stack.previous != NULL);
> + pgInstrStack = instr->stack.previous;
> + }
>
> /* Is this the first tuple of this cycle? */
> if (!instr->running)
> @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>
> /* Add delta of buffer usage since entry to node's totals */
> if (dst->need_bufusage)
> - BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
>
> if (dst->need_walusage)
> - WalUsageAdd(&dst->walusage, &add->walusage);
> + WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
> +}
> +
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> + Assert(dst != NULL);
> + Assert(add != NULL);
> +
> + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + WalUsageAdd(&dst->walusage, &add->walusage);
> }
>

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...

> /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> void
> InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> {
> + if (pgInstrStack != NULL)
> + {
> + InstrStack *dst = pgInstrStack;
> +
> + BufferUsageAdd(&dst->bufusage, bufusage);
> + WalUsageAdd(&dst->walusage, walusage);
> + }
> +
> BufferUsageAdd(&pgBufferUsage, bufusage);
> WalUsageAdd(&pgWalUsage, walusage);
> }

Is the pgInstrStack == NULL case actually reachable?

> diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
> index 78d3653997b..d04607ce40c 100644
> --- a/src/include/executor/instrument.h
> +++ b/src/include/executor/instrument.h
> @@ -14,6 +14,7 @@
> #define INSTRUMENT_H
>
> #include "portability/instr_time.h"
> +#include "utils/resowner.h"

I'd probably not include resowner here but just forward declare the typedef.

> @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void BufferUsageAccumDiff(BufferUsage *dst,
> const BufferUsage *add, const BufferUsage *sub);
> +extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
> extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
> const WalUsage *sub);
>
> +#define INSTR_GET_BUFUSAGE(instr) \
> + instr->stack.bufusage
> +
> +#define INSTR_GET_WALUSAGE(instr) \
> + instr->stack.walusage

Not convinced that having these macros is worthwhile.

At this point I reached return -ENEEDCOFFEE :)

Greetings,

Andres Freund

In response to

Re: Stack-based tracking of per-node WAL/buffer usage at 2025-10-22 11:28:24 from Lukas Fittl

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	BharatDB	2025-10-22 12:59:25	Re: [PATCH] Fix pg_rewind false positives caused by shutdown-only WAL
Previous Message	Jelte Fennema-Nio	2025-10-22 12:44:15	Re: RFC: adding pytest as a supported test framework