Re: Stack-based tracking of per-node WAL/buffer usage

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: 2025-10-31 07:18:04
Message-ID: CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

Thanks for the detailed review!

Attached v4 patchset that addresses feedback (unless otherwise noted below)
and is rebased on master. Other changes:

- Ensured each patch is individually pgindent clean (and compiles)
- Refactored 0003 a bit to consistently use InstrPushStack/InstrPopStack
helpers for modifying the active stack entry
- Building on that refactoring, merged v3/0006 "Introduce alternate
Instrumentation stack mechanism relying on PG_FINALLY" into the main commit
introducing the stack mechanism (the "alternate" mechanism is just using
these helpers directly and making sure InstrPopStack is called via
PG_FINALLY, instead of using resource owners)
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
illustrates how the stack mechanism can be used to separate index and table
buffer accesses in the EXPLAIN for Index Scans

On Wed, Oct 22, 2025 at 5:59 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

> > +/*
> > + * General purpose instrumentation that can capture time, WAL/buffer
> usage and tuples
> > + *
> > + * Initialized through InstrAlloc, followed by one or more calls to a
> pair of
> > + * InstrStart/InstrStop (activity is measured inbetween).
> > + */
> > typedef struct Instrumentation
> > +{
> > + /* Parameters set at creation: */
> > + bool need_timer; /* true if we need timer
> data */
> > + bool need_bufusage; /* true if we need buffer usage
> data */
> > + bool need_walusage; /* true if we need WAL usage data
> */
> > + /* Internal state keeping: */
> > + instr_time starttime; /* start time of last
> InstrStart */
> > + BufferUsage bufusage_start; /* buffer usage at start */
> > + WalUsage walusage_start; /* WAL usage at start */
> > + /* Accumulated statistics: */
> > + instr_time total; /* total runtime */
> > + double ntuples; /* total tuples counted in
> InstrStop */
> > + BufferUsage bufusage; /* total buffer usage */
> > + WalUsage walusage; /* total WAL usage */
> > +} Instrumentation;
>
> Maybe add a comment explaining why ntuples is in here?
>

After thinking about this some more, I'd think we should just go ahead and
special case trigger instrumentation, and specifically count firings of the
trigger (which was counted in "ntuples" before).

I've adjusted the 0002 patch accordingly to split out both node and trigger
instrumentation.

>
>
>
> > ---
> > .../pg_stat_statements/pg_stat_statements.c | 4 +-
> > src/backend/commands/explain.c | 8 +-
> > src/backend/commands/trigger.c | 4 +-
> > src/backend/executor/execMain.c | 25 ++-
> > src/backend/executor/execProcnode.c | 29 +++
> > src/backend/executor/instrument.c | 199 ++++++++++++++----
> > src/include/executor/executor.h | 1 +
> > src/include/executor/instrument.h | 53 ++++-
> > 8 files changed, 260 insertions(+), 63 deletions(-)
> >
> > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
> b/contrib/pg_stat_statements/pg_stat_statements.c
> > index f43a33b3787..eeabd820d8e 100644
> > --- a/contrib/pg_stat_statements/pg_stat_statements.c
> > +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> > @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> > PGSS_EXEC,
> >
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> > queryDesc->estate->es_total_processed,
> > - &queryDesc->totaltime->bufusage,
> > - &queryDesc->totaltime->walusage,
> > +
> &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> > +
> &INSTR_GET_WALUSAGE(queryDesc->totaltime),
>
> Getting a pointer to something returned by a macro is a bit ugly... Perhaps
> it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
> easily possible given how pgss_planner() currently tracks things :(
>
> Maybe it's worth refactoring this a bit in a precursor patch?
>

I've dropped the macro (its just one additional indirection after all) - I
think we could refactor this further (i.e. pass the stack), but that
doesn't seem strictly necessary.

> /*
> ------------------------------------------------------------------------
> > @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void
> *context)
> > return false;
> > }
> >
> > +/*
> > + * ExecAccumNodeInstrumentation
> > + *
> > + * Accumulate instrumentation stats from all execution nodes to their
> respective
> > + * parents (or the original parent instrumentation stack).
> > + */
> > +void
> > +ExecAccumNodeInstrumentation(PlanState *node)
> > +{
> > + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> > +}
>
> I wonder if this is too narrow a name. There might be other uses of a pass
> across the node tree at that point. OTOH, it's probably better to just
> rename
> it at that later point.
>

Yeah, I can't think of a better name, so I've left this the same for now.

>
> > + if (node->instrument && node->instrument->stack.previous)
> > + InstrStackAdd(node->instrument->stack.previous,
> &node->instrument->stack);
> > +
> > + return false;
> > +}
>
> E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then
> also
> have special handling for a few node types. Do we need something like that
> here too? It probably is ok, but it's worth explicitly checking and
> adding a
> comment.
>

I went through and double checked - we don't need to special case these
node types in my understanding, and I couldn't see any specific cases where
intermediary nodes would be removed in shutdown either (o.e.
causing dropped stats since a node was removed with its associated stats
not being added to the parent yet).

I added a comment to make it clear this must run after ExecShutdownNode.

> > +/*
> > + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> > + */
> > +static void ResOwnerReleaseInstrumentation(Datum res);
> > +static const ResourceOwnerDesc instrumentation_resowner_desc =
> > +{
> > + .name = "instrumentation",
> > + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> > + .release_priority = RELEASE_PRIO_FIRST,
> > + .ReleaseResource = ResOwnerReleaseInstrumentation,
> > + .DebugPrint = NULL, /* default message is fine
> */
> > +};
>
> Is there a reason to do the release here before the lock release? And why
> _FIRST?
>

Adjusted to have its own phase, and after lock release.

> > + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> > + pgInstrStack = instr->stack.previous;
>
> Hm - this is effectively O(stack-depth^2), right? It's probably fine,
> given
> that we have fairly limited nesting (explain + pg_stat_statements +
> auto_explain is probably the current max), but seems worth noting in a
> comment?
>

Yeah, I added a comment - I don't see a case where this is a bottleneck
today given the limited nesting of resource owner using stacks.

> > + /*
> > + * Always accumulate all collected stats before the abort, even if
> we
> > + * already walked up the stack with an earlier resource.
> > + */
> > + if (pgInstrStack)
> > + InstrStackAdd(pgInstrStack, &instr->stack);
>
> Why are we accumulating stats in case of errors? It's probably fine, but
> doing
> less as part of cleanup is pre ferrable...
>

In my understanding, we need to do this in case of functions called in a
query that catch a rollback/error, since we'd otherwise not account for
that function's activity as part of the top-level query.

> /* General purpose instrumentation handling */
> > Instrumentation *
> > InstrAlloc(int n, int instrument_options)
> > {
> ...
> > + if (need_buffers || need_wal)
> > + instr = MemoryContextAllocZero(CurTransactionContext, n *
> sizeof(Instrumentation));
> > + else
> > + instr = palloc0(n * sizeof(Instrumentation));
>
> Is this long-lived enough? I'm e.g. wondering about utility statements that
> internally starting transactions, wouldn't that cause problems with a user
> like pgss tracking something like CIC?
>

I ended up refactoring this a bit, since it seemed useful to do an explicit
pfree at InstrStop or when aborting, both to avoid leaks, and to
theoretically support using TopMemoryContext here.

That said, from my testing I think CurTransactionContext is sufficient,
because we just need something that lives long enough during resource owner
abort situations (e.g. per-query context doesn't work, since the abort
frees it before we do our resource owner handling).

The pgss+CIC case isn't relevant here (I think), because utility statements
don't use the resource owner mechanism at all (with the exception of
EXPLAIN which calls into ExecutorStart), instead we use a PG_TRY/PG_FINALLY
in pg_stat_statements to pop the stack.

> > +void
> > +InstrStackAdd(InstrStack * dst, InstrStack * add)
> > +{
> > + Assert(dst != NULL);
> > + Assert(add != NULL);
> > +
> > + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> > + WalUsageAdd(&dst->walusage, &add->walusage);
> > }
> >
>
> Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
> usage? Those operations aren't cheap...
>

I briefly considered whether we could add this to the InstrStack itself
(i.e. whether we actually care about buffer, WAL usage or both), but I
think where it gets messy is that we can have indirect requirements to
track this - you might have pg_stat_statements capturing both, but e.g. the
utility statement being executed only caring about emitting WAL usage in
the log.

I'm also not familiar with an in-core use case today where only WAL (but
not buffers) is needed, short of doing something like "EXPLAIN (ANALYZE,
BUFFERS OFF, WAL ON)" without having pg_stat_statements or similar enabled.
Do you have a specific example where this could help?

> /* note current values during parallel executor startup */
> > @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage,
> WalUsage *walusage)
> > void
> > InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> > {
> > + if (pgInstrStack != NULL)
> > + {
> > + InstrStack *dst = pgInstrStack;
> > +
> > + BufferUsageAdd(&dst->bufusage, bufusage);
> > + WalUsageAdd(&dst->walusage, walusage);
> > + }
> > +
> > BufferUsageAdd(&pgBufferUsage, bufusage);
> > WalUsageAdd(&pgWalUsage, walusage);
> > }
>
> Is the pgInstrStack == NULL case actually reachable?
>

In my reading of the code, its necessary because we unconditionally track
WAL/buffer usage in parallel workers, even if the leader doesn't actually
need it.

We could be smarter about this (i.e. tell the workers not to collect the
information in the first place), but for now it seemed easiest to just
discard it.

Thanks,
Lukas

--
Lukas Fittl

Attachment Content-Type Size
v4-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch application/octet-stream 9.1 KB
v4-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch application/octet-stream 8.1 KB
v4-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch application/octet-stream 9.4 KB
v4-0002-Separate-node-and-trigger-instrumentation-from-ot.patch application/octet-stream 25.7 KB
v4-0004-Optimize-measuring-WAL-buffer-usage-through-stack.patch application/octet-stream 29.6 KB
v4-0006-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch application/octet-stream 15.9 KB
v4-0007-Index-scans-Split-heap-and-index-buffer-access-re.patch application/octet-stream 12.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Quan Zongliang 2025-10-31 07:27:15 Re: [PATCH] Add pg_get_type_ddl() to retrieve the CREATE TYPE statement
Previous Message Nazir Bilal Yavuz 2025-10-31 06:58:03 Re: Update Windows CI Task Names: Server 2022 + VS 2022 Upgrade