Re: Stack-based tracking of per-node WAL/buffer usage

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Smith <smithpb2250(at)gmail(dot)com>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: 2026-03-24 06:03:16
Message-ID: CAP53PkznofNg+ii363QQGoje30nhssuSz_hV5U4YANAt-Yr_Yg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 23, 2026 at 1:03 PM Lukas Fittl <lukas(at)fittl(dot)com> wrote:
> FWIW, on the topic of resource owners and allocations, I've done a
> test over the weekend, and here is a question:
>
> It seems we could switch the Instrumentation allocations we're doing
> when inside a portal to PortalContext, and CurrentMemoryContext when
> outside a portal - instead of allocating in
> TopMemoryContext/TopTransactionContext. That works in practice,
> because resource owner cleanup happens before PortalContext cleanup,
> and simplifies the code a bit since we can skip copying into the
> current memory context (because the caller wants to be able to use the
> result after the finalize call). And if we leak we'd only leak until
> PortalContext gets cleaned up, instead of TopMemoryContext.
>
> To expand on that, in the previously posted v9 we have the following
> allocations:
>
> A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
> B) QueryInstrumentation allocated under TopMemoryContext (short-lived
> during query execution, explicitly freed up on abort or finalize call)
> C) NodeInstrumentation allocated under TopTransactionContext
> (short-lived during query execution, explicitly freed up on abort or
> finalize call)
> D) In other use cases, e.g. ANALYZE command that logs buffer usage,
> QueryInstrumentation allocated under TopMemoryContext (short-lived
> during command execution, explicitly freed up on abort or finalize
> call)
>
> And we could switch it instead to:
>
> A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
> B) QueryInstrumentation allocated under PortalContext (short-lived
> during query execution, *automatically* freed up on abort, manually on
> ExecutorEnd to avoid waiting for holdable cursors to free
> PortalContext)
> C) NodeInstrumentation allocated under PortalContext (short-lived
> during query execution, *automatically* freed up on abort, manually on
> ExecutorEnd to avoid waiting for holdable cursors to free
> PortalContext)
> D) In other use cases, e.g. ANALYZE command that logs buffer usage,
> QueryInstrumentation allocated under CurrentMemoryContext (short-lived
> during command execution, *automatically* freed up on abort and
> success case)
>
> However, this goes against the principle noted by Heikki over in [0]
> that ResOwners should use TopMemoryContext to avoid relying on the
> ordering of clean up operations.

I've pondered this question more today, and I no longer think this
extra complexity is the right way to approach the problem.

Instead I've tried introducing a dedicated memory context for
instrumentation, managed through a resource owner, and I am convinced
(at least for now) that this is the right trade-off for the problem at
hand.

The benefit of using our own memory context is that we can free it all
at once (which is a lot less brittle when different types of
instrumentation are involved), *and* we can re-assign the context
parent to be that of the current context on finalize, cleanly moving
it out of TopMemoryContext without doing a copy. It also makes it
easier for callers to allocate in the right context, without having to
introduce a bunch more "Alloc" methods (e.g. relevant for the table
stack tracking for index scans). We also have precedent for the use
of small memory contexts in the executor, in the form of the per-tuple
memory contexts.
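
To make the "free it all at once" and "re-assign the parent instead of
copying" points concrete, here is a toy model in plain C. It is not
PostgreSQL code and not taken from the patch: ToyContext, ToyChunk and
all toy_* functions are hypothetical stand-ins for a memory context
tree. (PostgreSQL does have MemoryContextSetParent() for re-parenting
a context; whether v10 uses it is not stated in this mail.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Chunk header; the union keeps the payload that follows it aligned. */
typedef union ToyChunk
{
    union ToyChunk *next;
    max_align_t align;
} ToyChunk;

/* Toy stand-in for a memory context: a tree node that owns chunks. */
typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
    ToyChunk   *chunks;
} ToyContext;

/* Detach cxt from its parent's child list (no-op for a root). */
static void
toy_unlink(ToyContext *cxt)
{
    ToyContext **link;

    if (cxt->parent == NULL)
        return;
    for (link = &cxt->parent->first_child; *link != cxt;
         link = &(*link)->next_sibling)
        ;
    *link = cxt->next_sibling;
    cxt->parent = NULL;
    cxt->next_sibling = NULL;
}

/* Move cxt (and everything allocated in it) under new_parent. */
static void
toy_set_parent(ToyContext *cxt, ToyContext *new_parent)
{
    toy_unlink(cxt);
    cxt->parent = new_parent;
    if (new_parent != NULL)
    {
        cxt->next_sibling = new_parent->first_child;
        new_parent->first_child = cxt;
    }
}

static ToyContext *
toy_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    assert(cxt != NULL);
    toy_set_parent(cxt, parent);
    return cxt;
}

/* Allocate size bytes that live exactly as long as cxt does. */
static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunk   *chunk = malloc(sizeof(ToyChunk) + size);

    assert(chunk != NULL);
    chunk->next = cxt->chunks;
    cxt->chunks = chunk;
    return chunk + 1;
}

/* Free cxt, its chunks and all descendants; returns chunks freed. */
static int
toy_delete(ToyContext *cxt)
{
    int         freed = 0;

    while (cxt->first_child != NULL)
        freed += toy_delete(cxt->first_child);
    toy_unlink(cxt);
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
        freed++;
    }
    free(cxt);
    return freed;
}
```

In this model, the abort path would call toy_delete() on the
instrumentation context, releasing every allocation in one step no
matter which kind of instrumentation made it. The finalize path would
call toy_set_parent() to hang the context under the caller's context:
existing pointers stay valid, nothing is copied, and the memory then
goes away together with the caller's context.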

The main downside is that for the cases where we don't have child
instrumentation, but want the resource owner logic (e.g. ANALYZE
command, or regular query execution with pg_stat_statements enabled),
we have more memory overhead: 1kB (ALLOCSET_SMALL_SIZES minimum) for
what could otherwise be ~200B. I think that's probably okay for
current use cases, but we could avoid that by only using the separate
contexts when we have child instrumentations that will be tracked.

See attached v10, rebased, with these additional changes:

In 0001/0002 I've added forward declarations in execnodes.h, which are
necessary since fba4233c8328.

In 0005 (stack-based instrumentation) I've also addressed the
previously raised concerns about trigger and EXPLAIN (SERIALIZE)
handling, and it now treats both kinds as children of the query's
instrumentation context. To assist with initializing that, we have to
add a query instrumentation reference to EState, but I think that's
acceptable. To reduce code churn I've repurposed the existing
es_instrument field for that, and we now remember the instrumentation
options on QueryInstrumentation.

In 0007 (Optimize ExecProcNodeInstr instructions by inlining) I've
adjusted the ExecProcNodeInstr logic to use a single function that
contains the logic, with separate callers that pass in fixed
constants, to let the compiler figure out the different variants with
less code duplication, per an off-list suggestion from Andres.
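
The general technique described for 0007 can be sketched in isolation
as follows. This is a toy model, not the patch: ToyNode, toy_clock,
the need_timer/need_rows flags and all toy_* functions are hypothetical
names chosen for illustration. The point is that one inline function
holds the logic, and each thin wrapper passes compile-time constants,
so the compiler can drop the branches a given variant does not need.

```c
#include <stdbool.h>

typedef struct ToyNode
{
    int         remaining;      /* tuples still to be produced */
    long        ntuples;        /* tuples counted so far */
    long        timer_ticks;    /* fake elapsed time */
} ToyNode;

static long toy_clock = 0;      /* fake monotonic clock */

/* The "real work": produce one tuple, or report end of data. */
static bool
toy_produce(ToyNode *node)
{
    toy_clock += 3;
    if (node->remaining == 0)
        return false;
    node->remaining--;
    return true;
}

/*
 * Single implementation. Every caller below passes literal constants,
 * so after inlining each "if" on a flag can be folded away.
 */
static inline bool
toy_proc_node_impl(ToyNode *node, const bool need_timer,
                   const bool need_rows)
{
    long        start = 0;
    bool        got;

    if (need_timer)
        start = toy_clock;
    got = toy_produce(node);
    if (need_timer)
        node->timer_ticks += toy_clock - start;
    if (need_rows && got)
        node->ntuples++;
    return got;
}

/* One thin wrapper per supported combination of options. */
static bool
toy_proc_node_rows(ToyNode *node)
{
    return toy_proc_node_impl(node, false, true);
}

static bool
toy_proc_node_timer(ToyNode *node)
{
    return toy_proc_node_impl(node, true, false);
}

static bool
toy_proc_node_all(ToyNode *node)
{
    return toy_proc_node_impl(node, true, true);
}

typedef bool (*ToyProcNode) (ToyNode *node);

/* Pick the variant once, at setup time, instead of per tuple. */
static ToyProcNode
toy_choose_variant(bool need_timer, bool need_rows)
{
    if (need_timer && need_rows)
        return toy_proc_node_all;
    if (need_timer)
        return toy_proc_node_timer;
    return toy_proc_node_rows;
}
```

Plain "static inline" is only a hint to the compiler; a real
implementation would likely want a stronger guarantee (PostgreSQL has
the pg_attribute_always_inline macro for that), but the sketch sticks
to standard C.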

In 0008 (Index scans: Show table buffer accesses) this now utilizes
the fact that f026fbf059f2 made IndexScanInstrumentation a heap
allocation, and puts that allocation in the instrumentation memory
context, so it can participate directly in the stack with an inlined
Instrumentation field to track table access, avoiding the duplicate
field that was previously necessary.
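
As a rough illustration of what "participating directly in the stack
with an inlined field" can look like, here is a self-contained toy
model. It is an assumption-laden sketch, not the layout or API from
0008: ToyInstr, ToyIndexScanInstr, toy_stack_top and the toy_*
functions are all hypothetical, and real buffer accounting is reduced
to a single hit counter charged to whatever entry is on top of the
stack.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instrumentation entry that can be pushed onto a stack. */
typedef struct ToyInstr
{
    long        buffer_hits;
    struct ToyInstr *parent;    /* entry below this one while pushed */
} ToyInstr;

/* Toy per-scan struct with the table-access entry embedded by value. */
typedef struct ToyIndexScanInstr
{
    long        nsearches;
    ToyInstr    table_instr;    /* inlined, so no separate allocation */
} ToyIndexScanInstr;

static ToyInstr *toy_stack_top = NULL;

static void
toy_push(ToyInstr *instr)
{
    instr->parent = toy_stack_top;
    toy_stack_top = instr;
}

static void
toy_pop(ToyInstr *instr)
{
    assert(toy_stack_top == instr);     /* pops must mirror pushes */
    toy_stack_top = instr->parent;
    instr->parent = NULL;
}

/* Buffer access: charge only the entry currently on top. */
static void
toy_count_buffer_hit(void)
{
    if (toy_stack_top != NULL)
        toy_stack_top->buffer_hits++;
}

/* One index search: index pages go to the node, heap pages to the table. */
static void
toy_index_scan_once(ToyInstr *node_instr, ToyIndexScanInstr *scan,
                    int index_pages, int heap_pages)
{
    int         i;

    toy_push(node_instr);
    scan->nsearches++;
    for (i = 0; i < index_pages; i++)
        toy_count_buffer_hit();

    /* The embedded field is pushed like any other stack entry. */
    toy_push(&scan->table_instr);
    for (i = 0; i < heap_pages; i++)
        toy_count_buffer_hit();
    toy_pop(&scan->table_instr);

    toy_pop(node_instr);
}
```

Because the table entry lives inside the scan's own struct, its
lifetime follows that struct, and the table numbers can be reported
separately from the index numbers without keeping a second copy of
the counters elsewhere.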

Thanks,
Lukas

--
Lukas Fittl

Attachment                                                       Content-Type              Size
v10-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch  application/octet-stream  85.9 KB
v10-0001-instrumentation-Separate-trigger-logic-from-othe.patch  application/octet-stream  10.1 KB
v10-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch  application/octet-stream  9.9 KB
v10-0004-instrumentation-Add-additional-regression-tests-.patch  application/octet-stream  23.5 KB
v10-0002-instrumentation-Separate-per-node-logic-from-oth.patch  application/octet-stream  27.1 KB
v10-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch  application/octet-stream  11.2 KB
v10-0006-instrumentation-Use-Instrumentation-struct-for-p.patch  application/octet-stream  29.2 KB
v10-0009-Add-pg_session_buffer_usage-contrib-module.patch        application/octet-stream  29.3 KB
v10-0008-Index-scans-Show-table-buffer-accesses-separatel.patch  application/octet-stream  22.7 KB
