Re: Stack-based tracking of per-node WAL/buffer usage

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Peter Smith <smithpb2250(at)gmail(dot)com>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: 2026-04-06 09:26:20
Message-ID: CAP53Pkw1KUej7FywhXr0tH76mJRSVeysVcGxzFNdZCoVcYpMNQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 5, 2026 at 5:31 AM Lukas Fittl <lukas(at)fittl(dot)com> wrote:
>
> On Sat, Apr 4, 2026 at 12:39 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> > > estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> > > estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> > > estate->es_top_eflags = eflags;
> > > - estate->es_instrument = queryDesc->instrument_options;
> > > estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
> > >
> > > + /*
> > > + * Set up query-level instrumentation if needed. We do this before
> > > + * InitPlan so that node and trigger instrumentation can be allocated
> > > + * within the query's dedicated instrumentation memory context.
> > > + */
> > > + if (!queryDesc->totaltime && queryDesc->instrument_options)
> > > + {
> > > + queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> > > + estate->es_instrument = queryDesc->totaltime;
> > > + }
> > > +
> > > /*
> > > * Set up an AFTER-trigger statement context, unless told not to, or
> > > * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
> >
> > It seems pretty weird to still have queryDesc->totaltime *sometimes* created
> > by pgss etc, but also create it in standard_ExecutorStart if not already
> > created. What if the explain options aren't compatible? Sure
> > pgss/auto_explain use ALL, but that's not a given.
>
> Yeah, I think in practice all use cases I've ever seen pass
> INSTRUMENT_ALL (and in fact it won't behave sane if this differs
> between extensions), but you're right there is no guarantee.
>
> Overall, there are two aspects to this:
>
> 1) Query instrumentation as the parent for node instrumentation,
> driven by use of EXPLAIN or auto_explain setting
> queryDesc->instrument_options
>
> 2) Instrumentation as a mechanism to measure the activity of a query,
> as used by pg_stat_statements or auto_explain (to get the runtime /
> aggregate buffer usage)
>
> I could see two solutions:
>
> A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get
> es_instrument, any extensions measuring aggregate activity get
> query->totaltime)
>
> B) Have one internal QueryInstrumentation (that's responsible to be
> the abort "parent" to both node instrumentation, and query->totaltime)
>
> I was initially thinking we could maybe combine them creatively (i.e.
> expand on what we've done so far), but I'm not sure there is a
> reasonable design that isn't convoluted. We could also have a way for
> extensions to "request" a certain level of instrumentation (instead of
> directly allocating it), but it seems the current hooks are
> insufficient for that.
>
> I've gone with solution (A) for now, with es_instrument being
> allocated when per-node instrumentation is needed. Obviously that gets
> us two ResOwner cleanups instead of one when e.g. auto_explain is
> active, but I think that's still acceptable. It also shows how its
> easy to do an extra level of nesting with the stack-based
> instrumentation, without too much expense.
>
> With this in place, I do wonder if we should avoid the full memory
> context setup in InstrQueryAlloc (i.e. instead just make a direct
> allocation), unless we know that children are going to be attached.
> The downside of that would be that we can't just re-assign the
> instr_cxt in InstrQueryStopFinalize (we'd have to go back to the
> previous logic of doing a memcpy into the callers context, for the
> no-children case), but it might make a notable performance difference?

I've done a stress test of the logic I had added here in v13 (two
separate QueryInstrumentations to not mess with query->totaltime),
specifically "pgbench -n -j 32 -c 32 -f select1.sql -T 60 postgres"
with auto_explain enabled with all log_* settings (so its both
exercising query->totaltime and instrument_options), and unfortunately
that showed about a 1 to 2% impact.

So I don't think this was the right direction. I'll go back to what I
had before, but fix the specific issue you pointed out when
instrumentation options differ. Specifically, I'll add a preparatory
patch to stop extensions from allocating queryDesc->totaltime
themselves, and add queryDesc->totaltime_options that they use to
request which level of totaltime instrumentation they need.

If they request less than INSTRUMENT_ALL, they might still get more
instrumentation actually collected, when the query in question is an
EXPLAIN (ANALYZE). But since they don't have to read those fields from
query->totaltime, I think that's acceptable.

>
> >
> >
> > > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> > > palloc0_array(FmgrInfo, n);
> > > resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> > > palloc0_array(ExprState *, n);
> > > - if (instrument_options)
> > > - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > > + if (qinstr)
> > > + resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
> >
> > Hm. Why do we not need to pass down the instrument_options anymore? I guess
> > the assumption is that we always are going to use the flags from qinstr?
> >
> > Is that right? Because right now pgss/auto_explain use _ALL, even when an
> > EXPLAIN ANALYZE doesn't.
> >
>
> With the solution mentioned earlier, where es_instrument is a separate
> allocation, this problem now goes away without any extra changes
> needed.
>
> Overall, I think its reasonable to make node/trigger instrumentation
> be attached to a query instrumentation that has the instrumentation
> options set that should be applied. That way we don't have think about
> edge cases like a query instrumentation that doesn't need a stack, but
> children that do.

Because we now again have a query->totaltime that may have more
instrumentation_options than the per-node/per-trigger instrumentation
need, we need to explicitly pass the instrumentation options that were
requested.

To support that I'll bring back "es_instrument" with its prior
meaning, and instead add "es_query_instr" to pass down the query
instrumentation to the trigger instrumentation calls.

> >
> > > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> > > From: Lukas Fittl <lukas(at)fittl(dot)com>
> > > Date: Sat, 7 Mar 2026 17:52:24 -0800
> > > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> > > instructions by inlining
> > >
> > > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> > > ExecProcNodeInstr when starting/stopping instrumentation for that node.
> > >
> > > Previously each ExecProcNodeInstr would check which instrumentation
> > > options are active in the InstrStartNode/InstrStopNode calls, and do the
> > > corresponding work (timers, instrumentation stack, etc.). These
> > > conditionals being checked for each tuple being emitted add up, and cause
> > > non-optimal set of instructions to be generated by the compiler.
> > >
> > > Because we already have an existing mechanism to specify a function
> > > pointer when instrumentation is enabled, we can instead create specialized
> > > functions that are tailored to the instrumentation options enabled, and
> > > avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
> > > the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
> > > test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
> > > top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
> > > from ~ 20% to ~ 10% on top of actual runtime.
> >
> > I assume this is to a significant degree due to to allowing for inlining. Have
> > you checked how much of the effort you get by just putting ExecProcNodeInstr()
> > into instrument.c?
>
> Worth a try - I haven't tested that yet - I'll come back to this
> separately and verify how much that buys us, vs spelling out the
> different variants.

I've run a test of just putting ExecProcNodeInstr into instrument.c
(and adding an inline keyword to the functions it calls), and it does
help over not doing it at all, but its not the full experience:

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

EXPLAIN (ANALYZE, ...) SELECT count(*) FROM lotsarows;

Below measurements are best out of three, for these three versions:
(1) with stack only
(2) with stack + move ExecProcNodeInstr with no changes (your idea)
(3) with stack + move + avoid branches (current patch set)

BUFFERS OFF, TIMING OFF:
(1): 309ms
(2): 292ms
(3): 283ms

BUFFERS ON, TIMING OFF:
(1): 322ms
(2): 314ms
(3): 294ms

BUFFERS ON, TIMING ON:
(1): 829ms
(2): 814ms
(3): 803ms

I suspect the discrepancy for BUFFERS in particular is because the
commit has an optimized form of the stack popping (InstrPopStackTo),
but I have not taken a close look at the assembly differences here.
For now I'll keep this as-is, but that can be changed quickly.

Thanks,
Lukas

--
Lukas Fittl

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2026-04-06 09:37:01 Re: Proposal to allow setting cursor options on Portals
Previous Message vignesh C 2026-04-06 09:15:31 Re: Adding REPACK [concurrently]