Re: Stack-based tracking of per-node WAL/buffer usage

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>, Peter Smith <smithpb2250(at)gmail(dot)com>, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: 2026-04-07 22:27:45
Message-ID: CAP53PkzpJ39HFRGaMR7n5kSsNuqui3TJ3Gw_FU7k_qjdBvT6hg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 7, 2026 at 3:19 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote:
> > 0001 is the change to make queryDesc->totaltime be allocated by
> > ExecutorStart instead of plugins themselves, and adds a
> > queryDesc->totaltime_options to have plugins request which level of
> > summary instrumentation they need. This change is pretty simple, and
> > could still make sense to get into 19. Because of the earlier
> > Instrumentation refactoring that was pushed (thanks!) we're already
> > asking extensions allocating queryDesc->totaltime to modify their use
> > of InstrAlloc, so I think we might as well clean this up now.
>
> Hm. That's a fair argument. They indeed would have to again change next
> release
>
> It's not a complicated change and removes more lines than it adds.
>
> I guess one thing I'm not sure is whether the fields shouldn't be renamed at
> the same time:
>
> a) To prevent extensions from continuing to set it, most of them do not test
> against assert enabled builds. With a different name they would get a
> compiler error.
>
> b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
> query level statistics. If everyone has to change anyway, this is a good
> occasion.
>
> 'query_instr[_options]'?
>
>
> Any opinions?

I think renaming makes sense - both to make sure extensions reconsider
how they use it, and because "totaltime" is a bad name anyway, because
its not just about timing (and hasn't been for many releases).

"query_instr[_options]" seems reasonable to me, although we could drop
the "query_" since it'd be "queryDesc->query_instr" vs
"queryDesc->instr".

>
> > 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> > suggested previously. We still get some quick performance wins from
> > doing that (see end of email), and again, its a simple change, so
> > could be considered if someone has bandwidth remaining. I've added a
> > later patch that then does the more complex inlining and gets us the
> > full speed up.
>
> Here it needs a few more inlines to get the full performance, otherwise it
> doesn't inline all the helpers. I think on balance I didn't like the
> prototype in instrument.h, that's too widely included, and it might even cause
> some circularity issues. It seems better to do the somewhat ugly thing and
> have the prototype be in executor.h.

Yeah, that makes sense.

>
> > 0002 measurements (with current master and TSC clock source used for
> > timing, best of three):
> >
> > CREATE TABLE lotsarows(key int not null);
> > INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> > VACUUM FREEZE lotsarows;
>
> With the somewhat more extreme benchmark I used in the rdtsc thread and the
> added inline mentioned above I see a bit bigger wins. See the attached
> explainbench.sql - it doesn't quite cover all the combinations, but I think it
> gives a good enough overview.
>
> c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
>
> master:
> statement latencies in milliseconds and failures:
> 200.800 0 SELECT pg_prewarm('pgbench_accounts');
> 0.098 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
> 212.010 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
> 268.648 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
> 232.421 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
> 283.531 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
> 0.030 0 DEALLOCATE query;
>
>
> 0002:
>
> statement latencies in milliseconds and failures:
> 201.558 0 SELECT pg_prewarm('pgbench_accounts');
> 0.103 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
> 188.696 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
> 244.479 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
> 223.773 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
> 266.947 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
> 0.034 0 DEALLOCATE query;
>
> That's something like 4-12%.
>
> Pretty nice for a patch that just adds a few lines around and adds a few
> inlines.

Agreed.

> > @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >
> > if (auto_explain_enabled())
> > {
> > + /* We're always interested in runtime */
> > + queryDesc->totaltime_options |= INSTRUMENT_TIMER;
>
> > - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
>
> Not that it's going to make a significant difference, but it is nice that this
> now would need to track less.

Yup.

>
> Kinda wonder about having
> EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
>
> in plenty cases that'd be all one needs, at substantially lower cost.

True. I don't like the name "totals_only", but I like the concept.
Today someone has to go to pg_stat_statements to get just the total
numbers, without running them for all nodes with EXPLAIN ANALYZE (and
incurring its overhead).

Thanks,
Lukas

--
Lukas Fittl

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2026-04-07 22:51:39 Re: Optimization of the is_normalized() function.
Previous Message Andreas Karlsson 2026-04-07 22:26:05 Re: updates for handling optional argument in system functions