Re: Stack-based tracking of per-node WAL/buffer usage

From: Lukas Fittl <lukas(at)fittl(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: 2026-02-24 04:18:34
Message-ID: CAP53PkzGbyeJMLDAcvMRgzXPXYsYXZr3SBg0UwhfkYjqu8oK_g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached v6:

0001 addresses the issue that Peter raised and corrects the macro name
+ adds a missing comment.

0002/0003 are the same as in v5 and are preparatory commits that
should be in good shape to go.

0004 adds new regression tests that pass on both master and the
patched version, and stress test some of the nested instrumentation
handling.

0005 is what was previously 0003, and introduces the stack-based
instrumentation mechanism. I've made several changes here:

- Rename pgInstrStack => CurrentInstrStack
- Always initialize a top-level stack (TopInstrStack) to avoid
condition when writing to stack
- Rename ExecAccumNodeInstrumentation to
ExecFinalizeNodeInstrumentation, and skip it for parallel workers to
avoid double counting (covered by new test cases). The new
InstrFinalizeNode method that accompanies this now also copies the
per-node instrumentation into the per-query context when in the
regular non-abort case. I think this is necessary when we have a
long-running transaction with auto_explain enabled and many fast
statements, so we don't want to keep NodeInstrumentation until
transaction end (since each NodeInstrumentation is ~288 bytes).
- Accumulate unfinalized per-node stacks on abort (so we can rely on
accumulated totals to be correct) - note that this requires an
additional tree walk in the first ExecutorRun because we lazily
initialize the query->totaltime stack
- Moved the non-node instr stack allocations to TopMemoryContext -
after further testing and review I agree that's necessary because of
utility statements. The specific test case that showed that was
pgss_ProcessUtility and how it behaves with the new utility.sql test
case, as well as VACUUM commands in stream regress.
- Combined parallel worker changes into this commit (it didn't really
make sense to keep these separate based on some of the edge cases)
- Explicit handling of orphaned stack entries when we experience
out-of-order cleanup during abort
- Added a lot more comments (too many?) to explain different edge cases.

0006 could probably be merged into 0005, and is the existing
mechanical change to drop the pgBufferUsage global, but kept separate
for easier review for now.

0007 is the stack-mechanism for splitting out "Table Buffers" for
Index Scans. This has been changed to:

- Report combined index/table buffers as "Buffers" and only show
"Table Buffers" broken out. This was suggested off-list by Andres, and
I've realized this is also otherwise confusing when there are child
nodes below that accumulate up.
- Fix a bug relating to parallel query reporting where previously the
table stack contents would not be copied from parallel workers to the
leader.
- Add documentation and update EXPLAIN ANALYZE examples that reference
Index Scan nodes.

0008 adds a pg_session_buffer_usage() module for testing the global
counters. This helps to verify the handling of aborts behave sanely
(and can run on both master and patched), but I don't think we should
commit this.

---

Two questions on my mind:

1) Should we call this "instrumentation context" (InstrContext?)
instead of "instrumentation stack" (InstrStack)? When writing comments
I found the difference between "stack entry" and "stack" confusing at
times, and "context" feels a bit more clear as a term (e.g. as in
"CurrentInstrContext" or "CurrentInstrumentationContext").

2) For 0007, "Table Buffers" feels a bit inconsistent with "Heap
Fetches" and "Heap Blocks" used elsewhere, should we potentially use
"Heap Buffers", or change the existing names to "Table ..." instead?

And one follow-up on a prior note:

On Tue, Jan 13, 2026 at 12:01 AM Lukas Fittl <lukas(at)fittl(dot)com> wrote:
> Its also worth noting that 2defd0006255 just added a new
> "instrument_node.h". We could consider moving the per-node
> instrumentation structs and associated functions there (but probably
> keep a single "instrument.c" unit file?). Thoughts?

I've consider this again and think that's not a good idea, because
instrument_node.h seems specifically focused on parallel query - so
I've left all new methods/structs in instrument.h for now.

Thanks,
Lukas

--
Lukas Fittl

Attachment Content-Type Size
v6-0004-instrumentation-Add-additional-regression-tests-c.patch application/octet-stream 23.5 KB
v6-0003-instrumentation-Separate-per-node-and-trigger-log.patch application/octet-stream 27.5 KB
v6-0001-instrumentation-Rename-INSTR_TIME_LT-macro-to-INS.patch application/octet-stream 2.2 KB
v6-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch application/octet-stream 49.0 KB
v6-0002-instrumentation-Replace-direct-changes-of-pgBuffe.patch application/octet-stream 9.4 KB
v6-0006-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch application/octet-stream 15.5 KB
v6-0008-Add-pg_session_buffer_usage-contrib-module.patch application/octet-stream 25.6 KB
v6-0007-Index-scans-Show-table-buffer-accesses-separately.patch application/octet-stream 19.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2026-02-24 04:25:19 Re: Modernize error message for malformed B-Tree tuple posting
Previous Message Michael Paquier 2026-02-24 04:09:33 Re: Re: A out of date comment of WaitForWALToBecomeAvailable