Quick Links

Re: Stack-based tracking of per-node WAL/buffer usage

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Lukas Fittl <lukas(at)fittl(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Stack-based tracking of per-node WAL/buffer usage
Date:	2025-09-04 20:23:06
Message-ID:	h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].

I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.

> 0001: Separate node instrumentation from other use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).

> 0003: Introduce stack for tracking per-node WAL/buffer usage

> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas(at)fittl(dot)com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage

> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> DestReceiver *dest;
> bool sendTuples;
> MemoryContext oldcontext;
> + InstrStackResource *res;
>
> /* sanity checks */
> Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (queryDesc->totaltime)
> InstrStart(queryDesc->totaltime);
>
> + /* Start up per-query node level instrumentation */
> + res = InstrStartQuery();
> +
> /*
> * extract information from the query descriptor and the query feature.
> */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (sendTuples)
> dest->rShutdown(dest);
>
> + /* Shut down per-query node level instrumentation */
> + InstrShutdownQuery(res);
> +
> if (queryDesc->totaltime)
> InstrStop(queryDesc->totaltime, estate->es_processed);

Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>
> /* Stop the node if we started it above, reporting 0 tuples. */
> if (node->instrument && node->instrument->running)
> + {
> InstrStopNode(node->instrument, 0);
>
> + /*
> + * Propagate WAL/buffer stats to the parent node on the
> + * instrumentation stack (which is where InstrStopNode returned us
> + * to).
> + */
> + InstrNodeAddToCurrent(&node->instrument->stack);
> + }
> +
> return false;
> }

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

/*
* If we know we won't need to back up, we can release resources at this
* point.
*/
if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
ExecShutdownNode(planstate);

> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Greetings,

Andres Freund

In response to

Stack-based tracking of per-node WAL/buffer usage at 2025-08-31 23:57:01 from Lukas Fittl

Responses

Re: Stack-based tracking of per-node WAL/buffer usage at 2025-09-09 19:35:43 from Lukas Fittl

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Nathan Bossart	2025-09-04 20:37:51	Re: Improve LWLock tranche name visibility across backends
Previous Message	Tomas Vondra	2025-09-04 18:55:19	Re: index prefetching