Re: [RFC PATCH v0 0/7] Add EXPLAIN ANALYZE wait event reporting

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ilmar Yunusov <tanswis42(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC PATCH v0 0/7] Add EXPLAIN ANALYZE wait event reporting
Date: 2026-06-22 21:34:23
Message-ID: CA+TgmoZpp=yEuFDQAm8A3CwL2hOHO4Tv7500V-OF2ddRxSrsBA@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 8, 2026 at 7:23 PM Ilmar Yunusov <tanswis42(at)gmail(dot)com> wrote:
> - Is the `WAITS` option name and output shape acceptable, or should this be
> `WAIT_EVENTS` / different labels?

The name seems fine to me. WAIT_EVENTS also seems fine.

> - Is inclusive per-node attribution the right semantic for EXPLAIN?

I think so.

> - Is the fixed 64-entry accumulator plus explicit overflow bucket acceptable?

I'm not sure what you're referring to here.

> - Is the disabled hot-path overhead of checking an exported boolean in
> pgstat_report_wait_start/end acceptable?

I'm extremely skeptical. I'm really not keen to add ANY cycles to
pgstat_report_wait_start/end(). But the even bigger problem is that
turning this on will result in measuring the time a LOT more times
than we do at present, and that has significant overhead that can
distort the results. That is a problem with EXPLAIN ANALYZE, and I
think it could be a hugely crippling problem for EXPLAIN WAITS,
because a single ExecProcNode call could result in many, many I/Os or
LWLock operations.
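
To make the scaling argument concrete, here is a back-of-the-envelope sketch. It is not taken from the patch. It assumes that timing a wait costs one clock read at the start and one at the end, and every number in it (call counts, waits per call, clock-read cost, sampling interval, per-sample cost) is hypothetical. It shows only that the cost of per-wait timing grows with the number of waits, while the cost of sampling grows with runtime.

```python
# Back-of-the-envelope cost model; all names and numbers are hypothetical.

def per_wait_timing_cost_ns(node_calls, waits_per_call, clock_read_ns):
    """Cost of timing every wait: one clock read at wait start, one at wait end."""
    return node_calls * waits_per_call * 2 * clock_read_ns


def sampling_cost_ns(runtime_ns, sample_interval_ns, per_sample_ns):
    """Cost of sampling: one fixed-cost sample per timer tick,
    independent of how many waits occurred."""
    return (runtime_ns // sample_interval_ns) * per_sample_ns


# Hypothetical workload: 1,000,000 ExecProcNode calls with 20 waits each,
# 25 ns per clock read, 2 s of runtime, 1 ms sampling interval, 500 ns per sample.
timing_ns = per_wait_timing_cost_ns(1_000_000, 20, 25)
sample_ns = sampling_cost_ns(2_000_000_000, 1_000_000, 500)

print(f"per-wait timing: {timing_ns / 1e6:.1f} ms of added clock reads")
print(f"sampling:        {sample_ns / 1e6:.1f} ms of added sampling work")
```

Real costs depend on the clock source and on how many waits a plan actually incurs, which is why measurement on complex plan trees matters more than any model like this.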

I have always felt that any kind of wait event statistics should use a
sampling approach. I'd expect a successful patch in this area to work
by using a timer interrupt for self-sampling. It might be able to share
some infrastructure with the pending patch to print the plan for a
running query. The patch adds some infrastructure to keep track of the
active querydesc; this patch, in a self-sampling approach, would need
to keep track of the active planstate. Even that is going to be a
non-trivial overhead, I think, but it could be hidden in an
ExecProcNode wrapper so that the cost is 0 when the option is not
enabled.
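
The self-sampling idea can be illustrated with a toy simulation. This is a sketch only, not PostgreSQL code: the timeline format, the `sample_waits` function, and the node and wait-event labels are all made up for illustration, and it attributes each sample to the single active node rather than inclusively to its ancestors. At each timer tick it looks at which plan node is active and which wait event, if any, is in progress, and credits one sampling interval to that pair.

```python
from collections import defaultdict


def sample_waits(timeline, interval_us):
    """Estimate wait time per (node, wait_event) by periodic sampling.

    timeline: list of (duration_us, node, wait_event_or_None) in execution order.
    A tick at time t is attributed to the segment with start < t <= end.
    Returns {(node, wait_event): estimated_us}, where each sample that lands
    on a wait is credited with one full sampling interval.
    """
    estimate = defaultdict(int)
    now = 0
    next_tick = interval_us
    for duration_us, node, event in timeline:
        end = now + duration_us
        while next_tick <= end:
            if event is not None:  # the tick landed while the backend was waiting
                estimate[(node, event)] += interval_us
            next_tick += interval_us
        now = end
    return dict(estimate)


# Hypothetical execution: mostly one long I/O wait, plus one very short lock wait.
timeline = [
    (2500, "Seq Scan", None),            # running, not waiting
    (6000, "Seq Scan", "DataFileRead"),  # long I/O wait
    (300, "Hash Join", "SomeLWLock"),    # short lock wait
    (1200, "Hash Join", None),           # running, not waiting
]
estimate = sample_waits(timeline, interval_us=1000)
print(estimate)
```

With a 1 ms interval, the long wait is recovered exactly in this example and the 300 us wait is missed entirely. That is the trade-off with sampling: the cost is bounded by the tick rate, and short waits only show up statistically over many occurrences.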

I doubt that counting the number of times that a wait event occurs
will tell us anything we want to know. I have looked at those numbers
in the past and they seemed meaningless to me. The problem is that the
amount of time you actually wait can differ by multiple orders of
magnitude. Comparing the number of times that you waited for LWLock #1
and the number of times you waited for LWLock #2 is like comparing the
number of times you wasted some money to the number of times you
accidentally injured yourself. That is, it's meaningless. You can't
say "I wasted money 17 times this week and I only injured myself once,
so the wasting-money is a bigger problem" without knowing anything
about the amount of money wasted or the injury severity. For me,
wasting 17 nickels is a smaller problem than a really annoying
hangnail, but wasting 17 million dollars is a much bigger one.
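
A small numeric sketch of the same point, using invented durations: 17 waits of 5 microseconds on one lock against a single 2-second wait on another. Ranking by count and ranking by total time give opposite answers.

```python
# Hypothetical wait log: (wait_event, duration_us). The durations are invented.
waits = [("LWLock #1", 5)] * 17 + [("LWLock #2", 2_000_000)]


def summarize(waits):
    """Return (count per event, total waited microseconds per event)."""
    counts, total_us = {}, {}
    for event, duration_us in waits:
        counts[event] = counts.get(event, 0) + 1
        total_us[event] = total_us.get(event, 0) + duration_us
    return counts, total_us


counts, total_us = summarize(waits)
worst_by_count = max(counts, key=counts.get)
worst_by_time = max(total_us, key=total_us.get)

print("most frequent wait:", worst_by_count, counts)
print("most costly wait:  ", worst_by_time, total_us)
```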

It would be interesting to see some testing of this patch on complex
plan trees to see what the overhead actually is and how useful the
results look. My initial guess is that the timing will be useful, the
counts will be useless, and the overhead will significantly distort
the results vs. a sampling approach, but I might be wrong.

--
Robert Haas
EDB: http://www.enterprisedb.com