Re: Adding wait events statistics

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-07 05:56:29
Message-ID: aGthjZHSPmvJxIQX@paquier.xyz
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 30, 2025 at 01:36:12PM +0000, Bertrand Drouvot wrote:
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?

To disclose some information here. This is something that we've found
as useful to know from our user base, an area where Postgres lacks
compared to other projects with similar properties.

> * Do we need to serialize the stats based on their names (as for
> PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved
> if file stats format is not changed": I think the answer is yes (see f98dbdeb51d)
> , which means there is no need for to_serialized_name/from_serialized_name.

The serialization is important for the custom wait events, much less
for wait events related to injection points (where we don't really
care) which could be filtered out when reporting the stats. It seems
to me that associating the names (perhaps as $CLASS/$NAME on disk as a
single string, two strings with two lengths are perhaps better) is
going to be needed. When the stats are read, the wait event ID could
be allocated from the startup process in the from_serialized_name
callback associated to the new stats kind, before extensions would try
to allocate it.

> * What if a new wait event is added? We'd need to change the stats file format,
> unless: the wait event stats kind becomes a variable one or we change a bit the
> way fixed stats are written/read to/from the stat file (we could add a new field
> in the PgStat_KindInfo for example).

I am pretty sure that everybody is going to forget about this
dependency, myself included. Please note that we add wait events in
stable branches, sometimes as a portion of a bug fix, and we cannot
bump the version of the stats file or stats would be lost on restart.
So these stats should be variable, going back to the custom wait
events as well. For the stats waiting to be flushed, you could
allocate a single array of WAIT_EVENT_CUSTOM_HASH_MAX_SIZE elements in
the TopMemoryContext of each backend, as we know that the number of
custom wait events is never going to be higher than that, indexed by
ID assigned.

It may be simpler to allocate a series of arrays, one for each class,
as well, sized depending on the wait event enum sizes. These could be
indexed based on the enums generated in wait_event_types.h, minus
their initial value (for the array of events in WaitEventTimeout,
minus PG_WAIT_TIMEOUT needs to be applied). Deriving the contents
that you need for the sizing and indexing of the pending data based on
WaitClassTableEntry seems indeed necessary at some degree. Now it
your proposal feels overcomplicated, and that the names may not be
necessary as we already have calls able to do class name and event
name lookups based on the uint32 class/name values. So we should not
require a second way to do a name <=> key lookup.

The to_serialization_name callback could then call
pgstat_get_wait_event() to get the event names, and
pgstat_get_wait_event_type() for the classes when writing the entries.

> Note: for some backends the wait events stats are not flushed (walwriter for
> example), so we need to find additional places to flush the wait events stats.

This is not new. This limitation exists for some auxiliary processes
or processes within their main loops. As long as we have APIs that
can be called to do the flushes, IMO we are fine, like we do for WAL,
IO, etc.

> It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind.
> That way:
>
> * It would be easier to add custom wait events if we want to
> * It would be possible to add new wait events without having to change the stats
> file format
>
> So adding 0004 to see what it would look like to have a variable sized stats kind
> instead and decide how we want to move forward.

IMO, I think that we should just do that. ABI compatibility and
custom wait events make fixed-sized stats unfit with what you are
trying to achieve here.

> That said it might be better to use all the 64 bits (means not have the half full
> of zeroes) for the hash key (better hashing distribution?): we could imagine
> using something like:
>
> ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)
>
> for the hash key.

Using uint32 for the hash key is fine. At least it's not an issue for
the pgstats machinery. So I would recommend to stick to
WAIT_EVENT_CLASS_MASK and WAIT_EVENT_ID_MASK for the computation of
the key of the entry inserted in shmem. Then when writing the stats
we can guess what are the class name and event names based on the key.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message cca5507 2025-07-07 06:00:09 Small optimization with expanding dynamic hash table
Previous Message John Naylor 2025-07-07 05:52:58 Re: cpluspluscheck vs ICU again