Re: Adding wait events statistics

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-09 02:19:07
Message-ID: lt6n664ijbmfftnuv3bgvt47q7kjz4tflu4kg3ingv6njjtvld@kesknxnidemo
Lists: pgsql-hackers

Hi,

On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular intervals
> and record historical data. That’s how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application "A"'s wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur between
> samples: please find attached a proof of concept patch series doing so.
>
> 0002 - It adds wait events statistics
>
> It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistic kind is a fixed one because we know the maximum number of wait
> events. Indeed:
>
> * it does not take into account custom wait events as extensions have all they need
> at their disposal to create custom stats on their own wait events should they
> want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
> * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for the same
> reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().
>
> * we don't want to allocate memory in some places where the counters are
> incremented (see 4feba03d8b9). We could still use the same implementation as for
> backend statistics (i.e., make use of flush_static_cb) if we really want/need to
> switch to variable stats.

Even without allocating memory, this makes wait events considerably more
expensive: from ~4 instructions to something more like ~40, including
external function calls.

Today:

start wait:
0x0000000000a08e8f <+63>: lea 0x735662(%rip),%r12 # 0x113e4f8 <my_wait_event_info>
0x0000000000a08e96 <+70>: mov (%r12),%rax
0x0000000000a08e9a <+74>: movl $0x9000006,(%rax)

< do wait operation >

end wait:
0x0000000000a08eaa <+90>: mov (%r12),%rax
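In source terms, today's path is just a single store through my_wait_event_info, roughly what the inline pgstat_report_wait_start()/pgstat_report_wait_end() pair in wait_event.h compiles down to. A stand-alone sketch (the pointer target is a stand-in for the backend's slot in shared memory):

```c
#include <stdint.h>

/* Stand-in for the per-backend pointer into shared memory; in PostgreSQL
 * this points at the backend's wait_event_info slot. */
static uint32_t wait_event_store;
static volatile uint32_t *my_wait_event_info = &wait_event_store;

/* Roughly what pgstat_report_wait_start() does today: one store. */
static inline void
report_wait_start(uint32_t wait_event_info)
{
	*my_wait_event_info = wait_event_info;
}

/* Ending the wait is likewise a single store of zero. */
static inline void
report_wait_end(void)
{
	*my_wait_event_info = 0;
}
```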

After:

start wait:
0x0000000000a0b39f <+63>: lea 0x73673a(%rip),%r12 # 0x1141ae0 <my_wait_event_info>
0x0000000000a0b3a6 <+70>: mov (%r12),%rax
0x0000000000a0b3aa <+74>: movl $0x9000006,(%rax)

< do wait operation >

end wait:

0x0000000000a0b3ba <+90>: mov (%r12),%rax
0x0000000000a0b3be <+94>: mov (%rax),%edi
0x0000000000a0b3c0 <+96>: rex call 0xa452e0 <waitEventIncrementCounter>
0x0000000000a0b3c6 <+102>: mov (%r12),%rax

Where waitEventIncrementCounter ends up as:

0x0000000000a452e0 <+0>: lea 0x6fc7f9(%rip),%rax # 0x1141ae0 <my_wait_event_info>
0x0000000000a452e7 <+7>: mov (%rax),%rax
0x0000000000a452ea <+10>: mov (%rax),%eax
0x0000000000a452ec <+12>: mov %eax,%edx
0x0000000000a452ee <+14>: and $0xff000000,%edx
0x0000000000a452f4 <+20>: jne 0xa45300 <waitEventIncrementCounter+32>
0x0000000000a452f6 <+22>: test %ax,%ax
0x0000000000a452f9 <+25>: jne 0xa45300 <waitEventIncrementCounter+32>
0x0000000000a452fb <+27>: ret
0x0000000000a452fc <+28>: nopl 0x0(%rax)
0x0000000000a45300 <+32>: cmp $0x1000000,%edx
0x0000000000a45306 <+38>: jne 0xa4530e <waitEventIncrementCounter+46>
0x0000000000a45308 <+40>: cmp $0x5e,%ax
0x0000000000a4530c <+44>: ja 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a4530e <+46>: cmp $0x7000000,%edx
0x0000000000a45314 <+52>: sete %cl
0x0000000000a45317 <+55>: test %ax,%ax
0x0000000000a4531a <+58>: setne %al
0x0000000000a4531d <+61>: test %al,%cl
0x0000000000a4531f <+63>: jne 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a45321 <+65>: cmp $0xb000000,%edx
0x0000000000a45327 <+71>: je 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a45329 <+73>: mov %edi,%eax
0x0000000000a4532b <+75>: movzwl %di,%edi
0x0000000000a4532e <+78>: lea 0x714e4b(%rip),%rdx # 0x115a180 <PendingWaitEventStats>
0x0000000000a45335 <+85>: shr $0x18,%eax
0x0000000000a45338 <+88>: shl $0x5,%rax
0x0000000000a4533c <+92>: add 0x4e1735(%rip),%rax # 0xf26a78
0x0000000000a45343 <+99>: movslq 0x8(%rax),%rax
0x0000000000a45347 <+103>: add %rdi,%rax
0x0000000000a4534a <+106>: incq (%rdx,%rax,8)
0x0000000000a4534e <+110>: lea 0x714e0b(%rip),%rax # 0x115a160 <have_wait_event_stats>
0x0000000000a45355 <+117>: movb $0x1,(%rax)
0x0000000000a45358 <+120>: ret

That's about an order of magnitude increase in overhead, just based on the
number of instructions, and probably considerably more in runtime (you're
adding external function calls etc).
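For readability, here is roughly what that disassembly corresponds to in C. The class constants, the "not tracked" cases, and the per-class offset table are inferred from the instruction stream, not taken from the patch source, so treat them as illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants inferred from the compares in the disassembly above;
 * the real values are generated, these are illustrative. */
#define WE_CLASS_MASK       0xff000000u
#define WE_CLASS_LWLOCK     0x01000000u	/* the cmp $0x1000000 */
#define WE_CLASS_EXTENSION  0x07000000u	/* the cmp $0x7000000 */
#define WE_CLASS_CUSTOM     0x0b000000u	/* the cmp $0xb000000; name assumed */
#define LAST_BUILTIN_LWLOCK 0x5e	/* the cmp $0x5e */

/* Flat pending-counter array; sizing here is illustrative only. */
static uint64_t PendingWaitEventStats[64 * 64];
static bool have_wait_event_stats = false;

/* Stand-in for the per-class offset loaded via the in-memory table. */
static inline int
class_offset(uint32_t classId)
{
	return (int) classId * 64;
}

static void
waitEventIncrementCounter(uint32_t wait_event_info)
{
	uint32_t	classBits = wait_event_info & WE_CLASS_MASK;
	uint16_t	eventId = wait_event_info & 0xffff;

	if (classBits == 0 && eventId == 0)
		return;					/* not waiting at all */
	if (classBits == WE_CLASS_LWLOCK && eventId > LAST_BUILTIN_LWLOCK)
		return;					/* user-defined tranche: not tracked */
	if (classBits == WE_CLASS_EXTENSION && eventId != 0)
		return;					/* custom extension event: not tracked */
	if (classBits == WE_CLASS_CUSTOM)
		return;

	PendingWaitEventStats[class_offset(classBits >> 24) + eventId]++;
	have_wait_event_stats = true;
}
```

That is four compare-and-branch pairs plus two extra memory references on every single wait, before the pgstat flush machinery even runs.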

I think to make this viable, we would have to remove some wait events and be
more careful about adding them. And optimize this code a good bit. This
really shouldn't end up with this much code.
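One possible direction (my sketch, not something the patch does): resolve whether and where an event is counted once, outside the hot path, since wait_event_info is a compile-time constant at nearly every call site; that could even be folded into a generated constant table. The per-wait work then shrinks to one predictable branch and an inlined increment. The slot mapping below is a hypothetical stand-in:

```c
#include <stdint.h>

#define WE_SLOT_NOT_TRACKED (-1)

/* Illustrative flat counter array, as in the reconstruction above. */
static uint64_t PendingWaitEventStats[64 * 64];

/*
 * Hypothetical helper: classify the event once, outside the hot path.
 * A real version would return WE_SLOT_NOT_TRACKED for all the untracked
 * classes as well; this sketch only handles the "not waiting" case.
 */
static inline int
wait_event_slot(uint32_t wait_event_info)
{
	uint32_t	classId = wait_event_info >> 24;
	uint16_t	eventId = wait_event_info & 0xffff;

	if (classId == 0 && eventId == 0)
		return WE_SLOT_NOT_TRACKED;
	return (int) (classId * 64 + eventId);	/* stand-in dense mapping */
}

/* Hot path: no call, no classification chain, just one increment. */
static inline void
count_wait_event(int slot)
{
	if (slot != WE_SLOT_NOT_TRACKED)
		PendingWaitEventStats[slot]++;
}
```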

Greetings,

Andres Freund
