From: Andres Freund <andres(at)anarazel(dot)de>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-09 02:19:07
Message-ID: lt6n664ijbmfftnuv3bgvt47q7kjz4tflu4kg3ingv6njjtvld@kesknxnidemo
Lists: pgsql-hackers
Hi,
On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular intervals
> and record historical data. That’s how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur between
> samples: please find attached a proof of concept patch series doing so.
>
> 0002 - It adds wait events statistics
>
> It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistic kind is a fixed one because we know the maximum number of wait
> events. Indeed:
>
> * it does not take into account custom wait events as extensions have all they need
> at their disposal to create custom stats on their own wait events should they
> want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
> * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for the same
> reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().
>
> * we don't want to allocate memory in some places where the counters are
> incremented (see 4feba03d8b9). We could still use the same implementation as for
> backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> switch to variable stats.
Even without allocating memory, this makes wait events considerably more
expensive: we go from ~4 instructions to something more like ~40, including
external function calls.
Today:
start wait:
0x0000000000a08e8f <+63>: lea 0x735662(%rip),%r12 # 0x113e4f8 <my_wait_event_info>
0x0000000000a08e96 <+70>: mov (%r12),%rax
0x0000000000a08e9a <+74>: movl $0x9000006,(%rax)
< do wait operation >
end wait:
0x0000000000a08eaa <+90>: mov (%r12),%rax
After:
start wait:
0x0000000000a0b39f <+63>: lea 0x73673a(%rip),%r12 # 0x1141ae0 <my_wait_event_info>
0x0000000000a0b3a6 <+70>: mov (%r12),%rax
0x0000000000a0b3aa <+74>: movl $0x9000006,(%rax)
< do wait operation >
end wait:
0x0000000000a0b3ba <+90>: mov (%r12),%rax
0x0000000000a0b3be <+94>: mov (%rax),%edi
0x0000000000a0b3c0 <+96>: rex call 0xa452e0 <waitEventIncrementCounter>
0x0000000000a0b3c6 <+102>: mov (%r12),%rax
Where waitEventIncrementCounter ends up as:
0x0000000000a452e0 <+0>: lea 0x6fc7f9(%rip),%rax # 0x1141ae0 <my_wait_event_info>
0x0000000000a452e7 <+7>: mov (%rax),%rax
0x0000000000a452ea <+10>: mov (%rax),%eax
0x0000000000a452ec <+12>: mov %eax,%edx
0x0000000000a452ee <+14>: and $0xff000000,%edx
0x0000000000a452f4 <+20>: jne 0xa45300 <waitEventIncrementCounter+32>
0x0000000000a452f6 <+22>: test %ax,%ax
0x0000000000a452f9 <+25>: jne 0xa45300 <waitEventIncrementCounter+32>
0x0000000000a452fb <+27>: ret
0x0000000000a452fc <+28>: nopl 0x0(%rax)
0x0000000000a45300 <+32>: cmp $0x1000000,%edx
0x0000000000a45306 <+38>: jne 0xa4530e <waitEventIncrementCounter+46>
0x0000000000a45308 <+40>: cmp $0x5e,%ax
0x0000000000a4530c <+44>: ja 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a4530e <+46>: cmp $0x7000000,%edx
0x0000000000a45314 <+52>: sete %cl
0x0000000000a45317 <+55>: test %ax,%ax
0x0000000000a4531a <+58>: setne %al
0x0000000000a4531d <+61>: test %al,%cl
0x0000000000a4531f <+63>: jne 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a45321 <+65>: cmp $0xb000000,%edx
0x0000000000a45327 <+71>: je 0xa452fb <waitEventIncrementCounter+27>
0x0000000000a45329 <+73>: mov %edi,%eax
0x0000000000a4532b <+75>: movzwl %di,%edi
0x0000000000a4532e <+78>: lea 0x714e4b(%rip),%rdx # 0x115a180 <PendingWaitEventStats>
0x0000000000a45335 <+85>: shr $0x18,%eax
0x0000000000a45338 <+88>: shl $0x5,%rax
0x0000000000a4533c <+92>: add 0x4e1735(%rip),%rax # 0xf26a78
0x0000000000a45343 <+99>: movslq 0x8(%rax),%rax
0x0000000000a45347 <+103>: add %rdi,%rax
0x0000000000a4534a <+106>: incq (%rdx,%rax,8)
0x0000000000a4534e <+110>: lea 0x714e0b(%rip),%rax # 0x115a160 <have_wait_event_stats>
0x0000000000a45355 <+117>: movb $0x1,(%rax)
0x0000000000a45358 <+120>: ret
That's about an order of magnitude increase in overhead, just based on the
number of instructions, and probably considerably more in runtime (you're
adding external function calls etc).
I think to make this viable, we would have to remove some wait events and be
more careful about adding them. And optimize this code a good bit. This
really shouldn't end up with this much code.
Greetings,
Andres Freund