| From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: Adding wait events statistics | 
| Date: | 2025-07-09 07:26:05 | 
| Message-ID: | aG4ZjQ1qKensHBOY@ip-10-97-1-34.eu-west-3.compute.internal | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi,
On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> >  * we don't want to allocate memory in some places where the counters are
> >  incremented (see 4feba03d8b9). We could still use the same implementation as for
> >  backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> >  switch to variable stats.
> 
> Even without allocating memory, this makes wait events considerably more
> expensive. Going from ~4 instructions to something more like ~40.  Including
> external function calls.
Yeah looking at the performance impacts and discuss those was in my todo (as
mentioned above), thanks for looking at it! 
> end wait:
>    0x0000000000a08eaa <+90>:	mov    (%r12),%rax
Yeah this one is unbeatable, it will have to be more expensive...
On my side, I'm getting:
=> 0x000000000091a320 <+0>:     mov    0x5337f9(%rip),%rax        # 0xe4db20 <my_wait_event_info>
   0x000000000091a327 <+7>:     mov    (%rax),%eax
   0x000000000091a329 <+9>:     mov    %eax,%edx
   0x000000000091a32b <+11>:    and    $0xff000000,%edx
   0x000000000091a331 <+17>:    jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a333 <+19>:    test   %ax,%ax
   0x000000000091a336 <+22>:    jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a338 <+24>:    ret
   0x000000000091a339 <+25>:    nopl   0x0(%rax)
   0x000000000091a340 <+32>:    cmp    $0x1000000,%edx
   0x000000000091a346 <+38>:    jne    0x91a34e <waitEventIncrementCounter+46>
   0x000000000091a348 <+40>:    cmp    $0x5e,%ax
   0x000000000091a34c <+44>:    ja     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a34e <+46>:    cmp    $0x7000000,%edx
   0x000000000091a354 <+52>:    sete   %cl
   0x000000000091a357 <+55>:    test   %ax,%ax
   0x000000000091a35a <+58>:    setne  %al
   0x000000000091a35d <+61>:    test   %al,%cl
   0x000000000091a35f <+63>:    jne    0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a361 <+65>:    cmp    $0xb000000,%edx
   0x000000000091a367 <+71>:    je     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a369 <+73>:    mov    %edi,%eax
   0x000000000091a36b <+75>:    movzwl %di,%edi
   0x000000000091a36e <+78>:    shr    $0x18,%eax
   0x000000000091a371 <+81>:    shl    $0x5,%rax
   0x000000000091a375 <+85>:    movslq 0xe4d9c8(%rax),%rax
   0x000000000091a37c <+92>:    movb   $0x1,0x56045d(%rip)        # 0xe7a7e0 <have_wait_event_stats>
   0x000000000091a383 <+99>:    add    %rdi,%rax
   0x000000000091a386 <+102>:   addq   $0x1,0xe7a800(,%rax,8)
   0x000000000091a38f <+111>:   ret
> That's about an order of magnitude increase in overhead, just based on the
> number of instructions, and probably considerably xmore in runtime (you're
> adding external function calls etc).
Right.
> I think to make this viable, we would have to remove some wait events and be
> more careful about adding them.
You mean incrementing the counters for wait events for which we expect that
the increment is "negligible" as compare to the wait event itself? (Like DATAFILE
IO and such)?
If so I agree and that's something I also have discussed with a few (also if we
add the timings later), that we could/should be more granular. If we go that way
we could also imagine a GUC to set the stats for all of them? (default being off).
> And optimize this code a good bit. This
> really shouldn't end up with this mouch code.
Yes, with the idea mentioned in [1] (code not shared yet), I end up with:
Dump of assembler code for function waitEventIncrementCounter:
=> 0x000000000091b320 <+0>:     mov    %edi,%eax
   0x000000000091b322 <+2>:     and    $0xff000000,%eax
   0x000000000091b327 <+7>:     jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b329 <+9>:     test   %di,%di
   0x000000000091b32c <+12>:    jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b32e <+14>:    ret
   0x000000000091b32f <+15>:    nop
   0x000000000091b330 <+16>:    cmp    $0x1000000,%eax
   0x000000000091b335 <+21>:    jne    0x91b33d <waitEventIncrementCounter+29>
   0x000000000091b337 <+23>:    cmp    $0x5e,%di
   0x000000000091b33b <+27>:    ja     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b33d <+29>:    cmp    $0x7000000,%eax
   0x000000000091b342 <+34>:    sete   %cl
   0x000000000091b345 <+37>:    test   %di,%di
   0x000000000091b348 <+40>:    setne  %dl
   0x000000000091b34b <+43>:    test   %dl,%cl
   0x000000000091b34d <+45>:    jne    0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b34f <+47>:    cmp    $0xb000000,%eax
   0x000000000091b354 <+52>:    je     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b356 <+54>:    mov    %edi,%eax
   0x000000000091b358 <+56>:    movzwl %di,%edi
   0x000000000091b35b <+59>:    shr    $0x18,%eax
   0x000000000091b35e <+62>:    mov    0xe4e9c0(,%rax,8),%rax
   0x000000000091b366 <+70>:    movb   $0x1,0x5603f3(%rip)        # 0xe7b760 <have_wait_event_stats>
   0x000000000091b36d <+77>:    mov    0x8(%rax),%rax
   0x000000000091b371 <+81>:    addq   $0x1,(%rax,%rdi,8)
   0x000000000091b376 <+86>:    ret
That looks better before the first branch (could be done in v1 too) as it uses
registers only but looks worse when it's time to increment the counter while
the code is much simpler:
"
    WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
    AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
"
instead of
"
    counter = waitEventGetCounter(PendingWaitEventStats.counts, waitInfo.classId,
                                  waitInfo.eventId);
(*counter)++;
"
with the associated machinery in waitEventGetCounter().
Indeed, the waitEventGetCounter() produced only one memory read (and one write):
movslq 0xe4d9c8(%rax),%rax
addq   $0x1,0xe7a800(,%rax,8)
while the AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++ produces 2
pointer dereferences and 1 write:
mov    0xe4e9c0(,%rax,8),%rax
mov    0x8(%rax),%rax
addq   $0x1,(%rax,%rdi,8)
I still need to work on it and try to make the code simpler and effective.
[1]: https://www.postgresql.org/message-id/aG04Zk%2Bb3SRN0pBe%40ip-10-97-1-34.eu-west-3.compute.internal
Regards,
-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Jim Jones | 2025-07-09 07:27:51 | Re: XMLSerialize: version and explicit XML declaration | 
| Previous Message | Arseniy Mukhin | 2025-07-09 06:58:57 | Re: amcheck support for BRIN indexes |