Re: Adding wait events statistics

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-09 07:26:05
Message-ID: aG4ZjQ1qKensHBOY@ip-10-97-1-34.eu-west-3.compute.internal
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> > * we don't want to allocate memory in some places where the counters are
> > incremented (see 4feba03d8b9). We could still use the same implementation as for
> > backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> > switch to variable stats.
>
> Even without allocating memory, this makes wait events considerably more
> expensive. Going from ~4 instructions to something more like ~40. Including
> external function calls.

Yeah looking at the performance impacts and discuss those was in my todo (as
mentioned above), thanks for looking at it!

> end wait:
> 0x0000000000a08eaa <+90>: mov (%r12),%rax

Yeah this one is unbeatable, it will have to be more expensive...

On my side, I'm getting:

=> 0x000000000091a320 <+0>: mov 0x5337f9(%rip),%rax # 0xe4db20 <my_wait_event_info>
0x000000000091a327 <+7>: mov (%rax),%eax
0x000000000091a329 <+9>: mov %eax,%edx
0x000000000091a32b <+11>: and $0xff000000,%edx
0x000000000091a331 <+17>: jne 0x91a340 <waitEventIncrementCounter+32>
0x000000000091a333 <+19>: test %ax,%ax
0x000000000091a336 <+22>: jne 0x91a340 <waitEventIncrementCounter+32>
0x000000000091a338 <+24>: ret
0x000000000091a339 <+25>: nopl 0x0(%rax)
0x000000000091a340 <+32>: cmp $0x1000000,%edx
0x000000000091a346 <+38>: jne 0x91a34e <waitEventIncrementCounter+46>
0x000000000091a348 <+40>: cmp $0x5e,%ax
0x000000000091a34c <+44>: ja 0x91a338 <waitEventIncrementCounter+24>
0x000000000091a34e <+46>: cmp $0x7000000,%edx
0x000000000091a354 <+52>: sete %cl
0x000000000091a357 <+55>: test %ax,%ax
0x000000000091a35a <+58>: setne %al
0x000000000091a35d <+61>: test %al,%cl
0x000000000091a35f <+63>: jne 0x91a338 <waitEventIncrementCounter+24>
0x000000000091a361 <+65>: cmp $0xb000000,%edx
0x000000000091a367 <+71>: je 0x91a338 <waitEventIncrementCounter+24>
0x000000000091a369 <+73>: mov %edi,%eax
0x000000000091a36b <+75>: movzwl %di,%edi
0x000000000091a36e <+78>: shr $0x18,%eax
0x000000000091a371 <+81>: shl $0x5,%rax
0x000000000091a375 <+85>: movslq 0xe4d9c8(%rax),%rax
0x000000000091a37c <+92>: movb $0x1,0x56045d(%rip) # 0xe7a7e0 <have_wait_event_stats>
0x000000000091a383 <+99>: add %rdi,%rax
0x000000000091a386 <+102>: addq $0x1,0xe7a800(,%rax,8)
0x000000000091a38f <+111>: ret

> That's about an order of magnitude increase in overhead, just based on the
> number of instructions, and probably considerably xmore in runtime (you're
> adding external function calls etc).

Right.

> I think to make this viable, we would have to remove some wait events and be
> more careful about adding them.

You mean incrementing the counters for wait events for which we expect that
the increment is "negligible" as compare to the wait event itself? (Like DATAFILE
IO and such)?

If so I agree and that's something I also have discussed with a few (also if we
add the timings later), that we could/should be more granular. If we go that way
we could also imagine a GUC to set the stats for all of them? (default being off).

> And optimize this code a good bit. This
> really shouldn't end up with this mouch code.

Yes, with the idea mentioned in [1] (code not shared yet), I end up with:

Dump of assembler code for function waitEventIncrementCounter:
=> 0x000000000091b320 <+0>: mov %edi,%eax
0x000000000091b322 <+2>: and $0xff000000,%eax
0x000000000091b327 <+7>: jne 0x91b330 <waitEventIncrementCounter+16>
0x000000000091b329 <+9>: test %di,%di
0x000000000091b32c <+12>: jne 0x91b330 <waitEventIncrementCounter+16>
0x000000000091b32e <+14>: ret
0x000000000091b32f <+15>: nop
0x000000000091b330 <+16>: cmp $0x1000000,%eax
0x000000000091b335 <+21>: jne 0x91b33d <waitEventIncrementCounter+29>
0x000000000091b337 <+23>: cmp $0x5e,%di
0x000000000091b33b <+27>: ja 0x91b32e <waitEventIncrementCounter+14>
0x000000000091b33d <+29>: cmp $0x7000000,%eax
0x000000000091b342 <+34>: sete %cl
0x000000000091b345 <+37>: test %di,%di
0x000000000091b348 <+40>: setne %dl
0x000000000091b34b <+43>: test %dl,%cl
0x000000000091b34d <+45>: jne 0x91b32e <waitEventIncrementCounter+14>
0x000000000091b34f <+47>: cmp $0xb000000,%eax
0x000000000091b354 <+52>: je 0x91b32e <waitEventIncrementCounter+14>
0x000000000091b356 <+54>: mov %edi,%eax
0x000000000091b358 <+56>: movzwl %di,%edi
0x000000000091b35b <+59>: shr $0x18,%eax
0x000000000091b35e <+62>: mov 0xe4e9c0(,%rax,8),%rax
0x000000000091b366 <+70>: movb $0x1,0x5603f3(%rip) # 0xe7b760 <have_wait_event_stats>
0x000000000091b36d <+77>: mov 0x8(%rax),%rax
0x000000000091b371 <+81>: addq $0x1,(%rax,%rdi,8)
0x000000000091b376 <+86>: ret

That looks better before the first branch (could be done in v1 too) as it uses
registers only but looks worse when it's time to increment the counter while
the code is much simpler:

"
WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);

AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
"

instead of

"
counter = waitEventGetCounter(PendingWaitEventStats.counts, waitInfo.classId,
waitInfo.eventId);

(*counter)++;

"

with the associated machinery in waitEventGetCounter().

Indeed, the waitEventGetCounter() produced only one memory read (and one write):

movslq 0xe4d9c8(%rax),%rax
addq $0x1,0xe7a800(,%rax,8)

while the AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++ produces 2
pointer dereferences and 1 write:

mov 0xe4e9c0(,%rax,8),%rax
mov 0x8(%rax),%rax
addq $0x1,(%rax,%rdi,8)

I still need to work on it and try to make the code simpler and effective.

[1]: https://www.postgresql.org/message-id/aG04Zk%2Bb3SRN0pBe%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Jones 2025-07-09 07:27:51 Re: XMLSerialize: version and explicit XML declaration
Previous Message Arseniy Mukhin 2025-07-09 06:58:57 Re: amcheck support for BRIN indexes