From: Andres Freund <andres(at)anarazel(dot)de>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-09 15:19:00
Message-ID: xuynb44ql3hhggvqtme7axbliww7gwuy6pbaohxc4ngu3ynbsi@rvvpjxf55aia
Lists: pgsql-hackers
Hi,
On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> > > * we don't want to allocate memory in some places where the counters are
> > > incremented (see 4feba03d8b9). We could still use the same implementation as for
> > > backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> > > switch to variable stats.
> >
> > Even without allocating memory, this makes wait events considerably more
> > expensive. Going from ~4 instructions to something more like ~40. Including
> > external function calls.
>
> Yeah, looking at the performance impacts and discussing them was on my todo list
> (as mentioned above), thanks for looking at it!
TBH, the more I think about this, the less convinced I am that a pure counter provides
*any* additional value over the stats we already have, in which case the overhead
is moot.
We commonly use wait events unconditionally around system calls, e.g.:
pgstat_report_wait_start(wait_event_info);
returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
pgstat_report_wait_end();
A counter for the number of wait events is just a less informative version of
what we already have in pg_stat_io. There's literally no value provided by
tracking the number of times that wait event is hit.
A pretty large number of wait events operate that way. For those the
interesting information would be to measure *how long* you are in the wait
event, not the number of times the wait event was entered for a very short
amount of time.
The obvious counter to that is that you likely are planning to add timing
information at some point down the line - leaving overhead aside, that would
*also* just duplicate information for the above case, namely track_io_timing.
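For reference, measuring the duration around the same call would look roughly
like this (a sketch only, assuming portability/instr_time.h; waitEventAddTime is
a made-up accumulator, not something that exists):

instr_time	start, duration;

INSTR_TIME_SET_CURRENT(start);
pgstat_report_wait_start(wait_event_info);
returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
pgstat_report_wait_end();

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);

/* accumulate into a hypothetical per-wait-event counter */
waitEventAddTime(wait_event_info, INSTR_TIME_GET_MICROSEC(duration));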
The wait events we have as-is are useful because they tell you what's
currently happening, and you can correlate that with other things that are
going on. But a pure counter of wait events encountered doesn't allow for
that.
> > end wait:
> > 0x0000000000a08eaa <+90>: mov (%r12),%rax
>
> Yeah this one is unbeatable, it will have to be more expensive...
>
> On my side, I'm getting:
>
> => 0x000000000091a320 <+0>: mov 0x5337f9(%rip),%rax # 0xe4db20 <my_wait_event_info>
> 0x000000000091a327 <+7>: mov (%rax),%eax
> 0x000000000091a329 <+9>: mov %eax,%edx
> 0x000000000091a32b <+11>: and $0xff000000,%edx
> 0x000000000091a331 <+17>: jne 0x91a340 <waitEventIncrementCounter+32>
> 0x000000000091a333 <+19>: test %ax,%ax
> 0x000000000091a336 <+22>: jne 0x91a340 <waitEventIncrementCounter+32>
> 0x000000000091a338 <+24>: ret
> 0x000000000091a339 <+25>: nopl 0x0(%rax)
> 0x000000000091a340 <+32>: cmp $0x1000000,%edx
> 0x000000000091a346 <+38>: jne 0x91a34e <waitEventIncrementCounter+46>
> 0x000000000091a348 <+40>: cmp $0x5e,%ax
> 0x000000000091a34c <+44>: ja 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a34e <+46>: cmp $0x7000000,%edx
> 0x000000000091a354 <+52>: sete %cl
> 0x000000000091a357 <+55>: test %ax,%ax
> 0x000000000091a35a <+58>: setne %al
> 0x000000000091a35d <+61>: test %al,%cl
> 0x000000000091a35f <+63>: jne 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a361 <+65>: cmp $0xb000000,%edx
> 0x000000000091a367 <+71>: je 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a369 <+73>: mov %edi,%eax
> 0x000000000091a36b <+75>: movzwl %di,%edi
> 0x000000000091a36e <+78>: shr $0x18,%eax
> 0x000000000091a371 <+81>: shl $0x5,%rax
> 0x000000000091a375 <+85>: movslq 0xe4d9c8(%rax),%rax
> 0x000000000091a37c <+92>: movb $0x1,0x56045d(%rip) # 0xe7a7e0 <have_wait_event_stats>
> 0x000000000091a383 <+99>: add %rdi,%rax
> 0x000000000091a386 <+102>: addq $0x1,0xe7a800(,%rax,8)
> 0x000000000091a38f <+111>: ret
I think adding 6 conditional branches to this is just out of the question...
> > I think to make this viable, we would have to remove some wait events and be
> > more careful about adding them.
>
> You mean incrementing the counters for wait events for which we expect that
> the increment is "negligible" as compared to the wait event itself? (Like DATAFILE
> IO and such)?
I mean that we would need to either *remove* a bunch of wait events, or change
the API so that there are "cheap" wait events that just have the current
overhead, and expensive wait events that do this tracking.
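Purely as an illustration of that split (the names are made up, this is not a
proposal of the actual API): the existing report function stays a single store,
and only call sites that opt in pay for the counting:

/* today's cheap path: a single four-byte store */
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
	*(volatile uint32 *) my_wait_event_info = wait_event_info;
}

/* hypothetical opt-in variant for the few events worth tracking */
static inline void
pgstat_report_wait_start_tracked(uint32 wait_event_info)
{
	*(volatile uint32 *) my_wait_event_info = wait_event_info;
	waitEventIncrementCounter(wait_event_info);
}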
> If so, I agree, and that's something I have also discussed with a few people (it
> also applies if we add the timings later): we could/should be more granular. If we
> go that way, we could also imagine a GUC to enable the stats for all of them
> (default being off)?
Any conditional code that you add makes this stuff more expensive...
> > And optimize this code a good bit. This
> > really shouldn't end up with this much code.
>
> Yes, with the idea mentioned in [1] (code not shared yet), I end up with:
>
> Dump of assembler code for function waitEventIncrementCounter:
> => 0x000000000091b320 <+0>: mov %edi,%eax
> 0x000000000091b322 <+2>: and $0xff000000,%eax
> 0x000000000091b327 <+7>: jne 0x91b330 <waitEventIncrementCounter+16>
> 0x000000000091b329 <+9>: test %di,%di
> 0x000000000091b32c <+12>: jne 0x91b330 <waitEventIncrementCounter+16>
> 0x000000000091b32e <+14>: ret
> 0x000000000091b32f <+15>: nop
> 0x000000000091b330 <+16>: cmp $0x1000000,%eax
> 0x000000000091b335 <+21>: jne 0x91b33d <waitEventIncrementCounter+29>
> 0x000000000091b337 <+23>: cmp $0x5e,%di
> 0x000000000091b33b <+27>: ja 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b33d <+29>: cmp $0x7000000,%eax
> 0x000000000091b342 <+34>: sete %cl
> 0x000000000091b345 <+37>: test %di,%di
> 0x000000000091b348 <+40>: setne %dl
> 0x000000000091b34b <+43>: test %dl,%cl
> 0x000000000091b34d <+45>: jne 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b34f <+47>: cmp $0xb000000,%eax
> 0x000000000091b354 <+52>: je 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b356 <+54>: mov %edi,%eax
> 0x000000000091b358 <+56>: movzwl %di,%edi
> 0x000000000091b35b <+59>: shr $0x18,%eax
> 0x000000000091b35e <+62>: mov 0xe4e9c0(,%rax,8),%rax
> 0x000000000091b366 <+70>: movb $0x1,0x5603f3(%rip) # 0xe7b760 <have_wait_event_stats>
> 0x000000000091b36d <+77>: mov 0x8(%rax),%rax
> 0x000000000091b371 <+81>: addq $0x1,(%rax,%rdi,8)
> 0x000000000091b376 <+86>: ret
That's still way too many jumps.
> That looks better before the first branch (could be done in v1 too) as it uses
> registers only, but looks worse when it's time to increment the counter, even
> though the code is much simpler:
>
> "
> WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
>
> AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> "
Too many indirections. The increment should be doable without needing to
dereference intermediary memory locations.
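For example (a sketch only; the per-class bound and the flat layout are
assumptions, not what the patch does), a dense static array indexed by a value
computed from wait_event_info alone boils the hot path down to a single add
against a constant base address:

#define WE_MAX_CLASSES			16
#define WE_MAX_EVENTS_PER_CLASS	256

static uint64 PendingWaitEventCounts[WE_MAX_CLASSES * WE_MAX_EVENTS_PER_CLASS];

static inline void
waitEventIncrementCounter(uint32 wait_event_info)
{
	uint32		classId = wait_event_info >> 24;
	uint32		eventId = wait_event_info & 0xFFFF;

	/* one computed index; no loads of intermediary pointers */
	PendingWaitEventCounts[classId * WE_MAX_EVENTS_PER_CLASS + eventId]++;
}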
Greetings,
Andres Freund