Re: Adding wait events statistics

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-22 14:07:30
Message-ID: 7wh6dalioz2kxc43efxeiwgb6gjzhfq4hz6zxkggzpqopk57rp@ji22dyzvjem5
Lists: pgsql-hackers

Hi,

On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> Anyway, let's forget about eBPF, I ran another experiment by counting the cycles
> with:
>
> static inline uint64_t rdtsc(void) {
>     uint32_t lo, hi;
>     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
>     return ((uint64_t)hi << 32) | lo;
> }
>
> and then calling this function before and after waitEventIncrementCounter()
> and also at wait_start() and wait_end() (without the increment-counters patches).

I think you're still going to get massively increased baseline numbers that
way - the normal cost of a wait event is under 10 cycles, while doing two
rdtscs costs somewhere between 60-90 cycles. Which means that any increase due
to counters & timers will look a lot cheaper compared to that inflated
baseline than it would compared to the actual current cost of a wait event.

> So that we can compare with the percentile cycles per wait events (see attached).
>
> We can see that, for those wait classes, all their wait events overhead would be
> < 5% and more precisely:
>
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%

I think that's largely because there are relatively few such wait events:
there is very, very little contention in the regression tests and we just
don't do a whole lot of intensive things in the tests. I suspect that at
least some of the high counts here will actually be due to tests that
explicitly test the contention behaviour, and thus will have very high wait
times.

E.g. if you measure client timings, the overhead here will be fairly low,
because we're not going to be CPU bound by the back-and-forth between client
and server, and thus many of the waits will be longer. If you instead measure
a single-client read-only pgbench, it'll look different. Similarly, if you
have lwlock contention in a real-world workload, most of the waits will be
incredibly short, but in our tests that will not necessarily be the case.

> while the io and ipc classes have mixed results.
>
> So based on the cycles metric I think it looks pretty safe to implement for the
> vast majority of classes.

This is precisely why I am scared of this effort. If you only look at it in
the right light it'll look cheap, but in other cases it'll cause measurable
slowdowns.

> > I also continue to not believe that pure event counters are going to be useful
> > for the majority of wait events. I'm not sure it is really interesting for
> > *any* wait event that we don't already have independent stats for.
>
> For pure counters only I can see your point, but for counters + timings are you
> also not convinced?

For counters + timings I can see that it'd be useful. But I don't believe it's
close to as cheap as you say it is.

> > I think if we do want to have wait events that have more details, we need to:
> >
> > a) make them explicitly opt-in, i.e. code has to be changed over to use the
> > extended wait events
> > b) the extended wait events need to count both the number of encounters as
> > well as the duration, the number of encounters is not useful on its own
> > c) for each callsite that is converted to the extended wait event, you either
> > need to reason why the added overhead is ok, or do a careful experiment
> >
>
> I do agree with the above, what do you think about this lastest experiment counting
> the cycles?

I continue to not believe it at all, sorry. Even if the counting method were
accurate, you can't use our tests to measure the relative overhead, as they
aren't actually exercising the paths leading to waits.
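For illustration, point (b) above could be realized with an accumulator along
these lines (the names and the cycle-based unit are hypothetical, not from any
posted patch):

```c
#include <stdint.h>

/* Hypothetical sketch of an "extended" wait event accumulator: it records
 * both the number of times the wait was entered and the total time spent
 * waiting, since a raw count alone is not useful. */
typedef struct WaitEventStats
{
    uint64_t count;         /* number of times the wait was entered */
    uint64_t total_cycles;  /* accumulated duration, in TSC cycles */
} WaitEventStats;

/* Fold one completed wait, bracketed by start/end timestamps, into the
 * per-event accumulator. */
static inline void
wait_event_record(WaitEventStats *stats, uint64_t start, uint64_t end)
{
    stats->count++;
    stats->total_cycles += end - start;
}

/* Average duration per wait, guarding against division by zero. */
static inline uint64_t
wait_event_avg_cycles(const WaitEventStats *stats)
{
    return stats->count ? stats->total_cycles / stats->count : 0;
}
```

Every opted-in callsite would pay for two timestamp reads plus these updates,
which is exactly the overhead that needs to be justified per point (c).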

> > Personally I'd rather have an in-core sampling collector, counting how often
> > it sees certain wait events when sampling.
>
> Yeah but even if we are okay with losing "counters" by sampling, we'd still not get
> the duration. For the duration to be meaningful we also need the exact number
> of counters.

You don't need a precise duration to see which wait events are a problem. If
you see that some event is sampled a lot, you know that there either are a
*lot* of those wait events or the wait events are entered into for a long
time.
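A rough sketch of that idea (all names and sizes hypothetical): the hot wait
path stays untouched, and a periodic sampler just bumps a per-event counter
for whatever event a backend is currently in.

```c
#include <stdint.h>

#define NUM_WAIT_EVENTS 256     /* illustrative size, not the real event count */

/* Per-event sample counts accumulated by the sampler. */
typedef struct SampleCounts
{
    uint64_t samples[NUM_WAIT_EVENTS];
} SampleCounts;

/* Called on each sampler tick with the event the backend is currently
 * waiting on, or -1 if it is not waiting.  A high count for an event means
 * it is either entered very often or waited on for a long time - which is
 * usually enough to identify the problem, without precise durations. */
static void
sample_tick(SampleCounts *sc, int current_event)
{
    if (current_event >= 0 && current_event < NUM_WAIT_EVENTS)
        sc->samples[current_event]++;
}
```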

Greetings,

Andres Freund
