From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding wait events statistics
Date: 2025-07-29 14:01:46
Message-ID: CA+TgmoYJRAzUAFX_FGrNRBxsW+5_TNPJ2rHbO=zbf2kXXC5EYQ@mail.gmail.com
Lists: pgsql-hackers
On Mon, Jul 28, 2025 at 8:35 AM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> Focusing on LWLocks, I think that OidGenLock, for example, is one it would be
> interesting to have the duration for. That way one could start to investigate
> whether there is a very long run of used OID values with no gaps in TOAST
> tables or whether there is higher concurrency. I do agree that might not be
> useful for all the LWLocks (those for which users have no actionable
> workarounds or solutions) but for some I think it could be useful (like the
> one mentioned above).
Yeah, maybe. That's an interesting example. One could make the
argument that we ought to have a system that specifically tests for
OID oversaturation, but admittedly that feels pretty special-purpose.
On the other hand, I still don't think timing every LWLock wait is
going to work out well from a performance standpoint.
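For what it's worth, sampling already covers at least part of that case
even without durations: polling pg_stat_activity shows how often backends
are stuck on that LWLock. A rough sketch (recent releases report the wait
event as "OidGen" rather than "OidGenLock"):

-- poll periodically, e.g. once per second, and aggregate the samples
SELECT now() AS sample_time, pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('OidGen', 'OidGenLock');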
> Outside of LWLocks, I think it makes sense for heavyweight locks to answer
> questions like: Is the locker holding a lock for longer?
I think the existing log_lock_waits does a pretty good job with this case.
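For anyone following along, enabling that is roughly:

-- log a message when a lock wait exceeds deadlock_timeout
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';  -- also the reporting threshold for log_lock_waits
SELECT pg_reload_conf();

The resulting log lines say how long the backend has been waiting and which
lock it is waiting for, which gets at the duration question directly.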
> I see, you are saying that a LOT of wait events could each add a little
> overhead, which could add up to much more than those 66 cycles.
Yes.
> That's right, that's why the idea was to add the counters and timings only for
> wait classes where the counter and duration overhead is relatively small
> compared to the waits themselves.
>
> From [1]:
>
> "
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%
> "
I have a hard time reconciling this with the overhead of EXPLAIN
ANALYZE, which we know to be much larger than this. To be fair, it's
quite possible that we switch between executor nodes a lot more often
than we go off-CPU, so maybe the wait event instrumentation is just a
lot cheaper for that reason. But I'm still suspicious that something
is wrong with these measurements: maybe you're testing on a platform
where clock reads are particularly cheap, or a workload where waits are
less common than in some other workloads, or something else I can't
quite pin down.
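As a quick way to check how expensive clock reads actually are on a given
box, pg_test_timing reports the per-call overhead directly, and comparing
the same plan with and without per-node timing gives a rough feel for it
at the SQL level; the query here is just for illustration:

EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM generate_series(1, 1000000);
EXPLAIN (ANALYZE, TIMING ON)  SELECT count(*) FROM generate_series(1, 1000000);

The gap between the two execution times is mostly the per-tuple timing calls.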
Even if the overhead for measuring LWLock wait events is truly only
1%, I'm skeptical about whether that's worth it for the average user.
At that level, it might be OK as an optional feature that people can
enable if they need it.
But as I've said before, it's super-important that we don't get
ourselves into a situation where hackers don't add wait events to
places that should have them for fear of the overhead being too large.
We have mostly avoided that problem up until now, as far as I am
aware.
> That's right, the ones you would spot would be hot buffers, but I think it's
> also possible that you miss some (the smaller the sampling interval is, the
> better, though).
But like ... who cares? Nobody needs an absolutely exhaustive list of
buffers above a certain level of hotness.
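If someone really does want a list, the approximate picture that is already
available is usually plenty. For example, assuming the pg_buffercache
extension is installed, something along these lines shows which buffers in
the current database are maximally hot and pinned right now:

-- requires: CREATE EXTENSION pg_buffercache;
SELECT c.relname, b.relblocknumber, b.usagecount, b.pinning_backends
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND b.usagecount = 5
ORDER BY b.pinning_backends DESC
LIMIT 20;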
> But still, without the durations you can't say whether the ones holding the
> buffers are holding them longer than before or whether there is an increase
> in concurrency.
I bet you can make a pretty good guess by looking at how many active
backends have any sort of heavyweight lock on the relation in
question. I have a lot of difficulty imagining this as a real point of
confusion in a customer troubleshooting situation.
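Concretely, a rough version of that guess, with 'my_table' standing in for
whatever relation is under suspicion:

SELECT count(DISTINCT l.pid) AS backends_on_relation,
       count(DISTINCT l.pid) FILTER (WHERE NOT l.granted) AS waiting
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'my_table'::regclass
  AND a.state = 'active';

A count that climbs over time points at increased concurrency; a flat count
points more at longer hold times.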
I mean, theoretically, you could imagine a situation where the
concurrency on a relation is high but you can't tell whether the
backends are all focusing on the same buffers for shorter times or on
different buffers for longer times. But I don't really see how this could occur
in practice. If it's an index, the contention almost has to be on the
root/upper-level pages; and if it's a heap you're either doing
sequential scans and the access pattern is uniform or you're doing
some kind of index scan and the contention is probably focused on the
upper-level index pages rather than the heap pages. You could maybe
imagine some crazy corner case where there's heap contention created
by TID scans, but that's too obscure a situation to justify adding
machinery of this kind.
--
Robert Haas
EDB: http://www.enterprisedb.com