64-bit wait_event and introduction of 32-bit wait_event_arg

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: 64-bit wait_event and introduction of 32-bit wait_event_arg
Date: 2025-12-08 09:54:41
Message-ID: CAKZiRmyKcTaeSGzMYDN6aRR-BwYGPeZbzDRKvGkJhxAghfb4LQ@mail.gmail.com
Lists: pgsql-hackers

Hi all,

We were debating internally whether a transition to a 64-bit wait_event
would be an acceptable idea (Robert's primary concern is that it may
carry too little information), but I had code to demo this, so let's just
discuss it further. After ensuring that 64-bit integer math has the same
performance characteristics as 32-bit, at least on x86_64 [1][2], I
converted our wait_event_info (32-bit today) to 64 bits while trying
to use pg atomics, then used some bit-masking voodoo and got the lower
32 bits exposed as a new wait_event_arg, with some dumb demos. The idea is
to encode some specific (limited, but useful!) information into the
wait event variable itself, so that we gain an additional 32 bits of
space for details alongside the wait event itself, to help assess
wait-event-related problems. This seems to come without any performance
impact, at least on reasonable platforms used in production today
(those with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, that is). Intended use
pattern: if I were chasing a certain specific wait_event-related problem,
I could extract certain info straight from wait_event_arg, which is much
easier than drilling into other, more advanced views (if that information
is exposed at all; often it's not).

Q0) Key question: does that sound like a good idea to pursue further
or are there any blockers to it?

Sample demos are included in the patch; depending on the specific
wait_event, wait_event_arg could be:

1. PgSleep could show the time since it was launched (the simplest thing
one can imagine; or maybe we could think about showing time left too?):

  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | query
-------+----------------+-----------------+------------+----------------+-------
 78317 | client backend | Timeout         | PgSleep    |             10 | select 'imagine complex stuff here, dozens of kB of SQL text, queries, procedures, functions' as s, pg_sleep(10) as embedded_internally;

2. Passing the exact relation OID we are waiting on (here pid
82242 was doing "alter table p3 add ...", but it's waiting for the
backend that executed "lock table p3 in exclusive mode;"). We can
decode wait_event_arg right into the relation (p3):

postgres=# select pid, backend_type, wait_event_type, wait_event,
wait_event_arg, wait_event_arg::regclass, query from pg_stat_activity
where state = 'active' and (wait_event_type, wait_event) = ('Lock', 'relation');
  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | wait_event_arg |             query
-------+----------------+-----------------+------------+----------------+----------------+--------------------------------
 82242 | client backend | Lock            | relation   |          16467 | p3             | alter table p3 add id3 bigint;

3. IPC/SyncRep (SyncRepWaitForLSN()) could report the PID of the slowest
walsender. This is useful when multiple walsenders are involved, to
pinpoint where you might be slow/stuck:

  pid   | application_name | wait_event_type |  wait_event   | wait_event_arg |                    q
--------+------------------+-----------------+---------------+----------------+------------------------------------------
 120318 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120319 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120320 | pgbench          | IPC             | SyncRep       |         120248 | INSERT INTO child (parent_id, payload)
 120321 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 119689 | walreceiver2     | Activity        | WalSenderMain |                | START_REPLICATION 0/DC000000 TIMELINE 1
 120248 | walreceiver      | Activity        | WalSenderMain |                | START_REPLICATION 0/E2000000 TIMELINE 1

(you would then basically query pg_stat_replication for pid = 119689,
as it seems to be the slowest one here)

4. DataFileRead could report the fd (yes, it can differ from backend to
backend due to the fd cache, but it's a demo; it would probably be better
with the oid/RelFileNumber, but that is not fast to do :) And although
we already have dboid, there is also the tablespace, and I don't know
how we could squeeze RelFileNumber together with the tablespace in there;
possibly we could just use the tablespace OID there too.)

  pid  |  backend_type  | wait_event_type |  wait_event  | wait_event_arg | query
-------+----------------+-----------------+--------------+----------------+------------------------------------------------------------
 77467 | client backend | IO              | DataFileRead |              8 | SELECT abalance FROM pgbench_accounts WHERE aid = 8657837;
 77470 | client backend | IO              | DataFileRead |             11 | SELECT abalance FROM pgbench_accounts WHERE aid = 6840630;

5. (Challenging for me) MultiXact wait events - with wait_event_arg,
we could report where things are really waiting. Right now it's a
little guesswork, but with the 0002 concept:

dbmultixact=# select wait_event_type, wait_event, wait_event_arg,
count(*) from pg_stat_activity where state='active' group by
wait_event_type, wait_event, wait_event_arg order by 4 desc limit 5;
 wait_event_type |     wait_event      | wait_event_arg | count
-----------------+---------------------+----------------+-------
 LWLock          | BufferContent       |                |   365
 Lock            | tuple               |          16494 |    42
 LWLock          | MultiXactOffsetSLRU |          16494 |    13
 Lock            | transactionid       |                |    10
 LWLock          | MultiXactOffsetSLRU |                |     9

dbmultixact=# select pid, query, wait_event_type, wait_event,
wait_event_arg from pg_stat_activity where wait_event = 'MultiXactMemberSLRU';
  pid  |                               query                                | wait_event_type |     wait_event      | wait_event_arg
-------+--------------------------------------------------------------------+-----------------+---------------------+----------------
 99864 | INSERT INTO users (loc_id, fname) VALUES (2,'Testing User-2-002'); | LWLock          | MultiXactMemberSLRU |          16494

dbmultixact=# select 16494::regclass;
regclass
-----------
locations
dbmultixact=# \d users
[..]
"users_loc_id_fkey" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)

The knowledge (for the end user) of what exactly is stored in
wait_event_arg (depending on the main wait_event) would come from the
docs (probably some table). Probably each different wait_event could
be enhanced with some information.

Quick performance crosscheck of 0001 alone (/usr/pgsql19/bin/pgbench
-c 4 -P 1 -T 30 -S postgres):
master: tps = 121020.723246 (without initial connection time)
patched: tps = 121802.527000 (without initial connection time)

Q1) Because we compile without -Wconversion, I was wondering whether we
need a safe/strict uint64 struct-like type that would catch errors when
something like the uint64 return value of WaitEventExtensionNew() is
used by extensions as a uint32. (Because we do NOT have -Wtruncation
[too verbose?], any uint64 return value will be silently cast to uint32
in extensions without any warning. That may cause hangs during tests --
tests often wait for some wait event to show up, but it won't.)

Q2) 0002: Please ignore the quality of 0002; I did not want to sink more
time into the MultiXact stuff, especially if the main concept gets
shot down. The main problem is how to get the RelFileNumber of the
Relation involved in the MultiXact back into LWLockReportWaitStart().
Here I just wanted to see how much rework would be necessary (passing
variables, modifying APIs and so on) - in short: it introduces
LWLockAcquireExt(.. RelFileNumber r) with LWLockAcquire() as a fallback,
but it still gets pretty nasty quickly, sadly; lots of stuff needs to be
dumb-adjusted. I should point out that I'm a complete multixact/heapam
noob, so this is surely a very dumb way of passing that info, in way
too many places. Another thing we could do is maybe have some
"static uint32 lwlock_relation" inside lwlock and properly set it there
(and reset it) once from within heap*.c or similar; then all dependent
LWLock routines would OR it in (== so it would be visible as
wait_event_arg) and we would get the involved RelFileNumber for all
operations there (at least for LWLocks).

While thinking about cons, the only one I could come up with is that
once we expose something as 32 bits, if a following major release makes
some internal structure/data type a bit heavier, it could no longer be
exposed that way (think of e.g. 64-bit OIDs?).

Any help, opinions, ideas and code/co-authors are more than welcome.

-J.

[1] Disassembly of a stock binary taken from PGDG on Ubuntu/Debian
x86_64 (i.e. as used by real users) shows use of the 64-bit (%rax)
register, e.g. in AT&T syntax:
lea 0x6a4c1c(%rip), %rax   // load address of my_wait_event_info into %rax
mov (%rax), %rax           // dereference the pointer (result back in %rax)
mov %ebx, (%rax)           // write 32-bit %ebx through the 64-bit pointer in %rax

Or a different but very similar example in Intel syntax:
lea rax,[rip+0x865b09]
mov rax,QWORD PTR [rax]    // notice it's already RAX and a quadword load
mov DWORD PTR [rax],0x0

[2] x86_64 Linux, operations on eax vs rax; that's a ratio of 1.00646
under non-ideal conditions:
Benchmarking int32_t (32-bit) additions... Operations/Second (int32_t): 3.37e+07
Benchmarking int64_t (64-bit) additions... Operations/Second (int64_t): 3.38e+07

Attachment Content-Type Size
v3-0002-wait_event_arg-provide-quick-and-dump-demo-of-mul.patch text/x-patch 30.5 KB
v3-0001-Convert-wait_event_info-to-64-bits-expose-lower-3.patch text/x-patch 67.4 KB
