64-bit wait_event and introduction of 32-bit wait_event_arg

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: 64-bit wait_event and introduction of 32-bit wait_event_arg
Date: 2025-12-08 09:54:41
Message-ID: CAKZiRmyKcTaeSGzMYDN6aRR-BwYGPeZbzDRKvGkJhxAghfb4LQ@mail.gmail.com
Lists: pgsql-hackers

Hi all,

We were debating internally whether a transition to a 64-bit wait_event
would be an acceptable idea (Robert's primary concern is that it may
carry too little information), but I had code to demo this, so let's just
discuss it further. After ensuring that 64-bit integer math has the same
performance characteristics as 32-bit, at least on x86_64 [1][2], I
converted our wait_event_info (32-bit today) to 64 bits while trying
to use pg atomics, then used some bit-masking voodoo and got the lower
32 bits exposed as a new wait_event_arg, with some dumb demos. The idea is
to encode some specific (limited, but useful!) information into the
wait event variable itself, so that we gain an additional 32 bits of
space for details alongside the wait event itself, to help assess
wait-event-related problems. This seems to come without any performance
impact, at least on reasonable platforms used in production today
(those with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, that is). Intended use
pattern: if I were chasing a certain specific wait_event-related problem,
I could extract certain info straight from wait_event_arg, which is much
easier than drilling into other, more advanced views (if that information
is exposed at all; often it's not).

Q0) Key question: does that sound like a good idea to pursue further
or are there any blockers to it?

Sample demos are included in the patch; depending on the specific
wait_event, wait_event_arg could be:

1. PgSleep could show the time since it was launched (the simplest thing
one can imagine; or maybe we could think about showing time left too?):

  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | query
-------+----------------+-----------------+------------+----------------+-------
 78317 | client backend | Timeout         | PgSleep    |             10 | select 'imagine complex stuff here, dozens of kB of SQL text, queries, procedures, functions' as s, pg_sleep(10) as embedded_internally;

2. Passing the exact relation OID we are waiting on (here pid
82242 was doing "alter table p3 add ...", but it's waiting for the
backend that executed "lock table p3 in exclusive mode;"). We can
decode wait_event_arg right into the relation (p3):

postgres=# select pid, backend_type, wait_event_type, wait_event,
wait_event_arg, wait_event_arg::regclass, query from pg_stat_activity
where state = 'active' and (wait_event_type, wait_event) = ('Lock', 'relation');
  pid  |  backend_type  | wait_event_type | wait_event | wait_event_arg | wait_event_arg |             query
-------+----------------+-----------------+------------+----------------+----------------+--------------------------------
 82242 | client backend | Lock            | relation   |          16467 | p3             | alter table p3 add id3 bigint;

3. IPC/SyncRep (SyncRepWaitForLSN()) could report the PID of the slowest
walsender. This is useful when multiple walsenders are involved, to
pinpoint where you might be slow/stuck:

  pid   | application_name | wait_event_type |  wait_event   | wait_event_arg |                    q
--------+------------------+-----------------+---------------+----------------+------------------------------------------
 120318 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120319 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 120320 | pgbench          | IPC             | SyncRep       |         120248 | INSERT INTO child (parent_id, payload)
 120321 | pgbench          | IPC             | SyncRep       |         119689 | INSERT INTO child (parent_id, payload)
 119689 | walreceiver2     | Activity        | WalSenderMain |                | START_REPLICATION 0/DC000000 TIMELINE 1
 120248 | walreceiver      | Activity        | WalSenderMain |                | START_REPLICATION 0/E2000000 TIMELINE 1

(you would then basically query pg_stat_replication for pid = 119689,
as it seems to be the slowest one here)

4. DataFileRead could report the fd (yes, it can differ from backend to
backend due to the fd cache, but it's a demo; it would probably be better
with the oid/RelFileNumber, but that is not fast to do :) And although
we already have dboid, there is also the tablespace, and I don't know
how we could squeeze RelFileNumber together with the tablespace in there;
possibly we could just use the tablespace OID there too.)

  pid  |  backend_type  | wait_event_type |  wait_event  | wait_event_arg | query
-------+----------------+-----------------+--------------+----------------+------------------------------------------------------------
 77467 | client backend | IO              | DataFileRead |              8 | SELECT abalance FROM pgbench_accounts WHERE aid = 8657837;
 77470 | client backend | IO              | DataFileRead |             11 | SELECT abalance FROM pgbench_accounts WHERE aid = 6840630;

5. (Challenging for me) MultiXact wait events - with wait_event_arg,
we could report where things are really waiting. Right now it's a
little guesswork, but with the 0002 concept:

dbmultixact=# select wait_event_type, wait_event, wait_event_arg,
count(*) from pg_stat_activity where state='active' group by
wait_event_type, wait_event, wait_event_arg order by 4 desc limit 5;
 wait_event_type |     wait_event      | wait_event_arg | count
-----------------+---------------------+----------------+-------
 LWLock          | BufferContent       |                |   365
 Lock            | tuple               |          16494 |    42
 LWLock          | MultiXactOffsetSLRU |          16494 |    13
 Lock            | transactionid       |                |    10
 LWLock          | MultiXactOffsetSLRU |                |     9

dbmultixact=# select pid, query, wait_event_type, wait_event,
wait_event_arg from pg_stat_activity where wait_event = 'MultiXactMemberSLRU';
  pid  |                               query                                | wait_event_type |     wait_event      | wait_event_arg
-------+--------------------------------------------------------------------+-----------------+---------------------+----------------
 99864 | INSERT INTO users (loc_id, fname) VALUES (2,'Testing User-2-002'); | LWLock          | MultiXactMemberSLRU |          16494

dbmultixact=# select 16494::regclass;
regclass
-----------
locations
dbmultixact=# \d users
[..]
"users_loc_id_fkey" FOREIGN KEY (loc_id) REFERENCES locations(loc_id)

The knowledge (for the end user) of what exactly is stored in
wait_event_arg (depending on the main wait_event) would come from the
docs (probably some table). Probably each different wait_event could
be enhanced with some information.

Quick performance crosscheck of 0001 alone (/usr/pgsql19/bin/pgbench
-c 4 -P 1 -T 30 -S postgres):
master: tps = 121020.723246 (without initial connection time)
patched: tps = 121802.527000 (without initial connection time)

Q1) Because we compile without -Wconversion, I was wondering whether we
need a safe/strict uint64 struct-like type that would catch errors when
something like the uint64 return value of WaitEventExtensionNew() is
used by extensions as a uint32. (Because we do NOT have -Wtruncation
[too verbose?], any uint64 return value will be silently cast to uint32
in extensions without any warning. That may cause hangs during tests --
tests often wait for some wait event to show up, but it won't.)

Q2) 0002: Please ignore the quality of 0002; I did not want to sink more
time into the MultiXact stuff, especially if the main concept gets
shot down. The main problem is how to get the RelFileNumber of the
Relation involved in the MultiXact back into LWLockReportWaitStart().
Here I just wanted to see how much rework would be necessary (passing
variables, modifying APIs and so on) - in short: it introduces
LWLockAcquireExt(.. RelFileNumber r) with LWLockAcquire() as a fallback,
but it still gets pretty nasty quickly, sadly; lots of stuff needs to be
dumb-adjusted. I should point out that I'm a complete multixact/heapam
noob, so this is surely a very dumb way of passing that info, in way
too many places. Another thing we could do is maybe have some
"static uint32 lwlock_relation" inside lwlock and properly set it there
(and reset it) once from within heap*.c or similar; then all dependent
LWLock routines would OR it in (== so it would be visible as
wait_event_arg) and we would get the involved RelFileNumber for all
operations there (at least for LWLocks).

While thinking about cons, the only one I could come up with is that
once we expose something as 32 bits, if a following major release makes
some internal structure/data type a bit heavier, it could no longer be
exposed that way (think of e.g. 64-bit OIDs?).

Any help, opinions, ideas and code/co-authors are more than welcome.

-J.

[1] Disassembly of a stock binary taken from PGDG on Ubuntu/Debian
x86_64 (i.e. as used by real users) shows use of the 64-bit (%rax)
register, e.g. in AT&T syntax:
lea 0x6a4c1c(%rip), %rax   // load address of my_wait_event_info into %rax
mov (%rax), %rax           // dereference the pointer (result back in %rax)
mov %ebx, (%rax)           // write 32-bit %ebx through the 64-bit pointer in %rax

Or a different but very similar example in Intel syntax:
lea rax,[rip+0x865b09]
mov rax,QWORD PTR [rax]    // notice it's already RAX and a quadword load
mov DWORD PTR [rax],0x0

[2] x86_64 Linux, operations on eax vs rax; that's a ratio of 1.00646
under non-ideal conditions:
Benchmarking int32_t (32-bit) additions... Operations/Second (int32_t): 3.37e+07
Benchmarking int64_t (64-bit) additions... Operations/Second (int64_t): 3.38e+07

Attachment Content-Type Size
v3-0002-wait_event_arg-provide-quick-and-dump-demo-of-mul.patch text/x-patch 30.5 KB
v3-0001-Convert-wait_event_info-to-64-bits-expose-lower-3.patch text/x-patch 67.4 KB
