| From: | Ayush Tiwari <ayushtiwari(dot)slg01(at)gmail(dot)com> |
|---|---|
| To: | Radim Marek <radim(at)boringsql(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | Marko Tiikkaja <marko(at)joh(dot)to>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8 |
| Date: | 2026-05-22 16:51:32 |
| Message-ID: | CAJTYsWU6tdEvVh4YKLxz7+amZ7+Wb7_s-FBjsMMeLNj1fKeSNg@mail.gmail.com |
| Lists: | pgsql-bugs |
Hi,
On Thu, 21 May 2026 at 14:36, Radim Marek <radim(at)boringsql(dot)com> wrote:
> Although the culprit is known, I've got more data as requested.
>
> #0 0x00007f20e9bdb687 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007f20e9bdbc8c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #2 0x00007f20e9be6920 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3 0x000055a71796e3ca in PGSemaphoreLock (sema=0x7f20de6d0e38) at
> ./build/src/backend/port/pg_sema.c:327
> #4 0x000055a7179f57ed in LWLockAcquire (lock=0x7f20de6d1800,
> mode=mode(at)entry=LW_EXCLUSIVE) at
> ./build/../src/backend/storage/lmgr/lwlock.c:1314
> #5 0x000055a71772dfb2 in SimpleLruWriteAll (ctl=ctl(at)entry=0x55a717e83040
> <MultiXactOffsetCtlData>, allow_redirtied=allow_redirtied(at)entry=false) at
> ./build/../src/backend/access/transam/slru.c:1174
> #6 0x000055a717727b6f in RecordNewMultiXact (multi=79871, offset=218449,
> nmembers=2, members=members(at)entry=0x7f20de6831ec) at
> ./build/../src/backend/access/transam/multixact.c:944
> #7 0x000055a71772a983 in multixact_redo (record=0x55a73a8d0fc8) at
> ./build/../src/backend/access/transam/multixact.c:3464
> #8 0x000055a71774d9b8 in ApplyWalRecord (xlogreader=<optimized out>,
> record=0x7f20de6831b0, replayTLI=<synthetic pointer>) at
> ./build/../src/backend/access/transam/xlogrecovery.c:1951
> #9 PerformWalRecovery () at
> ./build/../src/backend/access/transam/xlogrecovery.c:1782
> #10 0x000055a717740def in StartupXLOG () at
> ./build/../src/backend/access/transam/xlog.c:5452
> #11 0x000055a71797c7e4 in StartupProcessMain () at
> ./build/../src/backend/postmaster/startup.c:282
> #12 0x000055a717972b20 in AuxiliaryProcessMain (auxtype=auxtype(at)entry=StartupProcess)
> at ./build/../src/backend/postmaster/auxprocess.c:141
> #13 0x000055a717977db3 in StartChildProcess (type=StartupProcess) at
> ./build/../src/backend/postmaster/postmaster.c:5381
> #14 0x000055a71797bfb8 in PostmasterMain (argc=argc(at)entry=1,
> argv=argv(at)entry=0x55a73a8d0590) at
> ./build/../src/backend/postmaster/postmaster.c:1463
> #15 0x000055a7176a05bc in main (argc=1, argv=0x55a73a8d0590) at
> ./build/../src/backend/main/main.c:200
>
> and WAL dump
>
> rmgr: Btree len (rec/tot): 64/ 64, tx: 336098, lsn:
> 1/32DE75F0, prev 1/32DE7580, desc: INSERT_LEAF off: 244, blkref #0: rel
> 1663/16384/16432 blk 536
> rmgr: MultiXact len (rec/tot): 54/ 54, tx: 336098, lsn:
> 1/32DE7630, prev 1/32DE75F0, desc: CREATE_ID 79871 offset 218449 nmembers
> 2: 336089 (keysh)
> 336098 (keysh)
> rmgr: Heap len (rec/tot): 54/ 54, tx: 336098, lsn:
> 1/32DE7668, prev 1/32DE7630, desc: LOCK xmax: 79871, off: 1, infobits:
> [IS_MULTI, LOCK_ONLY,
> KEYSHR_LOCK], flags: 0x00, blkref #0: rel 1663/16384/16418 blk 0
> rmgr: Heap len (rec/tot): 72/ 72, tx: 336096, lsn:
> 1/32DE76A0, prev 1/32DE7668, desc: HOT_UPDATE old_xmax: 336096, old_off:
> 52, old_infobits: [],
> flags: 0x20, new_xmax: 0, new_off: 149, blkref #0: rel 1663/16384/16401
> blk 22
> rmgr: Heap len (rec/tot): 71/ 71, tx: 336096, lsn:
> 1/32DE76E8, prev 1/32DE76A0, desc: HOT_UPDATE old_xmax: 336096, old_off:
> 149, old_infobits: [],
> flags: 0x60, new_xmax: 0, new_off: 209, blkref #0: rel 1663/16384/16399
> blk 6
> rmgr: Heap len (rec/tot): 79/ 79, tx: 336096, lsn:
> 1/32DE7730, prev 1/32DE76E8, desc: INSERT off: 150, flags: 0x00, blkref #0:
> rel 1663/16384/16417
> blk 741
> rmgr: Heap len (rec/tot): 72/ 72, tx: 336097, lsn:
> 1/32DE7780, prev 1/32DE7730, desc: HOT_UPDATE old_xmax: 336097, old_off:
> 243, old_infobits: [],
> flags: 0x20, new_xmax: 0, new_off: 228, blkref #0: rel 1663/16384/16401
> blk 26
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 336096, lsn:
> 1/32DE77C8, prev 1/32DE7780, desc: COMMIT 2026-05-21 08:43:07.003572 UTC
>
> Radim
>
Thanks for the additional backtrace and WAL dump. That makes the failure
mode much clearer.
The latest trace shows the startup process here:
SimpleLruWriteAll(MultiXactOffsetCtl, false)
RecordNewMultiXact(multi=79871, offset=218449, nmembers=2, ...)
multixact_redo()
The WAL dump also shows the matching record:
rmgr: MultiXact ... desc: CREATE_ID 79871 offset 218449 nmembers 2
79871 is the last multixact on its offsets page, so replaying that record
enters the next_pageno != pageno compatibility path added by 77dff5d937b.
On REL_14 through REL_16, RecordNewMultiXact() already holds
MultiXactOffsetSLRULock while executing that code. SimpleLruWriteAll() then
tries to acquire MultiXactOffsetCtl's SLRU control lock, which is the same
MultiXactOffsetSLRULock on those branches. That explains the standby startup
process waiting forever on LWLock:MultiXactOffsetSLRU, with no corresponding
SLRU I/O activity.
I think the right fix is to remove that SimpleLruWriteAll() call while
keeping the missing-page initialization logic. The flush is only meant to
make SimpleLruDoesPhysicalPageExist() see pages that exist in SLRU buffers
but have not reached disk. In this fallback path, I don't see a way for
the tested next_pageno to be in that state: if RecordNewMultiXact() itself
initializes the page, it writes it synchronously with SimpleLruWritePage()
before setting last_initialized_offsets_page.
I attached a small patch for REL_16_STABLE. The same self-deadlock pattern
is also present on PG 14 and 15. PG 17 and 18 have the same compatibility
call, but SLRU locking is banked there, and RecordNewMultiXact() does not
appear to hold the relevant bank lock before calling SimpleLruWriteAll().
So I would not describe those branches as having this exact self-deadlock,
though that needs more analysis.
I have added both Andrey and Heikki to the To: line, since I'm not sure
whether this is more severe than the multixact offset issue we had with
16.12 or on par with it.
Regards,
Ayush
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0001-Avoid-self-deadlock-on-MultiXactOffsetSLRULock-dur.patch | application/octet-stream | 2.5 KB |