| From: | Radim Marek <radim(at)boringsql(dot)com> |
|---|---|
| To: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
| Cc: | Marko Tiikkaja <marko(at)joh(dot)to>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8 |
| Date: | 2026-05-21 09:06:18 |
| Message-ID: | CAJgoLkKCu0wCwPQZSo5no=XATU-4LMK4QfKBwV928o2uKcxe=g@mail.gmail.com |
| Lists: | pgsql-bugs |
Although the culprit is known, here is the additional data that was requested. Backtrace of the startup process:
#0 0x00007f20e9bdb687 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f20e9bdbc8c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f20e9be6920 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x000055a71796e3ca in PGSemaphoreLock (sema=0x7f20de6d0e38) at ./build/src/backend/port/pg_sema.c:327
#4 0x000055a7179f57ed in LWLockAcquire (lock=0x7f20de6d1800, mode=mode@entry=LW_EXCLUSIVE) at ./build/../src/backend/storage/lmgr/lwlock.c:1314
#5 0x000055a71772dfb2 in SimpleLruWriteAll (ctl=ctl@entry=0x55a717e83040 <MultiXactOffsetCtlData>, allow_redirtied=allow_redirtied@entry=false) at ./build/../src/backend/access/transam/slru.c:1174
#6 0x000055a717727b6f in RecordNewMultiXact (multi=79871, offset=218449, nmembers=2, members=members@entry=0x7f20de6831ec) at ./build/../src/backend/access/transam/multixact.c:944
#7 0x000055a71772a983 in multixact_redo (record=0x55a73a8d0fc8) at ./build/../src/backend/access/transam/multixact.c:3464
#8 0x000055a71774d9b8 in ApplyWalRecord (xlogreader=<optimized out>, record=0x7f20de6831b0, replayTLI=<synthetic pointer>) at ./build/../src/backend/access/transam/xlogrecovery.c:1951
#9 PerformWalRecovery () at ./build/../src/backend/access/transam/xlogrecovery.c:1782
#10 0x000055a717740def in StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5452
#11 0x000055a71797c7e4 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:282
#12 0x000055a717972b20 in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at ./build/../src/backend/postmaster/auxprocess.c:141
#13 0x000055a717977db3 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5381
#14 0x000055a71797bfb8 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x55a73a8d0590) at ./build/../src/backend/postmaster/postmaster.c:1463
#15 0x000055a7176a05bc in main (argc=1, argv=0x55a73a8d0590) at ./build/../src/backend/main/main.c:200
and the WAL dump:
rmgr: Btree len (rec/tot): 64/ 64, tx: 336098, lsn: 1/32DE75F0, prev 1/32DE7580, desc: INSERT_LEAF off: 244, blkref #0: rel 1663/16384/16432 blk 536
rmgr: MultiXact len (rec/tot): 54/ 54, tx: 336098, lsn: 1/32DE7630, prev 1/32DE75F0, desc: CREATE_ID 79871 offset 218449 nmembers 2: 336089 (keysh) 336098 (keysh)
rmgr: Heap len (rec/tot): 54/ 54, tx: 336098, lsn: 1/32DE7668, prev 1/32DE7630, desc: LOCK xmax: 79871, off: 1, infobits: [IS_MULTI, LOCK_ONLY, KEYSHR_LOCK], flags: 0x00, blkref #0: rel 1663/16384/16418 blk 0
rmgr: Heap len (rec/tot): 72/ 72, tx: 336096, lsn: 1/32DE76A0, prev 1/32DE7668, desc: HOT_UPDATE old_xmax: 336096, old_off: 52, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 149, blkref #0: rel 1663/16384/16401 blk 22
rmgr: Heap len (rec/tot): 71/ 71, tx: 336096, lsn: 1/32DE76E8, prev 1/32DE76A0, desc: HOT_UPDATE old_xmax: 336096, old_off: 149, old_infobits: [], flags: 0x60, new_xmax: 0, new_off: 209, blkref #0: rel 1663/16384/16399 blk 6
rmgr: Heap len (rec/tot): 79/ 79, tx: 336096, lsn: 1/32DE7730, prev 1/32DE76E8, desc: INSERT off: 150, flags: 0x00, blkref #0: rel 1663/16384/16417 blk 741
rmgr: Heap len (rec/tot): 72/ 72, tx: 336097, lsn: 1/32DE7780, prev 1/32DE7730, desc: HOT_UPDATE old_xmax: 336097, old_off: 243, old_infobits: [], flags: 0x20, new_xmax: 0, new_off: 228, blkref #0: rel 1663/16384/16401 blk 26
rmgr: Transaction len (rec/tot): 34/ 34, tx: 336096, lsn: 1/32DE77C8, prev 1/32DE7780, desc: COMMIT 2026-05-21 08:43:07.003572 UTC
Radim
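
One detail visible in the backtrace and WAL dump above: the multixact being replayed, 79871, is the last entry on its offsets SLRU page, so the next multixact ID falls on the following page. A minimal sketch of that arithmetic, assuming a default build (BLCKSZ of 8192 and a 4-byte MultiXactOffset, hence 2048 entries per page). The function names are hypothetical helpers, not PostgreSQL APIs, and whether this page boundary is what triggers the hang is not established here.

```python
import re

BLCKSZ = 8192               # assumption: default block size
MULTIXACT_OFFSET_SIZE = 4   # assumption: 4-byte MultiXactOffset, as in PostgreSQL 16
OFFSETS_PER_PAGE = BLCKSZ // MULTIXACT_OFFSET_SIZE  # 2048 entries per page


def parse_create_id(desc):
    """Extract (multi, offset, nmembers) from a pg_waldump CREATE_ID description."""
    m = re.search(r"CREATE_ID (\d+) offset (\d+) nmembers (\d+)", desc)
    if m is None:
        raise ValueError("not a MultiXact CREATE_ID description")
    return tuple(int(g) for g in m.groups())


def page_and_entry(multi):
    """Return (page number, entry index within that page) for a multixact ID."""
    return divmod(multi, OFFSETS_PER_PAGE)


def is_last_on_page(multi):
    """True when the multixact occupies the final slot of its offsets page."""
    return multi % OFFSETS_PER_PAGE == OFFSETS_PER_PAGE - 1


desc = "CREATE_ID 79871 offset 218449 nmembers 2: 336089 (keysh) 336098 (keysh)"
multi, offset, nmembers = parse_create_id(desc)
print(page_and_entry(multi))      # (38, 2047)
print(page_and_entry(multi + 1))  # (39, 0)
print(is_last_on_page(multi))     # True
```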
On Thu, 21 May 2026 at 10:34, Radim Marek <radim(at)boringsql(dot)com> wrote:
> Thank you for the follow-up. In the meantime, I can confirm that
> commit 77dff5d937b1 might be the source of the originally reported issue.
>
> Unfortunately, pinning the version down to 16.12 only avoids the
> MultiXactOffsetSLRU self-deadlock; the standby then fails recovery
> after 12+ hours.
>
> FATAL: could not access status of transaction 24958976
> DETAIL: Could not read from file "pg_multixact/offsets/017C" at offset 221184: read too few bytes.
> CONTEXT: WAL redo at 14770/873268E8 for MultiXact/CREATE_ID: 24958975 offset 61500431 nmembers 2: 3058927188 (fornokeyupd) 3058927189 (keysh)
>
> We are going to pin 16.13 and try that until we can safely upgrade
> the primary or are confident we have working PITR recovery available should
> we need it.
>
> Radim
>
> PS: Once I have some time, I will try to set up a Docker-based harness to
> reproduce the original problem for later testing of the fix.
>
> On Thu, 21 May 2026 at 09:25, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
>>
>>
>> > On 21 May 2026, at 00:12, Marko Tiikkaja <marko(at)joh(dot)to> wrote:
>> >
>> > #8 0x0000654c8ae2acba in SimpleLruWriteAll (ctl=0x654c8b63e400
>>
>> Thanks!
>>
>> This clearly points to SimpleLruWriteAll() added in 77dff5d937b1.
>> If by chance you have a backtrace of another deadlocking process,
>> please post it.
>>
>> But it's not strictly necessary for analysis; I think we can figure out
>> what happened from the backtrace you already posted.
>>
>>
>> Best regards, Andrey Borodin.
>>
>
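
The FATAL message quoted above can be cross-checked against the layout of the offsets SLRU. A minimal sketch, assuming a default build (BLCKSZ of 8192, 4-byte offsets entries, 32 pages per SLRU segment) and the four-hex-digit segment names seen in the error message. `offsets_location` is a hypothetical helper, not a PostgreSQL function.

```python
BLCKSZ = 8192            # assumption: default block size
ENTRY_SIZE = 4           # assumption: 4-byte MultiXactOffset, as in PostgreSQL 16
PAGES_PER_SEGMENT = 32   # assumption: default SLRU pages per segment file
ENTRIES_PER_PAGE = BLCKSZ // ENTRY_SIZE


def offsets_location(multi):
    """Map a multixact ID to its offsets segment file, page byte offset and slot."""
    page, entry = divmod(multi, ENTRIES_PER_PAGE)
    segment, page_in_segment = divmod(page, PAGES_PER_SEGMENT)
    return {
        "file": "pg_multixact/offsets/%04X" % segment,
        "page_offset": page_in_segment * BLCKSZ,  # byte offset of the page in the file
        "entry": entry,                           # slot index within the page
    }


# Multixact named in the FATAL line
print(offsets_location(24958976))
# {'file': 'pg_multixact/offsets/017C', 'page_offset': 221184, 'entry': 0}

# Multixact created by the WAL record in the CONTEXT line
print(offsets_location(24958975))
# {'file': 'pg_multixact/offsets/017C', 'page_offset': 212992, 'entry': 2047}
```

Under these assumptions, the file and offset in the DETAIL line correspond to the first entry of the page immediately after the one holding multixact 24958975, the ID created by the record being replayed.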