From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Dmitry <dsy(dot)075(at)yandex(dot)ru>, alvherre(at)kurilemu(dot)de |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: IPC/MultixactCreation on the Standby server |
Date: | 2025-06-30 10:58:44 |
Message-ID: | 3ECF4E63-1DB0-442B-B15A-78261FCA1869@yandex-team.ru |
Lists: | pgsql-hackers |
> On 28 Jun 2025, at 21:24, Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> This seems to be fixing issue for me.
ISTM I was wrong: there is a possible recovery conflict with snapshot.
REDO:
frame #2: 0x000000010179a0c8 postgres`pg_usleep(microsec=1000000) at pgsleep.c:50:10
frame #3: 0x000000010144c108 postgres`WaitExceedsMaxStandbyDelay(wait_event_info=134217772) at standby.c:248:2
frame #4: 0x000000010144a63c postgres`ResolveRecoveryConflictWithVirtualXIDs(waitlist=0x0000000126008200, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=134217772, report_waiting=true) at standby.c:384:8
frame #5: 0x000000010144a4f4 postgres`ResolveRecoveryConflictWithSnapshot(snapshotConflictHorizon=1214, isCatalogRel=false, locator=(spcOid = 1663, dbOid = 5, relNumber = 16384)) at standby.c:490:2
frame #6: 0x0000000100e4d3f8 postgres`heap_xlog_prune_freeze(record=0x0000000135808e60) at heapam.c:9208:4
frame #7: 0x0000000100e4d204 postgres`heap2_redo(record=0x0000000135808e60) at heapam.c:10353:4
frame #8: 0x0000000100f1548c postgres`ApplyWalRecord(xlogreader=0x0000000135808e60, record=0x0000000138058060, replayTLI=0x000000016f0425b0) at xlogrecovery.c:1991:2
frame #9: 0x0000000100f13ff0 postgres`PerformWalRecovery at xlogrecovery.c:1822:4
frame #10: 0x0000000100ef7940 postgres`StartupXLOG at xlog.c:5821:3
frame #11: 0x0000000101364334 postgres`StartupProcessMain(startup_data=0x0000000000000000, startup_data_len=0) at startup.c:258:2
SELECT:
frame #10: 0x0000000102a14684 postgres`GetMultiXactIdMembers(multi=278, members=0x000000016d4f9498, from_pgupgrade=false, isLockOnly=false) at multixact.c:1493:6
frame #11: 0x0000000102991814 postgres`MultiXactIdGetUpdateXid(xmax=278, t_infomask=4416) at heapam.c:7478:13
frame #12: 0x0000000102985450 postgres`HeapTupleGetUpdateXid(tuple=0x00000001043e5c60) at heapam.c:7519:9
frame #13: 0x00000001029a0360 postgres`HeapTupleSatisfiesMVCC(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1090:10
frame #14: 0x000000010299fbc8 postgres`HeapTupleSatisfiesVisibility(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1772:11
frame #15: 0x0000000102982954 postgres`page_collect_tuples(scan=0x000000014b009648, snapshot=0x000000015b07b930, page="", buffer=69, block=6, lines=228, all_visible=false, check_serializable=false) at heapam.c:480:12
page_collect_tuples() holds a lock on the buffer while examining tuple visibility, so InterruptHoldoffCount > 0 there. But the visibility check may need WAL replay to advance: we have to wait until the next multixact's offset is filled in.
Replaying that WAL, in turn, might need the buffer lock, or hit a snapshot conflict with the caller of page_collect_tuples().
Please find attached a dirty test; it reproduces the problem on my machine (a startup deadlock, so a reproduction takes 180s, while a normal pass takes 10s).
Also, there is a fix: checking for recovery conflicts when falling back to the case 2 multixact read.
I do not feel comfortable with using interrupts while InterruptHoldoffCount > 0, so I need help from someone more knowledgeable about our interrupts machinery to tell me whether what I'm proposing is OK. (Álvaro?)
Also, I've modified the code to make the race condition more reproducible:
	multi = GetNewMultiXactId(nmembers, &offset);
	/* random sleep so the WAL insertion order can differ from the
	 * order in which multixacts are used on pages */
	if (rand() % 2 == 0)
		pg_usleep(1000);
	(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
Perhaps I can build a fast injection-points test if we want it.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
v5-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch | application/octet-stream | 8.4 KB |