[BUG] FailedAssertion in SnapBuildPurgeOlderTxn

From: Maxim Orlov <orlovmg(at)gmail(dot)com>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Date: 2022-11-21 12:47:12
Message-ID: CACG=ezZoz_KG+Ryh9MrU_g5e0HiVoHocEvqFF=NRrhrwKmEQJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

PREAMBLE

For a last couple of months, I stumbled into a problem while running tests
on ARM (Debain, aarch64) and some more wired platforms
for my 64–bit XIDs patch set. Test contrib/test_decoding
(catalog_change_snapshot) rarely failed with the next message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) &&
TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I have plenty of failing on ARM, couple on x86 and none (if memory serves)
on x86–64.

At first, my thought was to blame my 64–bit XID patch for what, but this is
not the case. This error persist from PG15 to PG10
without any patch applied. Though hard to reproduce.

PROBLEM

After some investigation, I think, the problem is in the snapbuild.c
(commit 272248a0c1b1, see [0]). We do allocate InitialRunningXacts
array in the context of builder->context, but for the time when we call
SnapBuildPurgeOlderTxn this context may be already free'd. Thus,
InitialRunningXacts array become array of 2139062143 (i.e. 0x7F7F7F7F)
values. This is not caused buildfarm to fail due to that code:

if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
builder->xmin))
return;

Since the cluster is initialised with XID way less than 0x7F7F7F7F, we get
to return here, but the problem is still existing.
I've attached the patch based on branch REL_15_STABLE to reproduce the
problem on x86-64.

On my patch set of 64–bit XID's this problem manifested since I do init
cluster with XID far beyond 32–bit bound.

Alternatively, I did try to use my patch [1] to init cluster with first
transaction 2139062143 (i.e. 0x7F7F7F7F).
Then put pg_sleep call just like in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
if (NInitialRunningXacts == 0)
return;

+ pg_usleep(1000000L * 2L);
+
/* bound check if there is at least one transaction to remove */
if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],

builder->xmin))

Run installcheck-force for many times for a test_decoding/
catalog_change_snapshot's and got a segfault.

CONCLUSION

In snapbuild.c, context allocated array InitialRunningXacts may be free'd,
this caused assertion failed (at best) or
may lead to the more serious problems.

P.S.

Simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr
lsn, xl_running_xacts *runn
* changes. See SnapBuildXidSetCatalogChanges.
*/
NInitialRunningXacts = nxacts;
- InitialRunningXacts = MemoryContextAlloc(builder->context,
sz);
+ InitialRunningXacts = MemoryContextAlloc(TopMemoryContext,
sz);
memcpy(InitialRunningXacts, running->xids, sz);
qsort(InitialRunningXacts, nxacts, sizeof(TransactionId),
xidComparator);

seems to solve the described problem, but I'm not in the context of [0] and
why array is allocated in builder->context.

[0] https://postgr.es/m/81D0D8B0-E7C4-4999-B616-1E5004DBDCD2%40amazon.com
[1]
https://www.postgresql.org/message-id/flat/CACG=ezaa4vqYjJ16yoxgrpa-=gXnf0Vv3Ey9bjGrRRFN2YyWFQ(at)mail(dot)gmail(dot)com

--
Best regards,
Maxim Orlov.

Attachment Content-Type Size
0001-catalog_change_snapshot-fail.patch.txt text/plain 32.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Maxim Orlov 2022-11-21 13:08:05 Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Previous Message Amit Langote 2022-11-21 12:46:03 Re: ExecRTCheckPerms() and many prunable partitions