RE: Assertion failure in SnapBuildInitialSnapshot()

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Assertion failure in SnapBuildInitialSnapshot()
Date: 2025-11-13 04:56:01
Message-ID: TY4PR01MB1690784F8EFFDC149FE67FAF594CDA@TY4PR01MB16907.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wednesday, November 12, 2025 7:27 PM Pradeep Kumar <spradeepkumar29(at)gmail(dot)com> wote:
> I've been investigating the assert failure in
> ProcArraySetReplicationSlotXmin() and would like to share my approach and get
> feedback. Instead of inverting the locks and what robert shared before [1].
> Instead of unconditionally updating procArray->replication_slot_xmin in
> ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates
> conditional:
> 1) Only update if the incoming xmin is valid
> 2) Only update if it's older than the currently stored xmin
> 3) Do the same for procArray->replication_slot_catalog_xmin
...
> In above block of code ensures we always track the minimum xmin across all
> active replication slots without losing data. And also no need to worry about
> locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot()
> while we computing safexid by calling GetOldestSafeDecodingTransactionId(false)
> will enters into first case and update the oldestSafeXid =
> procArray->replication_slot_xmin. So it won't return nextXid. And also it solves
> this issue [2].

Thanks for evaluating new approach, but I think this approach could not work
because we expect replication_slot_xmin to be set to an invalid number when the
last slot is dropped, while this approach would disallow that, causing WALs to
be retained. For a detailed explanation, please refer to [1].

While testing the patches across all branches, I noticed that an additional lock
needs to be added in the launcher.c where
ReplicationSlotsComputeRequiredXmin(true) was recently added for conflict
detection slot. I have modified the original patch accordingly.

BTW, I am not adding a test using an injection point because it does not seem
practical to insert an injection point inner
ReplicationSlotsComputeRequiredXmin. The reason is that the injection point
function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in the
patch holds the lwlock, holding holds interrupts.

I am sharing the patches for all branches for reference.

[1] https://www.postgresql.org/message-id/TY4PR01MB169070EE618FA2908B3D2F2AE94C3A%40TY4PR01MB16907.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Attachment Content-Type Size
v3HEAD-0001-Fix-a-race-condition-of-updating-procArray-replic.patch application/octet-stream 7.3 KB
v3PG18-17-0001-Fix-a-race-condition-of-updating-procArray-re.patch application/octet-stream 6.5 KB
v3PG16-13-0001-Fix-a-race-condition-of-updating-procArray-re.patch application/octet-stream 5.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Chao Li 2025-11-13 05:14:37 Re: Suggestion to add --continue-client-on-abort option to pgbench
Previous Message Shinya Kato 2025-11-13 04:49:40 Re: Add mode column to pg_stat_progress_vacuum