RE: Assertion failure in SnapBuildInitialSnapshot()

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Assertion failure in SnapBuildInitialSnapshot()
Date: 2025-11-21 03:47:30
Message-ID: TY4PR01MB1690722DA11C85E1686F739DF94D5A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> While testing the patches across all branches, I noticed that an additional lock
> needs to be taken in launcher.c, where
> ReplicationSlotsComputeRequiredXmin(true) was recently added for the conflict
> detection slot. I have modified the original patch accordingly.
>
> BTW, I am not adding a test using an injection point because it does not seem
> practical to insert an injection point inside
> ReplicationSlotsComputeRequiredXmin(). The reason is that the injection point
> function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in
> the patch hold an LWLock, and holding an LWLock holds off interrupts.
>
> I am sharing the patches for all branches for reference.

I have been thinking about whether there is a way to avoid holding
ReplicationSlotControlLock exclusively in ReplicationSlotsComputeRequiredXmin(),
because that could cause lock contention when many slots exist and
advancements occur frequently.

Given that the bug arises from a race condition between slot creation and
concurrent slot xmin computation, I think another approach is to acquire
ReplicationSlotControlLock exclusively only during slot creation, while doing
the initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(),
we would still hold ReplicationSlotControlLock in shared mode until the global
slot xmin has been updated in ProcArraySetReplicationSlotXmin(). This approach
prevents other backends from computing and updating new xmin horizons during
the initial slot xmin update, while still permitting concurrent calls to
ReplicationSlotsComputeRequiredXmin().

Here is an updated patch for this approach on HEAD.

Best Regards,
Hou zj

Attachment Content-Type Size
v4HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patch application/octet-stream 7.8 KB
