Re: Assertion failure in SnapBuildInitialSnapshot()

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Pradeep Kumar <spradeepkumar29(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2025-11-24 19:30:25
Message-ID: CAD21AoAKED+XSZA187x-uVv=PSM4-0b2R-zgNBMSw2tj9LEkZA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > >
> > > I have been thinking if there a way to avoid holding ReplicationSlotControlLock
> > > exclusively in ReplicationSlotsComputeRequiredXmin() because that could cause
> > > lock contention when many slots exist and advancements occur frequently.
> > >
> > > Given that the bug arises from a race condition between slot creation and
> > > concurrent slot xmin computation, I think another way is that, we acquire the
> > > ReplicationSlotControlLock exclusively only during slot creation to do the
> > > initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we
> > > still hold the ReplicationSlotControlLock in shared mode until the global slot
> > > xmin is updated in ProcArraySetReplicationSlotXmin(). This approach prevents
> > > concurrent computations and updates of new xmin horizons by other backends
> > > during the initial slot xmin update process, while it still permits concurrent
> > > calls to ReplicationSlotsComputeRequiredXmin().
> > >
> >
> > Yeah, this seems to work.
>
> +1

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen,
in which procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin move backward to an older
but still valid XID?

1. Suppose the initial value of procArray->replication_slot_catalog_xmin is 50.
2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in shared mode in
ReplicationSlotsComputeRequiredXmin(). But it doesn't update the
procArray's catalog_xmin value yet.
3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates the procArray->replication_slot_catalog_xmin to
100, overwriting the 150 that Process-B stored.
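
To make the ordering concrete, here is a minimal standalone sketch that
replays steps 1-5 sequentially. It is not PostgreSQL code: the
TransactionId typedef, publish_catalog_xmin(), and
replay_lost_update() are hypothetical stand-ins, and the computed
values are simply taken from the scenario rather than derived from
actual slot contents.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's TransactionId (assumption). */
typedef uint32_t TransactionId;

/*
 * Hypothetical stand-in for the store into
 * procArray->replication_slot_catalog_xmin: an unconditional overwrite.
 */
static void
publish_catalog_xmin(TransactionId *shared, TransactionId computed)
{
    *shared = computed;
}

/*
 * Replays the interleaving from steps 1-5 in a single thread and
 * returns the final shared value.
 */
static TransactionId
replay_lost_update(void)
{
    TransactionId shared = 50;          /* step 1: initial value */
    TransactionId computed_a = 100;     /* step 2: A computes, stores nothing yet */
    TransactionId computed_b = 150;     /* step 3: B computes */

    publish_catalog_xmin(&shared, computed_b);  /* step 4: shared becomes 150 */
    publish_catalog_xmin(&shared, computed_a);  /* step 5: shared goes back to 100 */

    return shared;
}
```

Calling replay_lost_update() returns 100, that is, the shared value
ends up behind the 150 that was stored in step 4.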

It might be worth adding an assertion to
ProcArraySetReplicationSlotXmin() that checks that each of the new
xmin and catalog_xmin values is either >= the current value or
InvalidTransactionId.
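
A rough sketch of the check I have in mind, written as standalone C so
it can be tried outside the tree. The typedef, the
xid_follows_or_equals() helper, and xmin_update_is_valid() are
hypothetical stand-ins; in the backend the comparison would presumably
go through TransactionIdFollowsOrEquals() instead. Treating an invalid
current value as "anything may follow" is my assumption, not something
settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the backend definitions (assumptions). */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Wraparound-aware "a >= b" for XIDs, using modulo-2^32 arithmetic. */
static bool
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Returns true if replacing 'current' with 'proposed' is acceptable:
 * the proposed value is invalid (no slot needs a horizon), the current
 * value is invalid (nothing to retreat from), or the proposed value
 * does not move backward.
 */
static bool
xmin_update_is_valid(TransactionId current, TransactionId proposed)
{
    if (proposed == InvalidTransactionId)
        return true;
    if (current == InvalidTransactionId)
        return true;
    return xid_follows_or_equals(proposed, current);
}

/* Hypothetical setter showing where the assertion would sit. */
static void
set_slot_xmin_checked(TransactionId *shared, TransactionId proposed)
{
    assert(xmin_update_is_valid(*shared, proposed));
    *shared = proposed;
}
```

With this check in place, step 5 of the scenario above (current 150,
proposed 100) would trip the assertion, while advancing from 50 to 100
or resetting to InvalidTransactionId would pass.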

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
