Re: Assertion failure in SnapBuildInitialSnapshot()

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2023-02-08 04:13:16
Message-ID: CAA4eK1LryoznTWB_Y7x=oXwpmJ-w4=3JHL8CRAHh0EuJuRUs3A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > Attached updated patches.
> > >
> >
> > Thanks, Andres, others, do you see a better way to fix this problem? I
> > have reproduced it manually and the steps are shared at [1] and
> > Sawada-San also reproduced it, see [2].
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> > [2] - https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
>
> Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
> the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
> already_locked=true worry me a lot less, because obviously that's not a very
> frequent operation.
>
> This is particularly not great because we need to acquire
> ReplicationSlotControlLock while already holding ProcArrayLock.
>
>
> But clearly there's a pretty large hole in the lock protection right now. I'm
> a bit confused about why we (Robert and I, or just I) thought it's ok to do it
> this way.
>
>
> I wonder if we could instead invert the locks, and hold
> ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
> acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().
>

Along with inverting, doesn't this mean that we need to acquire
ReplicationSlotControlLock in Exclusive mode instead of acquiring it
in shared mode? My understanding of the above locking scheme is that
in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
release it after releasing ProcArrayLock. Then,
ReplicationSlotsComputeRequiredXmin() acquires
ReplicationSlotControlLock in Exclusive mode only when already_locked
is false and releases it after a call to
ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
won't change.

I don't think just inverting the order without changing the lock mode
will solve the problem because still apply worker will be able to
override the replication_slot_xmin value.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-02-08 04:24:48 Re: OpenSSL 3.0.0 vs old branches
Previous Message Michael Paquier 2023-02-08 04:12:15 Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?