Re: Assertion failure in SnapBuildInitialSnapshot()

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2024-02-01 18:20:55
Message-ID: CALDaNm00DvZuqnV=VfG970-nYviiAb49URzQbNVxrmi12KNa4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 11 Jan 2024 at 19:55, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > >
> > > > On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> > > > > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > Attached updated patches.
> > > > > >
> > > > >
> > > > > Thanks, Andres, others, do you see a better way to fix this problem? I
> > > > > have reproduced it manually and the steps are shared at [1] and
> > > > > Sawada-San also reproduced it, see [2].
> > > > >
> > > > > [1] - https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> > > > > [2] - https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
> > > >
> > > > Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
> > > > the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> > > > non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
> > > > already_locked=true worry me a lot less, because obviously that's not a very
> > > > frequent operation.
> > > >
> > > > This is particularly not great because we need to acquire
> > > > ReplicationSlotControlLock while already holding ProcArrayLock.
> > > >
> > > >
> > > > But clearly there's a pretty large hole in the lock protection right now. I'm
> > > > a bit confused about why we (Robert and I, or just I) thought it's ok to do it
> > > > this way.
> > > >
> > > >
> > > > I wonder if we could instead invert the locks, and hold
> > > > ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
> > > > acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().
> > > >
> > >
> > > Along with inverting, doesn't this mean that we need to acquire
> > > ReplicationSlotControlLock in Exclusive mode instead of acquiring it
> > > in shared mode? My understanding of the above locking scheme is that
> > > in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
> > > in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
> > > release it after releasing ProcArrayLock. Then,
> > > ReplicationSlotsComputeRequiredXmin() acquires
> > > ReplicationSlotControlLock in Exclusive mode only when already_locked
> > > is false and releases it after a call to
> > > ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
> > > won't change.
> >
> > I've attached the patch of this idea for discussion. In
> > GetOldestSafeDecodingTransactionId() called by
> > CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
> > ProcArrayLock, and XidGenLock at a time. So we would need to be
> > careful about the ordering.
>
> I have changed the status of the patch to "Waiting on Author" as
> Robert's issues were not addressed yet. Feel free to change the status
> accordingly after addressing them.

The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2024-02-01 18:22:09 Re: Issue in postgres_fdw causing unnecessary wait for cancel request reply
Previous Message vignesh C 2024-02-01 18:19:39 Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"