Re: Assertion failure in SnapBuildInitialSnapshot()

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2024-01-11 14:25:52
Message-ID: CALDaNm2sjbP9dNymzA5w=Xk33MDi+Ap7wj=fnktjxYZj3d_dxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> > > > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > Attached updated patches.
> > > > >
> > > >
> > > > Thanks, Andres, others, do you see a better way to fix this problem? I
> > > > have reproduced it manually and the steps are shared at [1] and
> > > > Sawada-San also reproduced it, see [2].
> > > >
> > > > [1] - https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> > > > [2] - https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
> > >
> > > Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
> > > the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> > > non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
> > > already_locked=true worry me a lot less, because obviously that's not a very
> > > frequent operation.
> > >
> > > This is particularly not great because we need to acquire
> > > ReplicationSlotControlLock while already holding ProcArrayLock.
> > >
> > >
> > > But clearly there's a pretty large hole in the lock protection right now. I'm
> > > a bit confused about why we (Robert and I, or just I) thought it's ok to do it
> > > this way.
> > >
> > >
> > > I wonder if we could instead invert the locks, and hold
> > > ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
> > > acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().
> > >
> >
> > Along with inverting, doesn't this mean that we need to acquire
> > ReplicationSlotControlLock in Exclusive mode instead of acquiring it
> > in shared mode? My understanding of the above locking scheme is that
> > in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
> > in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
> > release it after releasing ProcArrayLock. Then,
> > ReplicationSlotsComputeRequiredXmin() acquires
> > ReplicationSlotControlLock in Exclusive mode only when already_locked
> > is false and releases it after a call to
> > ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
> > won't change.
>
> I've attached the patch of this idea for discussion. In
> GetOldestSafeDecodingTransactionId() called by
> CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
> ProcArrayLock, and XidGenLock at a time. So we would need to be
> careful about the ordering.

I have changed the status of the patch to "Waiting on Author" as
Robert's issues were not addressed yet. Feel free to change the status
accordingly after addressing them.

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-01-11 14:27:53 Re: Show WAL write and fsync stats in pg_stat_io
Previous Message Michael Banck 2024-01-11 14:24:55 Re: Set log_lock_waits=on by default