Re: Assertion failure in SnapBuildInitialSnapshot()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Assertion failure in SnapBuildInitialSnapshot()
Date: 2023-02-07 19:49:03
Message-ID: 20230207194903.ws4acm7ake6ikacn@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:
> On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Attached updated patches.
> >
>
> Thanks, Andres, others, do you see a better way to fix this problem? I
> have reproduced it manually and the steps are shared at [1] and
> Sawada-San also reproduced it, see [2].
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin(). That'd mean
that already_locked = true callers have to do a bit more work (we have to be
sure the locks are always acquired in the same order, or we end up in
unresolved deadlock land), but I think we can live with that.

This would still allow concurrent invocations of
ReplicationSlotsComputeRequiredXmin() come up with slightly different values,
but that's possible with the proposed patch as well, as effective_xmin is
updated without any of the other locks. But I don't see a problem with that.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-02-07 20:05:20 Re: Assertion failure in SnapBuildInitialSnapshot()
Previous Message Pavel Stehule 2023-02-07 19:48:22 possible proposal plpgsql GET DIAGNOSTICS oid = PG_ROUTINE_OID