Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C <vignesh21(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 17:06:32
Message-ID: CALDaNm2toSP3n3xaMPga_=y5zZ88P_cyZ-kyB2dyKKp3dE3uMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Mon, 30 Jan 2023 at 13:00, houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > > check in replorigin_drop_guts() which is a static function and
> > > > > called from only one place, so, will it be required to check at both places?
> > > >
> > > > There is a possibility that the initial check to verify if replication
> > > > origin exists in replorigin_drop_by_name was successful but later one
> > > > of either table sync worker or apply worker process might have dropped
> > > > the replication origin,
> > > >
> > >
> > > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > > drop happens after the patch acquires the lock on the origin.
> >
> > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> > already lock the origin before that. I think the check in replorigin_drop_guts
> > is a custom check after calling SearchSysCache1 to get the tuple, but the error
> > should not happen as no concurrent drop can be performed.
> >
> > To make it simpler, one idea is to move the code that getting the tuple from
> > system cache to the replorigin_drop_by_name(). After locking the origin, we
> > can try to get the tuple and do the existence check, and we can reuse
> > this tuple to perform origin delete. In this approach we only need to check
> > origin existence once after locking. BTW, if we do this, then we'd better rename the
> > replorigin_drop_guts() to something like replorigin_state_clear() as the function
> > only clear the in-memory information after that.
> >
> > The code could be like:
> >
> > -------
> > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> > /*
> > * Lock the origin to prevent concurrent drops. We keep the lock until the
> > * end of transaction.
> > */
> > LockSharedObject(ReplicationOriginRelationId, roident, 0,
> > AccessExclusiveLock);
> >
> > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
> > if (!HeapTupleIsValid(tuple))
> > {
> > if (!missing_ok)
> > elog(ERROR, "cache lookup failed for replication origin with ID %d",
> > roident);
> >
> > return;
> > }
> >
> > replorigin_state_clear(rel, roident, nowait);
> >
> > /*
> > * Now, we can delete the catalog entry.
> > */
> > CatalogTupleDelete(rel, &tuple->t_self);
> > ReleaseSysCache(tuple);
> >
> > CommandCounterIncrement();
>
> The attached updated patch has the changes to handle the same.

I had not merged one of the local changes that was present, please
find the updated patch including that change. Sorry for missing that
change.

Regards,
Vignesh

Attachment Content-Type Size
v3-0001-Lock-the-replication-origin-record-instead-of-loc.patch text/x-patch 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message João Paulo Labegalini de Carvalho 2023-01-30 17:24:02 Re: Optimizing PostgreSQL with LLVM's PGO+LTO
Previous Message Tom Lane 2023-01-30 17:00:31 Re: pub/sub - specifying optional parameters without values.