Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 03:50:10
Message-ID: CALDaNm2yFD3yC4G+1WXdywmf=iEYCw-zqmBtJC93Yj1LoQY99A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Jan 28, 2023 at 9:36 AM houzj(dot)fnst(at)fujitsu(dot)com
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > On Friday, January 27, 2023 8:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > >
> > > On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > > wrote:
> > > > >
> > > > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > > > related to it. If we want we can optimize it so that we can acquire
> > > > > the lock on the specific origin as mentioned in comments
> > > > > replorigin_drop_by_name() but it was not clear that this operation
> > > > > would be frequent enough.
> > > >
> > > > Here is an attached patch to lock the replication origin record using
> > > > LockSharedObject instead of locking pg_replication_origin relation in
> > > > ExclusiveLock mode. Now tablesync worker will wait only if the
> > > > tablesync worker is trying to drop the same replication origin which
> > > > has already been dropped by the apply worker, the other tablesync
> > > > workers will be able to successfully drop the replication origin
> > > > without any wait.
> > > >
> > >
> > > There is a code in the function replorigin_drop_guts() that uses the
> > > functionality introduced by replorigin_exists(). Can we reuse this function for
> > > the same?
> >
> > Maybe we can use SearchSysCacheExists1 to check the existence instead of
> > adding a new function.
> >
>
> Yeah, I think that would be better.
>
> > One comment about the patch.
> >
> > @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> > + /* Drop the replication origin if it has not been dropped already */
> > + if (replorigin_exists(roident))
> > replorigin_drop_guts(rel, roident, nowait);
> >
> > If developer pass missing_ok as false, should we report an ERROR here
> > instead of silently return ?
> >
>
> One thing that looks a bit odd is that we will anyway have a similar
> check in replorigin_drop_guts() which is a static function and called
> from only one place, so, will it be required to check at both places?

There is a possibility that the initial check to verify if replication
origin exists in replorigin_drop_by_name was successful but later one
of either table sync worker or apply worker process might have dropped
the replication origin, so it is better to check again before calling
replorigin_drop_guts, ideally the tuple should be valid in
replorigin_drop_guts, but can we keep the check as it is to maintain
the consistency before calling CatalogTupleDelete.

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2023-01-30 04:08:17 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message shiy.fnst@fujitsu.com 2023-01-30 03:36:59 RE: Logical replication timeout problem