Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C <vignesh21(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 09:12:43
Message-ID: CALDaNm19uLVJAqrQpaKDKXqJ6HCOg-vTQznPeKmiPtu_FrDKBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 30 Jan 2023 at 13:00, houzj(dot)fnst(at)fujitsu(dot)com
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check to verify if replication
> > > origin exists in replorigin_drop_by_name was successful but later one
> > > of either table sync worker or apply worker process might have dropped
> > > the replication origin,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> already lock the origin before that. I think the check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> should not happen as no concurrent drop can be performed.

This scenario is possible while creating subscription, apply worker
will try to drop the replication origin if the state is
SUBREL_STATE_SYNCDONE. Table sync worker will set the state to
SUBREL_STATE_SYNCDONE and update the relation state before calling
replorigin_drop_by_name. Since the transaction is committed by table
sync worker, the state is visible to apply worker, now apply worker
will parallelly try to drop the replication origin in this case.
There is a race condition in this case, one of the process table sync
worker or apply worker will acquire the lock and drop the replication
origin, the other process will get the lock after the process drops
the origin and commits the transaction. Now the other process will try
to drop the replication origin once it acquires the lock and get the
error(from replorigin_drop_guts): cache lookup failed for replication
origin with ID.
Concurrent drop is possible in this case.

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Geier 2023-01-30 09:13:37 Re: Lazy JIT IR code generation to increase JIT speed with partitions
Previous Message Richard Guo 2023-01-30 08:55:31 Re: Check lateral references within PHVs for memoize cache keys