RE: Deadlock between logrep apply worker and tablesync worker

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 07:30:04
Message-ID: OS0PR01MB571678FF1990BE28BAD861F294D39@OS0PR01MB5716.jpnprd01.prod.outlook.com

On Monday, January 30, 2023 2:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > One thing that looks a bit odd is that we will anyway have a similar
> > > check in replorigin_drop_guts(), which is a static function called
> > > from only one place, so will the check be required in both places?
> >
> > There is a possibility that the initial check in replorigin_drop_by_name
> > to verify that the replication origin exists succeeded, but that later
> > either the tablesync worker or the apply worker dropped the replication
> > origin,
> >
>
> Won't locking the particular origin prevent concurrent drops? IIUC, the
> drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts is unnecessary, as we
already lock the origin before reaching it. The check there is just a sanity
check on the tuple returned by SearchSysCache1, and that error should never be
raised because no concurrent drop can happen while the lock is held.
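
For reference, the check in question is the catalog-tuple lookup at the end of
replorigin_drop_guts(), which from memory looks roughly like the following
(quoting the shape, not the exact tree); it is this elog() that becomes
unreachable once the caller holds the lock:

-------
/* At the end of replorigin_drop_guts(), roughly: */
tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for replication origin with ID %d",
         roident);

CatalogTupleDelete(rel, &tuple->t_self);
ReleaseSysCache(tuple);
-------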

To make it simpler, one idea is to move the code that gets the tuple from the
system cache into replorigin_drop_by_name(). After locking the origin, we can
fetch the tuple and do the existence check there, and then reuse the same
tuple to delete the catalog entry. With this approach we only need to check
the origin's existence once, after locking. BTW, if we do this, we'd better
rename replorigin_drop_guts() to something like replorigin_state_clear(),
since after the change the function only clears the in-memory state.

The code could look like this:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
    /*
     * Lock the origin to prevent concurrent drops. We keep the lock until
     * the end of transaction.
     */
    LockSharedObject(ReplicationOriginRelationId, roident, 0,
                     AccessExclusiveLock);

    tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
    if (!HeapTupleIsValid(tuple))
    {
        if (!missing_ok)
            elog(ERROR, "cache lookup failed for replication origin with ID %d",
                 roident);

        return;
    }

    replorigin_state_clear(rel, roident, nowait);

    /*
     * Now, we can delete the catalog entry.
     */
    CatalogTupleDelete(rel, &tuple->t_self);
    ReleaseSysCache(tuple);

    CommandCounterIncrement();
...
-------
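
For completeness, after the refactoring the renamed function would reduce to
something like the skeleton below. This is only a hypothetical sketch of the
division of labor; the real body would keep replorigin_drop_guts()'s existing
shared-memory logic:

-------
/* Hypothetical skeleton, in-memory cleanup only: */
static void
replorigin_state_clear(Relation rel, RepOriginId roident, bool nowait)
{
    /* Find the origin's slot in the shared replication_states array. */
    /* If another backend has acquired it: ERROR when nowait, else wait. */
    /* Reset the slot so it can be reused. No catalog access here; the */
    /* caller deletes the pg_replication_origin tuple itself. */
}
-------

Arguably the rel argument could then be dropped too, since the function would
no longer touch the catalog.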

Best Regards,
Hou zj
