Re: 024_add_drop_pub.pl might fail due to deadlock

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 024_add_drop_pub.pl might fail due to deadlock
Date: 2025-07-08 10:41:18
Message-ID: CAFPTHDYucxiwZ-oVy0CV0Z0iviyy_vDWE=p+=csH66oo+8odDw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >
> > --- a/src/backend/replication/logical/origin.c
> > +++ b/src/backend/replication/logical/origin.c
> > @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > * the specific origin and then re-check if the origin still exists.
> > */
> > rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
> > +pg_usleep(300000);
> >
> > Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
> > 024_add_drop_pub.pl was added in v15).
> >
> > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
> >
> > Best regards,
> > Alexander
> >
>
> Hi Alexander,
>
> Yes, the problem can be reproduced by the changes you suggested. I
> will look into what is happening and how we can fix this.

The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:

Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).

Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire an AccessShareLock
on SubscriptionRelRelationId (resource 1).

This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
process B holds a lock on resource 2 and waits for resource 1.
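
To make the cycle concrete, here is a minimal standalone sketch in C. It is
not PostgreSQL code: the LockState struct, has_deadlock(), and the two state
builders are hypothetical names used only for illustration, and lock modes are
simplified to "one holder per resource". It models the wait-for chain
described above and checks whether following it leads back to the starting
process.

```c
#include <stdbool.h>

/* Simplified model: each resource has at most one holder (lock modes ignored). */
enum { RES_SUBSCRIPTION_REL = 0, RES_REPL_ORIGIN = 1, NUM_RES = 2 };
enum { PROC_A = 0, PROC_B = 1, NUM_PROCS = 2 };
#define NONE (-1)

typedef struct
{
    int holder[NUM_RES];      /* process holding each resource, or NONE */
    int waits_for[NUM_PROCS]; /* resource each process is blocked on, or NONE */
} LockState;

/* Follow "process -> resource it waits for -> holder of that resource"
 * and report whether the chain returns to the starting process. */
static bool
has_deadlock(const LockState *s, int start)
{
    int proc = start;

    for (int steps = 0; steps < NUM_PROCS; steps++)
    {
        int res = s->waits_for[proc];

        if (res == NONE)
            return false;       /* this process is not blocked */

        int owner = s->holder[res];

        if (owner == NONE)
            return false;       /* resource is free, the wait will end */
        if (owner == start)
            return true;        /* chain closed: a wait-for cycle */
        proc = owner;
    }
    return false;
}

/* The situation described above: A holds resource 1 and waits for
 * resource 2, B holds resource 2 and waits for resource 1. */
static LockState
reported_state(void)
{
    LockState s;

    s.holder[RES_SUBSCRIPTION_REL] = PROC_A;
    s.holder[RES_REPL_ORIGIN] = PROC_B;
    s.waits_for[PROC_A] = RES_REPL_ORIGIN;
    s.waits_for[PROC_B] = RES_SUBSCRIPTION_REL;
    return s;
}

/* Same race, but B requests resource 1 first: B simply queues behind A
 * and holds nothing that A needs. */
static LockState
same_order_state(void)
{
    LockState s;

    s.holder[RES_SUBSCRIPTION_REL] = PROC_A;
    s.holder[RES_REPL_ORIGIN] = NONE;
    s.waits_for[PROC_A] = NONE;
    s.waits_for[PROC_B] = RES_SUBSCRIPTION_REL;
    return s;
}
```

In this model has_deadlock(&s, PROC_A) is true for reported_state() and false
for same_order_state(), which is the property a consistent acquisition order
is meant to provide.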

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.
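
The ordering rule behind this fix can be sketched outside PostgreSQL with two
plain pthread mutexes. This is only an illustration of the discipline, not the
attached patch: the mutex names, worker(), and run_ordered_workers() are
hypothetical stand-ins, and a mutex does not model PostgreSQL lock modes.
Both threads take the "subscription rel" lock first and the "replication
origin" lock second, so neither can hold the second lock while waiting for
the first.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-ins for the two catalog locks (plain mutexes, not PostgreSQL locks). */
static pthread_mutex_t subscription_rel_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t repl_origin_lock = PTHREAD_MUTEX_INITIALIZER;
static int completed = 0;   /* protected by both locks being held */

/* Every worker uses the same order: resource 1, then resource 2. */
static void *
worker(void *arg)
{
    (void) arg;

    pthread_mutex_lock(&subscription_rel_lock);   /* resource 1 first */
    pthread_mutex_lock(&repl_origin_lock);        /* then resource 2 */
    completed++;
    pthread_mutex_unlock(&repl_origin_lock);
    pthread_mutex_unlock(&subscription_rel_lock);
    return NULL;
}

/* Run two workers concurrently; returns how many finished, or -1 if a
 * thread could not be created. */
static int
run_ordered_workers(void)
{
    pthread_t a;
    pthread_t b;

    completed = 0;
    if (pthread_create(&a, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, worker, NULL) != 0)
    {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return completed;
}
```

Calling run_ordered_workers() should let both workers finish regardless of
scheduling. Swapping the two lock calls in only one of the workers would
reintroduce the inversion described earlier.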

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

regards,
Ajin Cherian
Fujitsu Australia.

Attachment Content-Type Size
0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patch application/octet-stream 2.2 KB
