Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Date: 2023-11-22 11:21:25
Message-ID: 69e9a1af-a315-8b05-0674-2cc53bf27682@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/22/23 11:38, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 11/21/23 14:16, Amit Kapila wrote:
>>> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
>>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>>
>>>
>>> It seems there is some inconsistency in what you have written for
>>> client backends/tablesync worker vs. apply worker. The above text
>>> seems to be saying that the client backend and table sync worker are
>>> waiting on a "subscription row in pg_subscription" and the apply
>>> worker is operating on "pg_subscription_rel". So, if that is true then
>>> they shouldn't get stuck.
>>>
>>> I think here client backend and tablesync worker seems to be blocked
>>> for a lock on pg_subscription_rel.
>>>
>>
>> Not really, they are all locking the subscription. All the locks are on
>> classid=6100, which is pg_subscription:
>>
>> test=# select 6100::regclass;
>> regclass
>> -----------------
>> pg_subscription
>> (1 row)
>>
>> The thing is, the tablesync workers call UpdateSubscriptionRelState,
>> which locks the pg_subscription catalog at the very beginning:
>>
>> LockSharedObject(SubscriptionRelationId, ...);
>>
>> So that's the issue. I haven't explored why it's done this way, and
>> there's no comment explaining locking the subscriptions is needed ...
>>
>
> I think it prevents concurrent drop of rel during the REFRESH operation.
>

Yes. Or maybe some other concurrent DDL on the relations included in the
subscription.

>>>> The tablesync workers can't proceed because their lock request is stuck
>>>> behind the AccessExclusiveLock request.
>>>>
>>>> And the apply worker can't proceed, because it's waiting for status
>>>> update from the tablesync workers.
>>>>
>>>
>>> This part is not clear to me because
>>> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
>>> be releasing the lock while closing the relation. Am, I missing
>>> something?
>>>
>>
>> I think you're missing the fact that GetSubscriptionRelState() acquires
>> and releases the lock on pg_subscription_rel, but that's not the lock
>> causing the issue. The problem is the lock on the pg_subscription row.
>>
>
> Okay. IIUC, what's going on here is that the apply worker acquires
> AccessShareLock on pg_subscription to update rel state for one of the
> tables say tbl-1, and then for another table say tbl-2, it started
> waiting for a state change via wait_for_relation_state_change(). I
> think here the fix is to commit the transaction before we go for a
> wait. I guess we need something along the lines of what is proposed in
> [1] though we have solved the problem in that thread in some other
> way..
>

Possibly. I haven't checked if the commit might have some unexpected
consequences, but I can confirm I can no longer reproduce the deadlock
with the patch applied.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2023-11-22 11:30:33 Re: Changing baserel to foreignrel in postgres_fdw functions
Previous Message Amit Kapila 2023-11-22 10:38:49 Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION