Re: Get stuck when dropping a subscription during synchronizing table

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Get stuck when dropping a subscription during synchronizing table
Date: 2017-06-28 05:13:11
Message-ID: CAD21AoDA=F0Sa7j3UC-aYyW+wLWGV+axgECP0b=ab-=+J8TgdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>
> On 27/06/17 10:51, Masahiko Sawada wrote:
>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>
>> I've reviewed this patch briefly.
>
> Thanks!
>
>>
>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>> }
>>
>> /*
>> + * Request worker to be stopped on commit.
>> + */
>> +void
>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>> +{
>> + LogicalRepWorkerId *wid;
>> + MemoryContext old;
>> +
>> + old = MemoryContextSwitchTo(TopTransactionContext);
>> +
>> + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>> +
>> + /*
>> + wid = MemoryContextAlloc(TopTransactionContext,
>> +
>> sizeof(LogicalRepWorkerId));
>> + */
>> + wid->subid = subid;
>> + wid->relid = relid;
>> +
>> + on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>> +
>> + MemoryContextSwitchTo(old);
>> +}
>>
>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>> called by lappend() allocates the memory from current memory context.
>> It should switch the memory context and then allocate the memory for
>> wid and append it to the list.
>>

Thank you for updating the patch!

>
> Right, fixed (I see you did that locally as well based on the above
> excerpt ;) ).

Oops, yeah that's my test code.

>> --------
>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>> void
>> AtEOXact_ApplyLauncher(bool isCommit)
>> {
>> - if (isCommit && on_commit_launcher_wakeup)
>> - ApplyLauncherWakeup();
>> + ListCell *lc;
>> +
>> + if (isCommit)
>> + {
>> + foreach (lc, on_commit_stop_workers)
>> + {
>> + LogicalRepWorkerId *wid = lfirst(lc);
>> + logicalrep_worker_stop(wid->subid, wid->relid);
>> + }
>> +
>> + if (on_commit_launcher_wakeup)
>> + ApplyLauncherWakeup();
>>
>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>> support the prepared transaction. Since we allocate the list
>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>> if we execute any query after prepared transaction that removes
>> subscription relation state. Also after fixed this issue, we still
>> need to something: the list of on_commit_stop_workers is not
>> associated the prepared transaction. A query next to "preapre
>> transaction" tries to stop workers at the commit. There was similar
>> discussion before.
>>
>
> Hmm, good point. I think for now it makes sense to simply don't allow
> PREPARE for transactions that manipulate workers similar to what we do
> when there are exported snapshots. Done it that way in attached.

Agreed. Should we note that in docs?

>
>> --------
>> +
>> + ensure_transaction();
>> +
>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>> +
>> rstate->relid, rstate->state,
>> +
>> rstate->lsn);
>>
>>
>> Should we commit the transaction if we started a new transaction
>> before update the subscription relation state, or it could be deadlock
>> risk.
>
> We only lock the whole subscription (and only conflicting things are
> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
> mapping so it doesn't seem to me like there is any higher risk of
> deadlocks than anything else which works with multiple tables (or
> compared to previous behavior).
>

I'm concerned that a lock for whole subscription could be conflicted
between ALTER SUBSCRIPTION, table sync worker and apply worker:

Please imagine the following case.

1. The apply worker updates a subscription relation state R1 of
subscription S1.
-> It acquires AccessShareLock on S1, and keep holding.
2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire
AccessExclusiveLock on S1.
-> But it waits for the apply worker to release the lock.
3. The apply worker calls wait_for_relation_state_change(relid,
SUBREL_STATE_SYNCDONE) and waits for the table sync worker
for R2 to change its status.
-> Note that the apply worker is still holding AccessShareLock on S1
4. The table sync worker tries to update its status to SYNCDONE
-> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock
on S1 but waits for it. *deadlock*

To summary, because the apply worker keeps holding AccessShareLock on
S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
the apply worker, and then the table sync worker also waits for the
ALTER SUBSCRIPTION in order to change its status. And the apply worker
waits for the table sync worker to change its status.

I encountered the similar case once on my environment. But since it
completely depends on timing I don't have the concrete reproducing
steps.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-06-28 05:19:08 Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().
Previous Message Noah Misch 2017-06-28 03:22:18 Re: Get stuck when dropping a subscription during synchronizing table