Re: tablesync patch broke the assumption that logical rep depends on?

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tablesync patch broke the assumption that logical rep depends on?
Date: 2017-04-14 19:52:35
Message-ID: 2779bea9-fcfa-5476-5168-9430b45fb64b@2ndquadrant.com
Lists: pgsql-hackers

On 13/04/17 19:31, Fujii Masao wrote:
> On Fri, Apr 14, 2017 at 1:28 AM, Peter Eisentraut
> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>> On 4/10/17 13:28, Fujii Masao wrote:
>>> src/backend/replication/logical/launcher.c
>>> * Worker started and attached to our shmem. This check is safe
>>> * because only launcher ever starts the workers, so nobody can steal
>>> * the worker slot.
>>>
>>> The tablesync patch made it possible for a worker, too, to start another
>>> worker. So the above assumption no longer holds.
>>>
>>> This issue can cause a corner case where the launcher picks up the
>>> same worker slot that a previously started worker has already picked
>>> up in order to start another worker.
>>
>> I think what the comment should rather say is that workers are always
>> started through logicalrep_worker_launch() and worker slots are always
>> handed out while holding LogicalRepWorkerLock exclusively, so nobody can
>> steal the worker slot.
>>
>> Does that make sense?
>
> No, unless I'm missing something.
>
> logicalrep_worker_launch() picks an unused worker slot (slot's proc == NULL)
> while holding LogicalRepWorkerLock. But it releases the lock before the slot
> is marked as used (i.e., before the slot's proc is set to non-NULL). Only then
> does the newly launched worker call logicalrep_worker_attach() and mark the
> slot as used.
>
> So if another logicalrep_worker_launch() runs after LogicalRepWorkerLock
> is released but before the slot is marked as used, it can pick up the same
> slot, because that slot still looks unused.
>
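To make the race concrete, here is a deliberately simplified C model. It is not the real launcher.c code: the slot layout, the names pick_slot_racy() and attach_slot(), and the use of a pthread mutex in place of LogicalRepWorkerLock are all assumptions for illustration. The slot is chosen under the lock, but it is only marked as used later, by the worker itself:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, heavily simplified model of the launcher's worker slots;
 * names are illustrative, not PostgreSQL's actual definitions. */
#define MAX_WORKERS 4

typedef struct WorkerSlot
{
    void *proc;             /* NULL means "looks unused" */
} WorkerSlot;

static WorkerSlot slots[MAX_WORKERS];
/* Stand-in for LogicalRepWorkerLock (assumption: a plain mutex). */
static pthread_mutex_t worker_lock = PTHREAD_MUTEX_INITIALIZER;

/* Launch side: pick a slot that looks free while holding the lock. */
static int
pick_slot_racy(void)
{
    int found = -1;

    pthread_mutex_lock(&worker_lock);
    for (int i = 0; i < MAX_WORKERS; i++)
    {
        if (slots[i].proc == NULL)
        {
            found = i;
            break;
        }
    }
    /* Lock released here, but nothing marks the slot as taken yet. */
    pthread_mutex_unlock(&worker_lock);
    return found;
}

/* Worker side: attach later, which is the first time the slot looks used. */
static void
attach_slot(int idx, void *proc)
{
    assert(idx >= 0 && idx < MAX_WORKERS);
    pthread_mutex_lock(&worker_lock);
    slots[idx].proc = proc;     /* silently overwrites a concurrent claim */
    pthread_mutex_unlock(&worker_lock);
}
```

Calling pick_slot_racy() twice before either new worker has called attach_slot() returns the same index both times, which is the double assignment described above; the second attach then overwrites the first.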

Yeah, I think the problem is less that comment than the fact that
logicalrep_worker_launch() isn't concurrency-safe. We need an in_use marker
for the workers, updated as needed, instead of relying on PGPROC.
I'll write something up over the weekend.
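A minimal sketch of the in_use idea, again on a hypothetical simplified model rather than the eventual patch (pick_slot(), attach_slot(), release_slot() and the pthread mutex standing in for LogicalRepWorkerLock are all assumptions). The launcher claims the slot by setting in_use before dropping the lock, so a concurrent launch can no longer see it as free; proc is only filled in when the worker attaches:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 4

/* Hypothetical simplified slot; not PostgreSQL's LogicalRepWorker. */
typedef struct WorkerSlot
{
    bool  in_use;           /* claimed by a launcher, set under the lock */
    void *proc;             /* filled in later when the worker attaches */
} WorkerSlot;

static WorkerSlot slots[MAX_WORKERS];
/* Stand-in for LogicalRepWorkerLock (assumption: a plain mutex). */
static pthread_mutex_t worker_lock = PTHREAD_MUTEX_INITIALIZER;

/* Launch side: find and claim a free slot in one critical section. */
static int
pick_slot(void)
{
    int found = -1;

    pthread_mutex_lock(&worker_lock);
    for (int i = 0; i < MAX_WORKERS; i++)
    {
        if (!slots[i].in_use)
        {
            slots[i].in_use = true;     /* claim before releasing the lock */
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&worker_lock);
    return found;                       /* -1 if no free slot */
}

/* Worker side: attach to a slot that was already claimed for it. */
static void
attach_slot(int idx, void *proc)
{
    assert(idx >= 0 && idx < MAX_WORKERS);
    pthread_mutex_lock(&worker_lock);
    assert(slots[idx].in_use && slots[idx].proc == NULL);
    slots[idx].proc = proc;
    pthread_mutex_unlock(&worker_lock);
}

/* Worker exit (or failed start) must hand the slot back. */
static void
release_slot(int idx)
{
    assert(idx >= 0 && idx < MAX_WORKERS);
    pthread_mutex_lock(&worker_lock);
    slots[idx].in_use = false;
    slots[idx].proc = NULL;
    pthread_mutex_unlock(&worker_lock);
}
```

With this, two back-to-back pick_slot() calls return different slots even before either worker attaches. The trade-off is that every path that frees a worker, including a worker that fails to start, has to reset in_use, otherwise the slot leaks.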

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
