Quick Links

Re: SyncRepLock acquired exclusively in default configuration

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc:	Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Xin Zhang <xzhang(at)pivotal(dot)io>, Asim Praveen <apraveen(at)pivotal(dot)io>
Subject:	Re: SyncRepLock acquired exclusively in default configuration
Date:	2020-07-09 10:52:01
Message-ID:	8bff9d04-e3e6-0e83-1e3f-7268aa42cfaf@oss.nttdata.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2020/05/19 11:41, Masahiko Sawada wrote:
> On Sat, 11 Apr 2020 at 09:30, Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>>
>> On Fri, 10 Apr 2020 at 21:52, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>>
>>>
>>> On 2020/04/10 20:56, Masahiko Sawada wrote:
>>>> On Fri, 10 Apr 2020 at 18:57, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2020/04/10 14:11, Masahiko Sawada wrote:
>>>>>> On Fri, 10 Apr 2020 at 13:20, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2020/04/08 3:01, Ashwin Agrawal wrote:
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2020 at 2:14 PM Andres Freund <andres(at)anarazel(dot)de <mailto:andres(at)anarazel(dot)de>> wrote:
>>>>>>>>
>>>>>>>> > How about we change it to this ?
>>>>>>>>
>>>>>>>> Hm. Better. But I think it might need at least a compiler barrier /
>>>>>>>> volatile memory load? Unlikely here, but otherwise the compiler could
>>>>>>>> theoretically just stash the variable somewhere locally (it's not likely
>>>>>>>> to be a problem because it'd not be long ago that we acquired an lwlock,
>>>>>>>> which is a full barrier).
>>>>>>>>
>>>>>>>>
>>>>>>>> That's the part, I am not fully sure about. But reading the comment above SyncRepUpdateSyncStandbysDefined(), it seems fine.
>>>>>>>>
>>>>>>>> > Bring back the check which existed based on GUC but instead of just blindly
>>>>>>>> > returning based on just GUC not being set, check
>>>>>>>> > WalSndCtl->sync_standbys_defined. Thoughts?
>>>>>>>>
>>>>>>>> Hm. Is there any reason not to just check
>>>>>>>> WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
>>>>>>>> and WalSndCtl->sync_standbys_defined?
>>>>>>>>
>>>>>>>>
>>>>>>>> Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.
>>>>>>>
>>>>>>> So the consensus is something like the following? Patch attached.
>>>>>>>
>>>>>>> /*
>>>>>>> - * Fast exit if user has not requested sync replication.
>>>>>>> + * Fast exit if user has not requested sync replication, or there are no
>>>>>>> + * sync replication standby names defined.
>>>>>>> */
>>>>>>> - if (!SyncRepRequested())
>>>>>>> + if (!SyncRepRequested() ||
>>>>>>> + !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
>>>>>>> return;
>>>>>>>
>>>>>>
>>>>>> I think we need more comments describing why checking
>>>>>> sync_standby_defined without SyncRepLock is safe here. For example:
>>>>>
>>>>> Yep, agreed!
>>>>>
>>>>>> This routine gets called every commit time. So, to check if the
>>>>>> synchronous standbys is defined as quick as possible we check
>>>>>> WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
>>>>>> we make this test unlocked, there's a change we might fail to notice
>>>>>> that it has been turned off and continue processing.
>>>>>
>>>>> Does this really happen? I was thinking that the problem by not taking
>>>>> the lock here is that SyncRepWaitForLSN() can see that shared flag after
>>>>> SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
>>>>> before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
>>>>> into the wait queue becaues the flag was true, without lock, it may keep
>>>>> sleeping infinitely.
>>>>
>>>> I think that because a backend does the following check after
>>>> acquiring SyncRepLock, in that case, once the backend has taken
>>>> SyncRepLock it can see that sync_standbys_defined is false and return.
>>>
>>> Yes, but the backend can see that sync_standby_defined indicates false
>>> whether holding SyncRepLock or not, after the checkpointer sets it to false.
>>>
>>>> But you meant that we do both checks without SyncRepLock?
>>>
>>> Maybe No. The change that the latest patch provides should be applied, I think.
>>> That is, sync_standbys_defined should be check without lock at first, then
>>> only if it's true, it should be checked again with lock.
>>
>> Yes. My understanding is the same.
>>
>> After applying your patch, SyncRepWaitForLSN() is going to become
>> something like:
>>
>> /*
>> * Fast exit if user has not requested sync replication, or there are no
>> * sync replication standby names defined.
>> */
>> if (!SyncRepRequested() ||
>> !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
>> return;
>>
>> Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
>> Assert(WalSndCtl != NULL);
>>
>> LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>> Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
>>
>> /*
>> * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
>> * set. See SyncRepUpdateSyncStandbysDefined.
>> *
>> * Also check that the standby hasn't already replied. Unlikely race
>> * condition but we'll be fetching that cache line anyway so it's likely
>> * to be a low cost check.
>> */
>> if (!WalSndCtl->sync_standbys_defined ||
>> lsn <= WalSndCtl->lsn[mode])
>> {
>> LWLockRelease(SyncRepLock);
>> return;
>> }
>>
>> /*
>> * Set our waitLSN so WALSender will know when to wake us, and add
>> * ourselves to the queue.
>> */
>> MyProc->waitLSN = lsn;
>> MyProc->syncRepState = SYNC_REP_WAITING;
>> SyncRepQueueInsert(mode);
>> Assert(SyncRepQueueIsOrderedByLSN(mode));
>> LWLockRelease(SyncRepLock);
>>
>> There are two checks of sync_standbys_defined. The first check is
>> performed without SyncRepLock and the second check is performed with
>> SyncRepLock. That's what you and I are expecting. Right?
>>
>>>
>>> ISTM that basically SyncRepLock is used in SyncRepWaitForLSN() and
>>> SyncRepUpdateSyncStandbysDefined() to make operation on the queue
>>> and enabling sync_standbys_defined atomic. Without lock, the issue that
>>> the comment in SyncRepUpdateSyncStandbysDefined() explains would
>>> happen. That is, the backend may keep waiting infinitely as follows.
>>>
>>
>> Let me think the following sequence after applying your changes:
>>
>>> 1. checkpointer calls SyncRepUpdateSyncStandbysDefined()
>>> 2. checkpointer sees that the flag indicates true but the config indicates false
>>> 3. checkpointer takes lock and wakes up all the waiters
>>> 4. backend calls SyncRepWaitForLSN() can see that the flag indicates true
>>
>> Yes, I suppose this is the first check of sync_standbys_defined.
>>
>> And before the second check, backend tries to acquire SyncRepLock but
>> since the lock is already being held by checkohpointer it must wait.
>>
>>> 5. checkpointer sets the flag to false and releases the lock
>>
>> After checkpointer release the lock, the backend is woken up.
>>
>>> 6. backend adds itself to the queue and wait until it's waken up, but will not happen immediately
>>
>> The backend sees that the flag has been false at the second check, so return.
>>
>> If we didn't acquire SyncRepLock even for the second check I think the
>> backend would keep waiting infinitely as you mentioned.
>>
>>>
>>> So after the backend sees that the flag indicates true without lock,
>>> it must check the flag again with lock immediately without operating
>>> the queue. If this my understanding is right, I was thinking that
>>> the comment should mention these things.
>>
>> I think that's right. I was going to describe why we do the first
>> check without SyncRepLock and why it is safe but it seems to me that
>> these things you mentioned are related to the second check, if I'm not
>> missing something.
>>
>>>
>>>> /*
>>>> * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
>>>> * set. See SyncRepUpdateSyncStandbysDefined.
>>>> *
>>>> * Also check that the standby hasn't already replied. Unlikely race
>>>> * condition but we'll be fetching that cache line anyway so it's likely
>>>> * to be a low cost check.
>>>> */
>>>> if (!WalSndCtl->sync_standbys_defined ||
>>>> lsn <= WalSndCtl->lsn[mode])
>>>> {
>>>> LWLockRelease(SyncRepLock);
>>>> return;
>>>> }
>>>>
>>>>>
>>>>>> But since the
>>>>>> subsequent check will check it again while holding SyncRepLock, it's
>>>>>> no problem. Similarly even if we fail to notice that it has been
>>>>>> turned on
>>>>> Is this true? ISTM that after SyncRepUpdateSyncStandbysDefined()
>>>>> sets the flag to true, SyncRepWaitForLSN() basically doesn't seem
>>>>> to fail to notice that. No?
>>>>
>>>> What I wanted to say is, in the current code, while the checkpointer
>>>> process is holding SyncRepLock to turn off sync_standbys_defined,
>>>> backends who reach SyncRepWaitForLSN() wait for the lock. Then, after
>>>> the checkpointer process releases SyncRepLock these backend can
>>>> enqueue themselves to the wait queue because they can see that
>>>> sync_standbys_defined is turned on.
>>>
>>> In this case, since the checkpointer turned the flag off while holding
>>> the lock, the backend sees that the flag is turned off, and doesn't
>>> enqueue itself. No?
>>
>> Oops, I had mistake here. It should be "while the checkpointer process
>> is holding SyncRepLock to turn *on* sync_standbys_defined, ...".
>>
>>>
>>>> On the other hand if we do the
>>>> check without SyncRepLock, backends who reach SyncRepWaitForLSN() will
>>>> return instead of waiting, in spite of checkpointer process being
>>>> turning on sync_standbys_defined. Which means these backends are
>>>> failing to notice that it has been turned on, I thought.
>>>
>>> No. Or I'm missing something... In this case, the backend sees that
>>> the flag is turned on without lock since checkpointer turned it on.
>>> So you're thinking the following. Right?
>>>
>>> 1. sync_standbys_defined flag is false
>>> 2. checkpointer takes the lock and turns the flag on
>>> 3. backend sees the flag
>>> 4. checkpointer releases the lock
>>>
>>> In #3, the flag indicates true, I think. But you think it's false?
>>
>> I meant the backends who reach SyncRepLock() while checkpointer is at
>> after acquiring the lock but before turning the flag on, described at
>> #3 step in the following sequence. Such backends will wait for the
>> lock in the current code, but after applying the patch they return
>> quickly. So what I'm thinking is:
>>
>> 1. sync_standbys_defined flag is false
>> 2. checkpointer takes the lock
>> 3. backend sees the flag, and return as it's still false
>> 4. checkpointer turns the flag on
>> 5. checkpointer releases the lock
>>
>> If a backend reaches SyncRepWaitForLSN() between #4 and #5 it will
>> wait for the lock and then enqueue itself after acquiring the lock.
>> But such behavior is not changed before and after applying the patch.
>>
>
> Fujii-san, I think we agree on how to fix this issue and on the patch
> you proposed so please add your comments.

Sorry for the late reply...

Regarding how to fix, don't we need memory barrier when reading
sync_standbys_defined? Without that, after SyncRepUpdateSyncStandbysDefined()
updates it to true, SyncRepWaitForLSN() can see the previous value,
i.e., false, and then exit out of the function. Is this right?
If this is right, we need memory barrier to avoid this issue?

> This item is for PG14, right?

Yes!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Re: SyncRepLock acquired exclusively in default configuration at 2020-05-19 02:41:12 from Masahiko Sawada

Responses

Re: SyncRepLock acquired exclusively in default configuration at 2020-08-11 11:55:05 from Asim Praveen

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2020-07-09 10:57:22	Re: Performing partition pruning using row value
Previous Message	Etsuro Fujita	2020-07-09 10:45:58	Re: Performing partition pruning using row value