Re: SyncRepLock acquired exclusively in default configuration

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Xin Zhang <xzhang(at)pivotal(dot)io>, Asim Praveen <apraveen(at)pivotal(dot)io>
Subject: Re: SyncRepLock acquired exclusively in default configuration
Date: 2020-04-10 11:56:57
Message-ID: CA+fd4k7+3nta_3RsGZnfVZt1W4khvVsoSZShVO_ZjWnf_crAqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 10 Apr 2020 at 18:57, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
>
>
> On 2020/04/10 14:11, Masahiko Sawada wrote:
> > On Fri, 10 Apr 2020 at 13:20, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> >>
> >>
> >>
> >> On 2020/04/08 3:01, Ashwin Agrawal wrote:
> >>>
> >>> On Mon, Apr 6, 2020 at 2:14 PM Andres Freund <andres(at)anarazel(dot)de <mailto:andres(at)anarazel(dot)de>> wrote:
> >>>
> >>> > How about we change it to this ?
> >>>
> >>> Hm. Better. But I think it might need at least a compiler barrier /
> >>> volatile memory load? Unlikely here, but otherwise the compiler could
> >>> theoretically just stash the variable somewhere locally (it's not likely
> >>> to be a problem because it'd not be long ago that we acquired an lwlock,
> >>> which is a full barrier).
> >>>
> >>>
> >>> That's the part, I am not fully sure about. But reading the comment above SyncRepUpdateSyncStandbysDefined(), it seems fine.
> >>>
> >>> > Bring back the check which existed based on GUC but instead of just blindly
> >>> > returning based on just GUC not being set, check
> >>> > WalSndCtl->sync_standbys_defined. Thoughts?
> >>>
> >>> Hm. Is there any reason not to just check
> >>> WalSndCtl->sync_standbys_defined? rather than both !SyncStandbysDefined()
> >>> and WalSndCtl->sync_standbys_defined?
> >>>
> >>>
> >>> Agree, just checking for WalSndCtl->sync_standbys_defined seems fine.
> >>
> >> So the consensus is something like the following? Patch attached.
> >>
> >> /*
> >> - * Fast exit if user has not requested sync replication.
> >> + * Fast exit if user has not requested sync replication, or there are no
> >> + * sync replication standby names defined.
> >> */
> >> - if (!SyncRepRequested())
> >> + if (!SyncRepRequested() ||
> >> + !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
> >> return;
> >>
> >
> > I think we need more comments describing why checking
> > sync_standby_defined without SyncRepLock is safe here. For example:
>
> Yep, agreed!
>
> > This routine gets called every commit time. So, to check if the
> > synchronous standbys is defined as quick as possible we check
> > WalSndCtl->sync_standbys_defined without acquiring SyncRepLock. Since
> > we make this test unlocked, there's a change we might fail to notice
> > that it has been turned off and continue processing.
>
> Does this really happen? I was thinking that the problem by not taking
> the lock here is that SyncRepWaitForLSN() can see that shared flag after
> SyncRepUpdateSyncStandbysDefined() wakes up all the waiters and
> before it sets the flag to false. Then if SyncRepWaitForLSN() adds itself
> into the wait queue becaues the flag was true, without lock, it may keep
> sleeping infinitely.

I think that because a backend does the following check after
acquiring SyncRepLock, in that case, once the backend has taken
SyncRepLock it can see that sync_standbys_defined is false and return.
But you meant that we do both checks without SyncRepLock?

/*
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
* set. See SyncRepUpdateSyncStandbysDefined.
*
* Also check that the standby hasn't already replied. Unlikely race
* condition but we'll be fetching that cache line anyway so it's likely
* to be a low cost check.
*/
if (!WalSndCtl->sync_standbys_defined ||
lsn <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
return;
}

>
> > But since the
> > subsequent check will check it again while holding SyncRepLock, it's
> > no problem. Similarly even if we fail to notice that it has been
> > turned on
> Is this true? ISTM that after SyncRepUpdateSyncStandbysDefined()
> sets the flag to true, SyncRepWaitForLSN() basically doesn't seem
> to fail to notice that. No?

What I wanted to say is, in the current code, while the checkpointer
process is holding SyncRepLock to turn off sync_standbys_defined,
backends who reach SyncRepWaitForLSN() wait for the lock. Then, after
the checkpointer process releases SyncRepLock these backend can
enqueue themselves to the wait queue because they can see that
sync_standbys_defined is turned on. On the other hand if we do the
check without SyncRepLock, backends who reach SyncRepWaitForLSN() will
return instead of waiting, in spite of checkpointer process being
turning on sync_standbys_defined. Which means these backends are
failing to notice that it has been turned on, I thought.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-04-10 12:03:06 Re: PG compilation error with Visual Studio 2015/2017/2019
Previous Message Robert Haas 2020-04-10 11:40:06 Re: Parallel copy