Re: SyncRepLock acquired exclusively in default configuration

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Asim Praveen <pasim(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Ashwin Agrawal <ashwinstar(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "Xin (Shin) Zhang (Pivotal)" <xzhang(at)pivotal(dot)io>
Subject: Re: SyncRepLock acquired exclusively in default configuration
Date: 2020-08-19 12:41:03
Message-ID: 3273455e-acd6-fe2f-8136-8013e2a475b8@oss.nttdata.com
Lists: pgsql-hackers

On 2020/08/12 15:32, Masahiko Sawada wrote:
> On Wed, 12 Aug 2020 at 14:06, Asim Praveen <pasim(at)vmware(dot)com> wrote:
>>
>>
>>
>>> On 11-Aug-2020, at 8:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> I think this gets to the root of the issue. If we check the flag
>>> without a lock, we might see a slightly stale value. But, considering
>>> that there's no particular amount of time within which configuration
>>> changes are guaranteed to take effect, maybe that's OK. However, there
>>> is one potential gotcha here: if the walsender declares the standby to
>>> be synchronous, a user can see that, right? So maybe there's this
>>> problem: a user sees that the standby is synchronous and expects a
>>> transaction committing afterward to provoke a wait, but really it
>>> doesn't. Now the user is unhappy, feeling that the system didn't
>>> perform according to expectations.
>>
>> Yes, pg_stat_replication reports a standby as sync as soon as the walsender updates the priority of the standby to something other than 0.
>>
>> The potential gotcha referred to above doesn’t seem too severe. What is the likelihood of someone setting the synchronous_standby_names GUC to either “*” or a standby name and then immediately promoting that standby? If the standby is promoted before the checkpointer on master gets a chance to update sync_standbys_defined in shared memory, commits made during this interval on master may not make it to the standby. Upon promotion, those commits may be lost.
>
> I think that if the standby is far behind the primary and the
> primary crashes, the likelihood of losing commits might get
> higher. The user can see via pg_stat_replication that the standby
> became a synchronous standby, but commits complete without a wait
> because the checkpointer hasn't updated sync_standbys_defined yet. If
> the primary crashes before the standby catches up and the user does a
> failover, the committed transaction will be lost, even though the user
> expects that the commit has been replicated to the standby synchronously.
> And this can happen even without the patch, right?

I think you're right. This issue can happen even without the patch.

Maybe we should not mark the standby as "sync" whenever sync_standbys_defined
is false, even if synchronous_standby_names is actually set and the walsenders
have already detected that? Or do we need a more aggressive approach:
making the checkpointer update the sync_standby_priority values of
all the walsenders? ISTM that the latter looks like overkill...
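
To make the first option concrete, here is a rough sketch in C. It is only
a toy model, not PostgreSQL source: the Toy* structs and toy_* functions are
hypothetical stand-ins for the shared-memory state discussed above, and the
reported state is reduced to "sync or not" for brevity. It assumes, as
described in this thread, that the walsender sets the priority while the
checkpointer sets sync_standbys_defined later.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not the real PostgreSQL structs. */
typedef struct
{
    bool sync_standbys_defined;  /* assumed to be set by the checkpointer */
} ToyWalSndCtl;

typedef struct
{
    int sync_standby_priority;   /* assumed to be set by the walsender; 0 = async */
} ToyWalSnd;

/* Behavior as described in the thread: the priority alone decides. */
static bool
toy_reported_sync_current(const ToyWalSndCtl *ctl, const ToyWalSnd *snd)
{
    (void) ctl;                  /* the flag is not consulted */
    return snd->sync_standby_priority > 0;
}

/* First option: never report "sync" while the flag is still false. */
static bool
toy_reported_sync_proposed(const ToyWalSndCtl *ctl, const ToyWalSnd *snd)
{
    if (!ctl->sync_standbys_defined)
        return false;
    return snd->sync_standby_priority > 0;
}

/* In this model a commit waits for the standby only once the flag is set. */
static bool
toy_commit_waits(const ToyWalSndCtl *ctl)
{
    return ctl->sync_standbys_defined;
}

/* Print what a user would observe for a given state. */
static void
toy_print_state(const ToyWalSndCtl *ctl, const ToyWalSnd *snd)
{
    printf("current: %s, proposed: %s, commit waits: %s\n",
           toy_reported_sync_current(ctl, snd) ? "sync" : "async",
           toy_reported_sync_proposed(ctl, snd) ? "sync" : "async",
           toy_commit_waits(ctl) ? "yes" : "no");
}
```

In the window where the walsender has already set a nonzero priority but the
checkpointer has not yet set the flag, the first function reports "sync" even
though commits do not wait; the second one reports "async" until both agree.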

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
