Re: SyncRepLock acquired exclusively in default configuration

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Asim Praveen <pasim(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Ashwin Agrawal <ashwinstar(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "Xin (Shin) Zhang (Pivotal)" <xzhang(at)pivotal(dot)io>
Subject: Re: SyncRepLock acquired exclusively in default configuration
Date: 2020-08-11 15:27:16
Message-ID: CA+TgmoYsu0t4fpLttQK7JUth92OFHjHnJ1Z+uCm0id6D6PGZbQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 11, 2020 at 7:55 AM Asim Praveen <pasim(at)vmware(dot)com> wrote:
> There is no out-of-order execution hazard in the scenario you are describing, so memory barriers don’t seem to fit. Using locks to synchronise the checkpointer process and a committing backend process is the right way. We have made a conscious decision to bypass the lock, which looks correct in this case.

Yeah, I am not immediately seeing why a memory barrier would help anything here.

> As an aside, there is a small (?) window where a change to the synchronous_standby_names GUC is only partially propagated among committing backends, the checkpointer and the walsender. Such a window may result in the walsender declaring a standby as synchronous while a committing backend fails to wait for it in SyncRepWaitForLSN. The root cause is that the walsender uses sync_standby_priority, a per-walsender variable, to tell whether a standby is synchronous. It is updated when the walsender processes a config change. In contrast, sync_standbys_defined, a variable updated by the checkpointer, is used by committing backends to determine whether they need to wait. If the checkpointer is busy flushing buffers, it may take longer than the walsender to reflect a change in sync_standbys_defined. This is a low-impact problem, and it should be OK to live with it.

I think this gets to the root of the issue. If we check the flag
without a lock, we might see a slightly stale value. But, considering
that there's no particular amount of time within which configuration
changes are guaranteed to take effect, maybe that's OK. However, there
is one potential gotcha here: if the walsender declares the standby to
be synchronous, a user can see that, right? So maybe there's this
problem: a user sees that the standby is synchronous and expects a
transaction committing afterward to provoke a wait, but really it
doesn't. Now the user is unhappy, feeling that the system didn't
perform according to expectations.
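
To make that ordering concrete, here is a rough toy model of the window,
not PostgreSQL source. Only the two variable names come from this thread;
the struct, the helper functions, and the assumption that a priority of 0
means "async" are illustrative stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only, not PostgreSQL source. The field names mirror the
 * variables mentioned in this thread; everything else is hypothetical.
 */
typedef struct SimState
{
    bool sync_standbys_defined;  /* shared flag, updated by the checkpointer */
    int  sync_standby_priority;  /* per-walsender; 0 is taken to mean async */
} SimState;

/* The walsender picks up the new synchronous_standby_names setting. */
static void
walsender_reload(SimState *s, bool names_set)
{
    s->sync_standby_priority = names_set ? 1 : 0;
}

/* The checkpointer picks up the same change, possibly much later. */
static void
checkpointer_reload(SimState *s, bool names_set)
{
    s->sync_standbys_defined = names_set;
}

/* What a user looking at the walsender's state would conclude. */
static bool
standby_reported_sync(const SimState *s)
{
    return s->sync_standby_priority > 0;
}

/* The unlocked check a committing backend makes before deciding to wait. */
static bool
commit_would_wait(const SimState *s)
{
    return s->sync_standbys_defined;
}

/* True while the standby looks synchronous but commits do not wait. */
static bool
in_stale_window(const SimState *s)
{
    return standby_reported_sync(s) && !commit_would_wait(s);
}

/* Replays the ordering described above; returns true if the window was seen. */
static bool
replay_scenario(void)
{
    SimState s = {false, 0};
    bool     seen;

    walsender_reload(&s, true);      /* walsender reacts first */
    seen = in_stale_window(&s);      /* a commit here would not wait */
    checkpointer_reload(&s, true);   /* checkpointer catches up */
    assert(!in_stale_window(&s));    /* window is closed again */
    return seen;
}
```

In this model the window exists only between the two reload calls, so its
length is however long the checkpointer takes to process the change. Taking
a lock around the backend's check would not shrink it, since the flag is
simply not updated yet.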

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
