Re: sync_standbys_defined and pg_stat_replication

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Jeremy Schneider <schneider(at)ardentperf(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: sync_standbys_defined and pg_stat_replication
Date: 2025-10-08 17:38:53
Message-ID: CANwKhkMTkckNSoz0kUYFvaMDA_6g2uuABcvhoowu64e9e+op=A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 7 Oct 2025 at 08:59, Jeremy Schneider <schneider(at)ardentperf(dot)com> wrote:
> For failover to work correctly, if someone changes the GUC
> synchronous_standby_names to enable sync replication, then we need to
> understand the exact moment when backends will begin to block in order
> to correctly determine when we can failover without data loss.

There is an early out in SyncRepWaitForLSN() when
WalSndCtl->sync_standbys_status has SYNC_STANDBY_DEFINED unset. That
flag gets set by the checkpointer in
SyncRepUpdateSyncStandbysDefined() via CheckpointWriteDelay() among
other places. But only when it's not executing a fast checkpoint or
it's not behind on checkpoints.

In other words, synchronous_standby_names will not become effective
until checkpointer has some downtime. While this is a small problem on
its own, there is no way to check if this has happened or not.

For the config update getting delayed the fix seems simple - just do
the config update unconditionally. Patch attached.

For the other problem, my thinking is to provide a new function that
allows a user to check if synchronous replication is active.

Ideally this function would give other information also needed by
cluster managers. Specifically when a replica is removed from
synchronous standby names we would need still need to consider that
replica as a potential synchronous replica until a quorum matching the
current synchronous_standby_names setting overtakes the last LSN
confirmed by a replica matching the removed name.

To illustrate the situation where this is needed, consider s_s_n =
'ANY 1 (A B)'. While this setting is active we have to check latest
replicated LSN from both A and B to know which one to promote. Lets
say transaction X is replicated to A and confirmed, but not yet to B.
Now A is removed so s_s_n becomes 'ANY 1 (B)'. Based on this setting
it is always safe to promote B, but until B receives the LSN that was
on primary when synchronous_standby_names was changed, it might not
have all the data. This one is possible to work around by checking the
relevant values from pg_stat_replication, but it would be nice to have
a neater interface.

My proposal is something like this:

postgres=# SELECT * FROM pg_sync_replication_status();
is_active | synchronous_standby_names | has_quorum
-----------+---------------------------+------------
t | ANY 1 (A B) | f
(1 row)

Thoughts?

Regards,
Ants Aasma

Attachment Content-Type Size
v1-0001-Update-checkpointer-config-even-during-fast-check.patch text/x-patch 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sami Imseih 2025-10-08 17:47:50 Re: another autovacuum scheduling thread
Previous Message Melanie Plageman 2025-10-08 17:37:35 Re: Fix overflow of nbatch