From: | Ants Aasma <ants(dot)aasma(at)cybertec(dot)at> |
---|---|
To: | Jeremy Schneider <schneider(at)ardentperf(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: sync_standbys_defined and pg_stat_replication |
Date: | 2025-10-08 17:38:53 |
Message-ID: | CANwKhkMTkckNSoz0kUYFvaMDA_6g2uuABcvhoowu64e9e+op=A@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, 7 Oct 2025 at 08:59, Jeremy Schneider <schneider(at)ardentperf(dot)com> wrote:
> For failover to work correctly, if someone changes the GUC
> synchronous_standby_names to enable sync replication, then we need to
> understand the exact moment when backends will begin to block in order
> to correctly determine when we can failover without data loss.
There is an early out in SyncRepWaitForLSN() when
WalSndCtl->sync_standbys_status has SYNC_STANDBY_DEFINED unset. That
flag gets set by the checkpointer in
SyncRepUpdateSyncStandbysDefined() via CheckpointWriteDelay() among
other places. But only when it's not executing a fast checkpoint or
it's not behind on checkpoints.
In other words, synchronous_standby_names will not become effective
until checkpointer has some downtime. While this is a small problem on
its own, there is no way to check if this has happened or not.
For the config update getting delayed the fix seems simple - just do
the config update unconditionally. Patch attached.
For the other problem, my thinking is to provide a new function that
allows a user to check if synchronous replication is active.
Ideally this function would give other information also needed by
cluster managers. Specifically when a replica is removed from
synchronous standby names we would need still need to consider that
replica as a potential synchronous replica until a quorum matching the
current synchronous_standby_names setting overtakes the last LSN
confirmed by a replica matching the removed name.
To illustrate the situation where this is needed, consider s_s_n =
'ANY 1 (A B)'. While this setting is active we have to check latest
replicated LSN from both A and B to know which one to promote. Lets
say transaction X is replicated to A and confirmed, but not yet to B.
Now A is removed so s_s_n becomes 'ANY 1 (B)'. Based on this setting
it is always safe to promote B, but until B receives the LSN that was
on primary when synchronous_standby_names was changed, it might not
have all the data. This one is possible to work around by checking the
relevant values from pg_stat_replication, but it would be nice to have
a neater interface.
My proposal is something like this:
postgres=# SELECT * FROM pg_sync_replication_status();
is_active | synchronous_standby_names | has_quorum
-----------+---------------------------+------------
t | ANY 1 (A B) | f
(1 row)
Thoughts?
Regards,
Ants Aasma
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Update-checkpointer-config-even-during-fast-check.patch | text/x-patch | 1.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Sami Imseih | 2025-10-08 17:47:50 | Re: another autovacuum scheduling thread |
Previous Message | Melanie Plageman | 2025-10-08 17:37:35 | Re: Fix overflow of nbatch |