From: | Jeremy Schneider <schneider(at)ardentperf(dot)com> |
---|---|
To: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: sync_standbys_defined and pg_stat_replication |
Date: | 2025-10-08 07:06:12 |
Message-ID: | 20251008000612.437a2333@ardentperf.com |
Lists: | pgsql-hackers |
On Mon, 6 Oct 2025 22:59:33 -0700
Jeremy Schneider <schneider(at)ardentperf(dot)com> wrote:
> For failover to work correctly, if someone changes the GUC
> synchronous_standby_names to enable sync replication, then we need to
> understand the exact moment when backends will begin to block in order
> to correctly determine when we can failover without data loss.
>
> There's an older mailing list thread that discusses one aspect of this
>
> https://www.postgresql.org/message-id/flat/CABrsG8j3kPD%2Bkbbsx_isEpFvAgaOBNGyGpsqSjQ6L8vwVUaZAQ%40mail.gmail.com
>
> I've also gone through the code for SyncRepWaitForLSN() and worked
> backwards to where the checkpointer sets sync_standbys_defined. But I
> have a question which I couldn't answer so far.
>
> It looks like sync_standbys_defined is only updated by the
> checkpointer process. Is there a short period of time where the
> pg_stat_replication view would show sync_state=sync and
> state=streaming, but the checkpointer has not yet updated
> sync_standbys_defined?
>
> I'm wondering if this is a race condition where COMMITs are not being
> blocked for replication but external tools which rely on
> pg_stat_replication would think it's safe to failover with zero data
> loss?
FYI - some more details on the background of my question are here:
https://github.com/cloudnative-pg/cloudnative-pg/issues/8790

I'm running Jepsen tests of a new CNPG feature (quorum failover), and
Jepsen picked up data loss when I ran it in conjunction with CNPG's
"preferred" dataDurability setting. My theory is that it may be related
to this delay before SyncRepWaitForLSN() starts blocking COMMITs. The
"preferred" durability configuration is the equivalent of "Max
Availability" mode with Oracle Data Guard Broker; if anyone is curious, I
have a table comparing Oracle modes to patroni/cnpg configs in this
blog post:
https://ardentperf.com/2025/10/05/testing-cloudnativepg-preferred-data-durability/
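
A toy model of the suspected window, in Python. This is only a sketch of
the hypothesis in the question, not PostgreSQL code and not confirmed
behavior: the names Commit, at_risk_commits, view_sync_at and flag_set_at
are made up for illustration, and the timestamps are invented. It assumes
the view can report sync_state=sync at some time before the checkpointer
sets sync_standbys_defined, and flags commits acknowledged in between
that the standby had not yet received.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    time: float       # when the primary acknowledged the COMMIT to the client
    replicated: bool  # whether the standby already had the WAL at that moment


def at_risk_commits(commits, view_sync_at, flag_set_at):
    """Return commits acknowledged inside the hypothesized window.

    view_sync_at: time the monitoring view first reports sync (assumption)
    flag_set_at:  time the checkpointer sets sync_standbys_defined (assumption)
    """
    return [
        c for c in commits
        if view_sync_at <= c.time < flag_set_at and not c.replicated
    ]


commits = [
    Commit(time=0.5, replicated=False),  # before sync was configured: async, expected
    Commit(time=1.2, replicated=False),  # in the window: acknowledged without waiting
    Commit(time=1.4, replicated=True),   # in the window, but the standby kept up
    Commit(time=2.5, replicated=True),   # after the flag is set: COMMIT waited
]

lost = at_risk_commits(commits, view_sync_at=1.0, flag_set_at=2.0)
print([c.time for c in lost])  # prints [1.2]
```

If the window exists, a tool that fails over based only on the view would
treat the commit at 1.2 as safely replicated when it was not.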
-Jeremy