| From: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
|---|---|
| To: | SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com> |
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Date: | 2026-02-26 04:58:31 |
| Message-ID: | CAE9k0Pm_6+4zW-X9zgBHhyLa9dqNKLM=zzUnVeH+ikoh45iALw@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi,
On Wed, Feb 25, 2026 at 7:21 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi Satya,
>
> On Wed, Feb 25, 2026 at 3:38 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram(at)gmail(dot)com> wrote:
> >
> >
> > Hi hackers,
> >
> > synchronized_standby_slots requires that every physical slot listed in the GUC has caught up before a logical failover slot is allowed to proceed with decoding. This is an ALL-of-N slots semantic. The logical slot availability model does not align with quorum replication semantics set using synchronous_standby_names which can be configured for quorum commit (ANY M of N).
> >
> > In a typical 3 Node HA deployment with quorum sync rep:
> >
> > Primary, standby1 (corresponds to sb1_slot), standby2 (corresponds to sb2_slot)
> > synchronized_standby_slots = 'sb1_slot, sb2_slot'
> > synchronous_standby_names = 'ANY 1 (standby1, standby2)'
> >
> > If standby1 goes down, synchronous commits still succeed because standby2 satisfies the quorum. However, logical decoding blocks indefinitely in WaitForStandbyConfirmation(), waiting for sb1_slot (corresponds to standby1) to catch up — even though the transaction is already safely committed on a quorum of synchronous standbys. This blocks logical decoding consumers from progressing and is inconsistent with the availability guarantee the DBA intended by choosing quorum commit.
>
> +1. This can indeed be a blocker for failover enabled logical
> replication. It not only has the potential to disrupt logical
> replication, but can also impact the primary server. Over time, it may
> silently lead to significant WAL accumulation on the primary,
> eventually causing disk-full scenarios and degrading the performance
> of applications running on the primary instance. Therefore, I too
> strongly believe this needs to be addressed to prevent such
> potentially disruptive situations.
>
> >
> >
> > Proposal:
> >
> > Make synchronized_standby_slots quorum aware i.e. extend the GUC to accept an ANY M (slot1, slot2, ...) syntax similar to synchronous_standby_names, so StandbySlotsHaveCaughtup() can return true when M of N slots (where M <= N and M >= 1) have caught up. I still prefer two different GUCs for this as the list of slots to be synchronized can still be different (for example, DBA may want to ensure Geo standby to be sync before allowing the logical decoding client to read the changes). I kept synchronized_standby_slots parse logic similar to synchronous_standby_names to keep things simple. The default behavior is also not changed for synchronized_standby_slots.
> >
>
> Thank you for the proposal. I can spend some time reviewing the
> changes and help take this forward. I would also be happy to hear
> others' thoughts and feedback on the proposal.
>
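To make the quoted scenario concrete, here is a minimal configuration
sketch (a hypothetical postgresql.conf fragment on the primary; slot and
standby names are taken from the example above):

```
# Primary's postgresql.conf (sketch, names from the example above)

# Quorum commit: ANY one of the two standbys confirming is enough
synchronous_standby_names = 'ANY 1 (standby1, standby2)'

# All-of-N semantics today: BOTH slots must have caught up before
# logical decoding of failover-enabled slots may proceed
synchronized_standby_slots = 'sb1_slot, sb2_slot'
```

With this combination, a commit can succeed via standby2 alone while
logical decoding still waits on sb1_slot, which is the inconsistency
being discussed.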
Thinking about this further, using quorum settings for
synchronized_standby_slots can leave at least one sync standby lagging
behind the logical replica, making it likely impossible to continue
with the existing logical replication setup after a failover to the
standby that lags behind. Here is what I mean:
Let's say we have 2 synchronous standbys with
"synchronized_standby_slots" configured as ANY 1 (sync_standby1,
sync_standby2). With this quorum setting, WAL only needs to be
confirmed by any one of the two standbys before it can be forwarded to
the logical replica. Now consider a scenario where sync_standby1 is
ahead of sync_standby2: new WAL gets confirmed by sync_standby1 and is
subsequently delivered to the logical replica. If sync_standby1 then
goes down and we failover to sync_standby2, the new primary will be at
a lower LSN than the logical replica, since sync_standby2 never
received that WAL. At this point, the logical replication slot on the
new primary is essentially stale, and the logical replication setup
that existed before the failover cannot be resumed. Hence, I think
it's important to ensure that the WAL (including all the necessary
data needed for logical replication) gets delivered to all the
servers/slots specified in synchronized_standby_slots before it gets
delivered to the logical replica.
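The divergence can be illustrated with a toy model (plain Python, not
PostgreSQL code; the helper names and the numeric LSNs are mine):

```python
# Toy model (not PostgreSQL internals): the highest LSN that may be
# handed to the logical replica under the two confirmation semantics.

def safe_lsn_all(confirmed):
    """ALL-of-N (current behavior): advance only to the slowest slot."""
    return min(confirmed.values())

def safe_lsn_any(confirmed, m):
    """ANY m of N (proposed): advance to the m-th highest confirmed LSN."""
    return sorted(confirmed.values(), reverse=True)[m - 1]

# sync_standby1 is ahead of sync_standby2.
confirmed = {"sync_standby1": 200, "sync_standby2": 100}

print(safe_lsn_all(confirmed))     # 100: no standby is ever behind the replica
print(safe_lsn_any(confirmed, 1))  # 200: sync_standby2 (at 100) now lags the replica
```

Under ANY 1, a failover to sync_standby2 leaves the new primary at LSN
100 while the logical replica has already consumed up to 200, which is
exactly the stale-slot situation described above.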
While I agree that not allowing quorum-like settings for this has the
potential to accumulate WAL and impact logical replication, I think we
can explore other ways to mitigate that concern separately.
Let's see what experts have to say on this.
--
With Regards,
Ashutosh Sharma.