Issue with logical replication slot during switchover

From: Fabrice Chapuis <fabrice636861(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Issue with logical replication slot during switchover
Date: 2025-08-07 13:20:04
Message-ID: CAA5-nLAqGpBFEAr2XNYMj3E+39caQra_SJeB5MCtp7PCyLTiOg@mail.gmail.com
Lists: pgsql-hackers

Hi,

An issue occurred during the first switchover on a two-node cluster running
PostgreSQL 17.5 and managed by Patroni 4.0.5.
Logical replication is configured on the same instance, using the new
feature that makes logical replication slots failover-safe in a highly
available environment. Logical slot management is currently disabled in
Patroni.

The following output was captured during the switchover:

1. Run the switchover with Patroni

patronictl switchover

Current cluster topology

+ Cluster: ClusterX (7529893278186104053) ------+----+-----------+
| Member   | Host         | Role    | State     | TL | Lag in MB |
+----------+--------------+---------+-----------+----+-----------+
| node_1   | xxxxxxxxxxxx | Leader  | running   |  4 |           |
| node_2   | xxxxxxxxxxxx | Replica | streaming |  4 |         0 |
+----------+--------------+---------+-----------+----+-----------+

2. Check the slot on the new Primary

select * from pg_replication_slots where slot_type = 'logical';
+-[ RECORD 1 ]--------+----------------+
| slot_name           | logical_slot   |
| plugin              | pgoutput       |
| slot_type           | logical        |
| datoid              | 25605          |
| database            | db_test        |
| temporary           | f              |
| active              | t              |
| active_pid          | 3841546        |
| xmin                |                |
| catalog_xmin        | 10399          |
| restart_lsn         | 0/37002410     |
| confirmed_flush_lsn | 0/37002448     |
| wal_status          | reserved       |
| safe_wal_size       |                |
| two_phase           | f              |
| inactive_since      |                |
| conflicting         | f              |
| invalidation_reason |                |
| failover            | t              |
| synced              | t              |
+---------------------+----------------+
Logical replication is active again after the promotion.

3. Check the slot on the new standby
select * from pg_replication_slots where slot_type = 'logical';
+-[ RECORD 1 ]--------+-------------------------------+
| slot_name           | logical_slot                  |
| plugin              | pgoutput                      |
| slot_type           | logical                       |
| datoid              | 25605                         |
| database            | db_test                       |
| temporary           | f                             |
| active              | f                             |
| active_pid          |                               |
| xmin                |                               |
| catalog_xmin        | 10397                         |
| restart_lsn         | 0/3638F5F0                    |
| confirmed_flush_lsn | 0/3638F6A0                    |
| wal_status          | reserved                      |
| safe_wal_size       |                               |
| two_phase           | f                             |
| inactive_since      | 2025-08-05 10:21:03.342587+02 |
| conflicting         | f                             |
| invalidation_reason |                               |
| failover            | t                             |
| synced              | f                             |
+---------------------+-------------------------------+

The synced flag stays false on the new standby, and the following error
appears in the log:
2025-06-10 16:40:58.996 CEST [739829]: [1-1] user=,db=,client=,application=
LOG: slot sync worker started
2025-06-10 16:40:59.011 CEST [739829]: [2-1] user=,db=,client=,application=
ERROR: exiting from slot synchronization because same name slot
"logical_slot" already exists on the standby

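To illustrate why the worker stops, here is a minimal, self-contained C
model of the same-name check as I understand it. It is not PostgreSQL
source: SlotInfo, SyncAction and current_sync_action() are hypothetical
names used only for this sketch, and the real check in
synchronize_one_slot() does more work around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a slot (not the PostgreSQL struct). */
typedef struct SlotInfo
{
    const char *name;
    bool        failover;   /* pg_replication_slots.failover */
    bool        synced;     /* pg_replication_slots.synced */
} SlotInfo;

typedef enum SyncAction
{
    SYNC_CREATE_LOCAL,  /* no local slot with that name: create a synced copy */
    SYNC_UPDATE_LOCAL,  /* local slot is already a synced copy: refresh it */
    SYNC_RAISE_ERROR    /* same-name slot not marked synced: worker exits */
} SyncAction;

/*
 * Model of the current decision. 'local' is NULL when the standby has no
 * slot with the remote slot's name.
 */
static SyncAction
current_sync_action(const SlotInfo *remote, const SlotInfo *local)
{
    assert(remote != NULL);

    if (local == NULL)
        return SYNC_CREATE_LOCAL;

    /* A same-name slot that is not marked synced is treated as user-created. */
    if (!local->synced)
        return SYNC_RAISE_ERROR;

    return SYNC_UPDATE_LOCAL;
}
```

With the slot left over from the old primary (synced = false), the model
returns SYNC_RAISE_ERROR on every worker start, which matches the log above.
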
I would like to propose a way to address the issue:
Since the logical slot has failover enabled on both the primary and the
standby, an attempt could be made to resynchronize them.
I modified slotsync.c as follows:
+++ b/src/backend/replication/logical/slotsync.c
@@ -649,24 +649,46 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)

         return false;
     }
-
-    /* Search for the named slot */
+    // Both local and remote slot have the same name
     if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
     {
         bool synced;
+        bool failover_status = remote_slot->failover;

         SpinLockAcquire(&slot->mutex);
         synced = slot->data.synced;
         SpinLockRelease(&slot->mutex);
+
+        if (!synced){
+
+            Assert(!MyReplicationSlot);
+
+            if (failover_status){
+
+                ReplicationSlotAcquire(remote_slot->name, true, true);
+
+                // Put the synced flag to true to attempt resynchronizing failover slot on the standby
+                MyReplicationSlot->data.synced = true;
+
+                ReplicationSlotMarkDirty();

-        /* User-created slot with the same name exists, raise ERROR. */
-        if (!synced)
-            ereport(ERROR,
+                ReplicationSlotRelease();
+
+                /* Get rid of a replication slot that is no longer wanted */
+                ereport(WARNING,
+                        errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                        errmsg("slot \"%s\" local slot has the same name as remote slot and they are in failover mode, try to synchronize them",
+                               remote_slot->name));
+                return false; /* Going back to the main loop after dropping the failover slot */
+            }
+            else
+                /* User-created slot with the same name exists, raise ERROR. */
+                ereport(ERROR,

                     errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("exiting from slot synchronization because same"
-                           " name slot \"%s\" already exists on the standby",
-                           remote_slot->name));
-
+                           " name slot \"%s\" already exists on the standby",
+                           remote_slot->name));
+        }
         /*
          * The slot has been synchronized before.
          *
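
The intended effect of the patch can be sketched with a small standalone C
model. SlotInfo, SyncAction and proposed_sync_action() are hypothetical
names, not PostgreSQL APIs, and locking and persistence are left out. Note
that the model, like the patch as posted, tests only the remote slot's
failover flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a slot (not the PostgreSQL struct). */
typedef struct SlotInfo
{
    const char *name;
    bool        failover;   /* pg_replication_slots.failover */
    bool        synced;     /* pg_replication_slots.synced */
} SlotInfo;

typedef enum SyncAction
{
    SYNC_CREATE_LOCAL,      /* no local slot with that name */
    SYNC_UPDATE_LOCAL,      /* local slot is already a synced copy */
    SYNC_RETRY_NEXT_CYCLE,  /* slot adopted as synced, WARNING, retry later */
    SYNC_RAISE_ERROR        /* user-created same-name slot: worker exits */
} SyncAction;

/*
 * Model of the proposed decision. 'local' is NULL when the standby has no
 * slot with the remote slot's name; it may be modified to mark it synced.
 */
static SyncAction
proposed_sync_action(const SlotInfo *remote, SlotInfo *local)
{
    assert(remote != NULL);

    if (local == NULL)
        return SYNC_CREATE_LOCAL;

    if (local->synced)
        return SYNC_UPDATE_LOCAL;

    if (remote->failover)
    {
        /* Adopt the leftover slot instead of failing. */
        local->synced = true;
        return SYNC_RETRY_NEXT_CYCLE;
    }

    return SYNC_RAISE_ERROR;
}
```

Called twice for the slot left on the old primary, the model first returns
SYNC_RETRY_NEXT_CYCLE and flips synced to true, then returns
SYNC_UPDATE_LOCAL, so the next cycle treats it as an ordinary synced slot.
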
This message follows the discussions started in this thread:
https://www.postgresql.org/message-id/CAA5-nLDvnqGtBsKu4T_s-cS%2BdGbpSLEzRwgep1XfYzGhQ4o65A%40mail.gmail.com

Any help in moving this forward would be appreciated.

Best regards,

Fabrice
