From: | Fabrice Chapuis <fabrice636861(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | failover logical replication slots |
Date: | 2025-06-10 15:46:33 |
Message-ID: | CAA5-nLD0vKn6T1-OHROBNfN2Pxa17zVo4UoVBdfHn2y=7nKixA@mail.gmail.com |
Lists: | pgsql-hackers |
I'm working with logical replication in a PostgreSQL 17 setup, and I'm
exploring the new option to make replication slots failover-safe in a
highly available environment that uses physical standby nodes managed by
Patroni.
After a switchover, I encounter an error message in the PostgreSQL logs and
observe unexpected behavior.
Here are the different steps I followed:
1) Setting up a new subscription
Logical replication is established between two databases on the same
PostgreSQL instance.
A logical replication slot is created on the source database:
SELECT pg_create_logical_replication_slot('sub_test', 'pgoutput', false,
false, true);
A subscription is then configured on the target database:
CREATE SUBSCRIPTION sub_test
CONNECTION 'dbname=test host=localhost port=5432 user=user_test'
PUBLICATION pub_test
WITH (create_slot=false, copy_data=false, failover=true);
The logical replication slot is active and in failover mode.
\dRs+
List of subscriptions
+-[ RECORD 1 ]-------+-----------------------------------------------------+
| Name               | sub_test                                            |
| Owner              | postgres                                            |
| Enabled            | t                                                   |
| Publication        | {pub_test}                                          |
| Binary             | f                                                   |
| Streaming          | off                                                 |
| Two-phase commit   | d                                                   |
| Disable on error   | f                                                   |
| Origin             | any                                                 |
| Password required  | t                                                   |
| Run as owner?      | f                                                   |
| Failover           | t                                                   |
| Synchronous commit | off                                                 |
| Conninfo           | dbname=test host=localhost port=5432 user=user_test |
| Skip LSN           | 0/0                                                 |
+--------------------+-----------------------------------------------------+
select * from pg_replication_slots where slot_type = 'logical';
+-[ RECORD 1 ]--------+----------------+
| slot_name | sub_test |
| plugin | pgoutput |
| slot_type | logical |
| datoid | 58458 |
| database | test |
| temporary | f |
| active | t |
| active_pid | 739313 |
| xmin | |
| catalog_xmin | 1976743 |
| restart_lsn | 8/5F000028 |
| confirmed_flush_lsn | 8/5F000060 |
| wal_status | reserved |
| safe_wal_size | |
| two_phase | f |
| inactive_since | |
| conflicting | f |
| invalidation_reason | |
| failover | t |
| synced | f |
+---------------------+----------------+
2) Starting the physical standby
A logical replication slot is successfully created on the standby:
select * from pg_replication_slots where slot_type = 'logical';
+-[ RECORD 1 ]--------+-------------------------------+
| slot_name | sub_test |
| plugin | pgoutput |
| slot_type | logical |
| datoid | 58458 |
| database | test |
| temporary | f |
| active | f |
| active_pid | |
| xmin | |
| catalog_xmin | 1976743 |
| restart_lsn | 8/5F000028 |
| confirmed_flush_lsn | 8/5F000060 |
| wal_status | reserved |
| safe_wal_size | |
| two_phase | f |
| inactive_since | 2025-06-10 16:30:38.633723+02 |
| conflicting | f |
| invalidation_reason | |
| failover | t |
| synced | t |
+---------------------+-------------------------------+
3) Cluster switchover
The switchover is initiated using the Patroni command:
patronictl switchover
The operation completes successfully, and roles are reversed in the cluster.
4) Issue encountered
After the switchover, an error appears in the PostgreSQL logs:
2025-06-10 16:40:58.996 CEST [739829]: [1-1] user=,db=,client=,application= LOG: slot sync worker started
2025-06-10 16:40:59.011 CEST [739829]: [2-1] user=,db=,client=,application= ERROR: exiting from slot synchronization because same name slot "sub_test" already exists on the standby
The slot on the new standby is not in sync mode (synced = f):
select * from pg_replication_slots where slot_type = 'logical';
+-[ RECORD 1 ]--------+-------------------------------+
| slot_name | sub_test |
| plugin | pgoutput |
| slot_type | logical |
| datoid | 58458 |
| database | test |
| temporary | f |
| active | f |
| active_pid | |
| xmin | |
| catalog_xmin | 1976743 |
| restart_lsn | 8/5F000080 |
| confirmed_flush_lsn | 8/5F000130 |
| wal_status | reserved |
| safe_wal_size | |
| two_phase | f |
| inactive_since | 2025-06-10 16:33:49.573016+02 |
| conflicting | f |
| invalidation_reason | |
| failover | t |
| synced | f |
+---------------------+-------------------------------+
In the source code (slotsync.c), the check for the synced flag triggers an
error:
    /* Search for the named slot */
    if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
    {
        bool synced;

        SpinLockAcquire(&slot->mutex);
        synced = slot->data.synced;
        SpinLockRelease(&slot->mutex);

        /* A user-created slot with the same name exists → raise ERROR */
        if (!synced)
            ereport(ERROR,
                    errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                    errmsg("exiting from slot synchronization because same"
                           " name slot \"%s\" already exists on the standby",
                           remote_slot->name));
    }
5) Dropping the slot
If the slot on the standby is deleted, it is then recreated with synced =
true, and at that point, it successfully resynchronizes with the primary
slot. Everything works correctly.
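
The behaviour observed in steps 4 and 5 can be summarised with a small standalone model. This is only a sketch, not PostgreSQL code: LocalSlot, SyncAction and decide_sync_action are hypothetical names introduced for illustration, and the model assumes that the only inputs to the decision are whether a slot with the same name exists locally and what its synced flag is.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a local slot on the standby. */
typedef struct
{
    const char *name;
    bool        exists;     /* a local slot with this name is present */
    bool        synced;     /* value shown in pg_replication_slots.synced */
} LocalSlot;

/* Hypothetical outcomes of one synchronization attempt for one slot. */
typedef enum
{
    SYNC_CREATE,            /* no local slot: create it with synced = true */
    SYNC_UPDATE,            /* local synced slot: advance it from the primary */
    SYNC_ERROR              /* local slot with synced = false: report ERROR */
} SyncAction;

/*
 * Model of the check quoted from slotsync.c: a local slot that was not
 * created by slot synchronization (synced = false) is treated as a
 * user-created slot, so synchronization stops with an error instead of
 * flipping the flag.
 */
static SyncAction
decide_sync_action(const LocalSlot *local)
{
    if (!local->exists)
        return SYNC_CREATE;
    if (!local->synced)
        return SYNC_ERROR;
    return SYNC_UPDATE;
}
```

Under these assumptions, the slot left on the old primary after the switchover ({"sub_test", true, false}) maps to SYNC_ERROR, which matches the log message in step 4, while the state after dropping the slot ({"sub_test", false, false}) maps to SYNC_CREATE, which matches step 5.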
Question:
Why does the synced flag fail to change to true, even though
sync_replication_slots is enabled (on)?
Thanks for your help,
Fabrice