From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
---|---|
To: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Allow logical replication in the same cluster |
Date: | 2025-09-05 02:50:59 |
Message-ID: | CAEoWx2kz1bSEaPzuqhe2=bcdtD5R7tBh8KhRCLqHFLrbdMaNqg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Hacker,
I was doing some test about logical replication a few days ago. When I
tried to setup a logical replication on my Macbook.
The basic workflow is simple:
```
Step 1: edit postgresql.conf and set:
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
Step 2: create two databases for pub and sub
% createdb pubdb
% createdb subdb
Step 3: create a table in pubdb, and create a publication
pubdb=# CREATE TABLE t (id int primary key, data text);
CREATE TABLE
pubdb=# INSERT INTO t VALUES (1, 'hello from pub');
INSERT 0 1
pubdb=# CREATE PUBLICATION mypub FOR TABLE t;
CREATE PUBLICATION
Step 4: create the same table in subdb
subdb=# CREATE TABLE t (id int primary key, data text);
CREATE TABLE
Step 5: create subscription in subdb
subdb=# CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost dbname=pubdb'
PUBLICATION mypub; <==== stuck here
```
In step 5, "CREATE SUBSCRIPTION" got stuck. Then I found the issue had been
discussed with [1] in 2017, but no more effort had been spent resolving the
issue.
Then I investigated the root cause. Feels like this is a architectural
problem. Because both pubdb and subdb run in the same cluster, so they
share the same transaction id serial.
In step 5, when subdb "CREATE SUBSCRIPTION", say the transaction id is 100,
what the backend worker process does is like:
1) start a xact (100)
2) insert a tuple into pg_subscription
3) request pub side to create a sub slot and wait for the result
4) commit
When the pub side receives the request to create a replication slot, it
needs to check no running transactions. However, xact 100 is running and
waiting for replication slot creation to finish. This is a deadlock, and
the deadlock exists only when pub and sub are in the same cluster.
To resolve the problem, the key is to let pub side know sub side is local
within the same cluster.
From sub side, "CREATE SUBSCRIPTION" statement uses a connection string to
specify where is publisher. It is impossible to decide if pub is within the
same cluster from the connection string.
From pub side, sub side uses libpq to send a "CREATE REPLICATION SLOT"
command to pub side, without modifying libpq, there is no way to let the
command carry more information. So pub side has no way to know if sub is in
the same cluster from the command.
So, I decided an "opt-in" solution. We can add a "local=true" option, so
sub side runs:
```
subdb=# CREATE SUBSCRIPTION mysub2 CONNECTION 'host=localhost dbname=pubdb'
PUBLICATION mypub WITH (local=true);
```
When "local=true" is set, subdb's backend end worker process stores the
current transaction id as well as its process id into a shared memory area
before sending "CREATE REPLICATION SLOT" command to pub side.
In pub side, if the shared memory area exists, and the xact id stored in
the shared memory area is the only ongoing xact, then it can skip waiting
for it. Then the deadlock is avoided.
```
subdb=# CREATE SUBSCRIPTION mysub2 CONNECTION 'host=localhost dbname=pubdb'
PUBLICATION mypub WITH (local=true);
NOTICE: created replication slot "mysub2" on publisher
CREATE SUBSCRIPTION
# data are properly replicated
subdb=# select * from t;
id | data
----+----------------
1 | hello from pub
2 | 2
3 | 3
(3 rows)
```
I think this solution is relatively safe. Because without specifying
"local=true", this patch will not impact anything. And a local logical
replication is not something should be done in a production environment.
This patch has a limitation. In sub side, you cannot run multiple "CREATE
SUBSCRIPTION WITH (local=true)" concurrently, otherwise they may still
trigger deadlock. But I don't think this is a big issue, because the
"local=true" suppose to be only used in dev environment, and we can clearly
state the limitation in doc.
The attached patch is unpolished. You may download and test it. Please
mainly focus on the design for now. Once the design is agreed by the
community, I will polish the change, add tests and update the doc.
[1]
https://www.postgresql.org/message-id/20170426165954.GK14000%40momjian.us
Chao Li (Evan)
---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Allow-logical-replication-in-the-same-cluster.patch | application/octet-stream | 8.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Hayato Kuroda (Fujitsu) | 2025-09-05 02:55:14 | RE: Resetting recovery target parameters in pg_createsubscriber |
Previous Message | Nathan Bossart | 2025-09-05 02:00:22 | Re: [PATCH] Add tests for Bitmapset |