Re: Multi-Master Logical Replication

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Multi-Master Logical Replication
Date: 2022-05-11 10:16:04
Message-ID: CALDaNm0BLFfb=m_VxFA5KaLuQnP9Q0KV43mazNTVMmD2X9FsCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 28, 2022 at 5:20 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> MULTI-MASTER LOGICAL REPLICATION
>
> 1.0 BACKGROUND
>
> Let’s assume that a user wishes to set up a multi-master environment
> so that a set of PostgreSQL instances (nodes) use logical replication
> to share tables with every other node in the set.
>
> We define this as a multi-master logical replication (MMLR) node-set.
>
> <please refer to the attached node-set diagram>
>
> 1.1 ADVANTAGES OF MMLR
>
> - Increases write scalability (e.g., all nodes can write arbitrary data).
> - Allows load balancing
> - Allows rolling updates of nodes (e.g., logical replication works
> between different major versions of PostgreSQL).
> - Improves the availability of the system (e.g., no single point of failure)
> - Improves performance (e.g., lower latencies for geographically local nodes)
>
> 2.0 MMLR AND POSTGRESQL
>
> It is already possible to configure a kind of MMLR set in PostgreSQL
> 15 using PUB/SUB, but it is very restrictive because it can only work
> when no two nodes operate on the same table. This is because when two
> nodes try to share the same table then there becomes a circular
> recursive problem where Node1 replicates data to Node2 which is then
> replicated back to Node1 and so on.
>
> To prevent the circular recursive problem Vignesh is developing a
> patch [1] that introduces new SUBSCRIPTION options "local_only" (for
> publishing only data originating at the publisher node) and
> "copy_data=force". Using this patch, we have created a script [2]
> demonstrating how to set up all the above multi-node examples. An
> overview of the necessary steps is given in the next section.
>
> 2.1 STEPS – Adding a new node N to an existing node-set
>
> step 1. Prerequisites – Apply Vignesh’s patch [1]. All nodes in the
> set must be visible to each other by a known CONNECTION. All shared
> tables must already be defined on all nodes.
>
> step 2. On node N do CREATE PUBLICATION pub_N FOR ALL TABLES
>
> step 3. All other nodes then CREATE SUBSCRIPTION to PUBLICATION pub_N
> with "local_only=on, copy_data=on" (this will replicate initial data
> from the node N tables to every other node).
>
> step 4. On node N, temporarily ALTER PUBLICATION pub_N to prevent
> replication of 'truncate', then TRUNCATE all tables of node N, then
> re-allow replication of 'truncate'.
>
> step 5. On node N do CREATE SUBSCRIPTION to the publications of all
> other nodes in the set
> 5a. Specify "local_only=on, copy_data=force" for exactly one of the
> subscriptions (this will make the node N tables now have the same
> data as the other nodes)
> 5b. Specify "local_only=on, copy_data=off" for all other subscriptions.
>
> step 6. Result - Now changes to any table on any node should be
> replicated to every other node in the set.
>
> Note: Steps 4 and 5 need to be done within the same transaction to
> avoid loss of data in case of some command failure. (Because we can't
> perform create subscription in a transaction, we need to create the
> subscription in a disabled mode first and then enable it in the
> transaction).
>
> 2.2 DIFFICULTIES
>
> Notice that it becomes increasingly complex to configure MMLR manually
> as the number of nodes in the set increases. There are also some
> difficulties such as
> - dealing with initial table data
> - coordinating the timing to avoid concurrent updates
> - getting the SUBSCRIPTION options for copy_data exactly right.
>
> 3.0 PROPOSAL
>
> To make the MMLR setup simpler, we propose to create a new API that
> will hide all the step details and remove the burden on the user to
> get it right without mistakes.
>
> 3.1 MOTIVATION
> - MMLR (sharing the same tables) is not currently possible
> - Vignesh's patch [1] makes MMLR possible, but the manual setup is
> still quite difficult
> - An MMLR implementation can solve the timing problems (e.g., using
> Database Locking)
>
> 3.2 API
>
> Preferably the API would be implemented as new SQL functions in
> PostgreSQL core, however, implementation using a contrib module or
> some new SQL syntax may also be possible.
>
> SQL functions will be like below:
> - pg_mmlr_set_create = create a new set, and give it a name
> - pg_mmlr_node_attach = attach the current node to a specified set
> - pg_mmlr_node_detach = detach a specified node from a specified set
> - pg_mmlr_set_delete = delete a specified set
>
> For example, internally the pg_mmlr_node_attach API function would
> execute the equivalent of all the CREATE PUBLICATION, CREATE
> SUBSCRIPTION, and TRUNCATE steps described above.
>
> Notice this proposal has some external API similarities with the BDR
> extension [3] (which also provides multi-master logical replication),
> although we plan to implement it entirely using PostgreSQL’s PUB/SUB.
>
> 4.0 ACKNOWLEDGEMENTS
>
> The following people have contributed to this proposal – Hayato
> Kuroda, Vignesh C, Peter Smith, Amit Kapila.
>
> 5.0 REFERENCES
>
> [1] https://www.postgresql.org/message-id/flat/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/CAHut%2BPvY2P%3DUL-X6maMA5QxFKdcdciRRCKDH3j%3D_hO8u2OyRYg%40mail.gmail.com
> [3] https://www.enterprisedb.com/docs/bdr/latest/
>
> [END]
>
> ~~~
>
> One of my colleagues will post more detailed information later.

MMLR is changed to LRG(Logical replication group) to avoid confusions.

The LRG functionality will be implemented as given below:
The lrg contrib module provides a set of API to allow setting up
bi-directional logical replication among different nodes. The lrg
stands for Logical Replication Group.
To use this functionality shared_preload_libraries must be set to lrg like:
shared_preload_libraries = lrg
A new process "lrg launcher" is added which will be launched when the
extension is created. This process is responsible for checking if user
has created new logical replication group or if the user is attaching
a new node to the logical replication group or detach a node or drop a
logical replication group and if so, then launches another new “lrg
worker” for the corresponding database.
The new process "lrg worker" is responsible for handling the core
tasks of lrg_create, lrg_node_attach, lrg_node_detach and lrg_drop
functionality.
The “lrg worker” is required here because there are a lot of steps
involved in this process like create publication, create subscription,
alter publication, lock table, etc. If there is a failure during any
of the process, the worker will be restarted and is responsible to
continue the operation from where it left off to completion.
The following new tables were added to maintain the logical
replication group related information:
-- pg_lrg_info table to maintain the logical replication group information.
CREATE TABLE lrg.pg_lrg_info
(
groupname text PRIMARY KEY, -- name of the logical replication group
pubtype text – type of publication(ALL TABLES, SCHEMA, TABLE)
currently only “ALL TABLES” is supported
);

-- pg_ lrg_nodes table to maintain the node information that are
members of the logical replication group.
CREATE TABLE lrg.pg_lrg_nodes
(
nodeid text PRIMARY KEY, -- node id (actual node_id format is
still not finalized)
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
logical replication group
dbid oid NOT NULL, -- db id
status text NOT NULL, -- status of the node
nodename text, -- node name
localconn text NOT NULL, -- local connection string
upstreamconn text – upstream connection string to connect to
another node already in the logical replication group
);

-- pg_ lrg_pub table to maintain the publications that were created
for this node.
CREATE TABLE lrg.pg_lrg_pub
(
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
logical replication group
pubid oid NOT NULL – oid of the publication
);

-- pg_lrg_sub table to maintain the subscriptions that were created
for this node.
CREATE TABLE lrg.pg_lrg_sub
(
groupname text REFERENCES pg_lrg_info(groupname), -- name of the
logical replication group
subid oid NOT NULL– oid of the subscription
);

The following functionality was added to support the various logical
replication group functionalities:
lrg_create(group_name text, pub_type text, local_connection_string
text, node_name text)
lrg _node_attach(group_name text, local_connection_string text,
upstream_connection_string text, node_name text)
lrg_node_detach(group_name text, node_name text)
lrg_drop(group_name text)
-----------------------------------------------------------------------------------------------------------------------------------

lrg_create – This function creates a logical replication group as
specified in group_name.
example:
postgres=# SELECT lrg.lrg_create('test', 'FOR ALL TABLES',
'user=postgres port=5432', 'testnode1');

This function adds a logical replication group “test” with pubtype as
“FOR ALL TABLES” to pg_lrg_info like given below:
postgres=# select * from lrg. pg_lrg_info;
groupname | pubtype
----------+------------------
test | FOR ALL TABLES
(1 row)

It adds node information which includes the node id, database id,
status, node name, connection string and upstream connection string to
pg_lrg_nodes like given below:
postgres=# select * from lrg.pg_lrg_nodes ;
nodeid | groupname |
dbid | status | nodename | localconn | upstreamconn
-------------------------------------------------------------+------+--------+-----------+-----------------------------------------+-----------------------------------------
70934590432710321605user=postgres port=5432 | test | 5 | ready |
testnode1 | user=postgres port=5432 |
(1 row)

The “lrg worker” will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will create the publication in the current node.
3) It will change the (pg_lrg_nodes) status from init to createpublication.
4) It will unlock the pg_lrg_info and pg_lrg_nodes tables
5) It will change the (pg_lrg_nodes) status from createpublication to ready.
-----------------------------------------------------------------------------------------------------------------------------------

lrg_node_attach – Attach the specified node to the specified logical
replication group.
example:
postgres=# SELECT lrg.lrg_node_attach('test', 'user=postgres
port=9999', 'user=postgres port=5432', 'testnode2')
This function adds logical replication group “test” with pubtype as
“FOR ALL TABLES” to pg_lrg_info in the new node like given below:
postgres=# select * from pg_lrg_info;
groupname | pubtype
----------+------------------
test | FOR ALL TABLES
(1 row)

This is the same group name that was added during lrg_create in the
create node. Now this information will be available in the new node
too. This information will help the user to attach to any of the nodes
present in the logical replication group.
It adds node information which includes the node id, database id,
status, node name, connection string and upstream connection string of
the current node and the other nodes that are part of the logical
replication group to pg_lrg_nodes like given below:
postgres=# select * from lrg.pg_lrg_nodes ;
nodeid | groupname |
dbid | status | nodename | localconn | upstreamconn
-------------------------------------------------------------+------+--------+-----------+-----------------------------------------+-----------------------------------------
70937999584732760095user=vignesh dbname=postgres port=9999 | test |
5 | ready | testnode2 | user=vignesh dbname=postgres port=9999 |
user=vignesh dbname=postgres port=5432
70937999523629205245user=vignesh dbname=postgres port=5432 | test |
5 | ready | testnode1 | user=vignesh dbname=postgres port=5432 |
(2 rows)

It will use the upstream connection to connect to the upstream node
and get the nodes that are part of the logical replication group.
Note: The nodeid used here is for illustrative purpose, actual nodeid
format is still not finalized.
For this API the “lrg worker” will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will connect to the upstream node specified and get the list of
other nodes present in the logical replication group.
3) It will connect to the remaining nodes and lock the database so
that no new operations are performed.
4) It will wait in the upstream node till it reaches the latest lsn of
the remaining nodes, this is somewhat similar to wait_for_catchup
function in tap tests.
5) It will change the status (pg_lrg_nodes) from init to waitforlsncatchup.
6) It will create the publication in the current node.
7) It will change the status (pg_lrg_nodes) from waitforlsncatchup to
createpublication.
8) It will create a subscription in all the remaining nodes to get the
data from new node.
9) It will change the status (pg_lrg_nodes) from createpublication to
createsubscription.
10) It will alter the publication not to replicate truncate operation.
11) It will truncate the table.
12) It will alter the publication to include sending the truncate operation.
13) It will create a subscription in the current node to subscribe the
data with copy_data force.
14) It will create a subscription in the remaining nodes to subscribe
the data with copy_data off.
15) It will unlock the database in all the remaining nodes.
16) It will unlock the pg_lrg_info and pg_lrg_nodes tables.
17) It will change the status (pg_lrg_nodes) from createsubscription to ready.

The status will be useful to display the progress of the operation to
the user and help in failure handling to continue the operation from
the state it had failed.
-----------------------------------------------------------------------------------------------------------------------------------

lrg_node_detach – detach a node from the logical replication group.
example:
postgres=# SELECT lrg.lrg_node_detach('test', 'testnode');
For this API the “lrg worker” will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables.
2) It will get the list of other nodes present in the logical replication group.
3) It will connect to the remaining nodes and lock the database so
that no new operations are performed.
4) It will drop the subscription in all the nodes corresponding to
this node of the cluster.
5) It will drop the publication in the current node.
6) It will remove all the data associated with this logical
replication group from pg_lrg_* tables.
7) It will unlock the pg_lrg_info and pg_lrg_nodes tables.
-----------------------------------------------------------------------------------------------------------------------------------

lrg_drop - drop a group from logical replication groups.
example:
postgres=# SELECT lrg.lrg_drop('test');

This function removes the group specified from the logical replication
groups. This function must be executed at the member of a given
logical replication group.
For this API the “lrg worker” will perform the following:
1) It will lock the pg_lrg_info and pg_lrg_nodes tables..
2) DROP PUBLICATION of this node that was created for this logical
replication group.
3) Remove all data from the logical replication group system table
associated with the logical replication group.
4) It will unlock the pg_lrg_info and pg_lrg_nodes tables.

If there are no objections the API can be implemented as SQL functions
in PostgreSQL core and the new tables can be created as system tables.

Thoughts?

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ilya Anfimov 2022-05-11 10:18:04 To add pg_temp schema description to schemas documentation
Previous Message Peter Eisentraut 2022-05-11 10:11:49 Re: [RFC] building postgres with meson -v8