Re: Handle infinite recursion in logical replication setup

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Handle infinite recursion in logical replication setup
Date: 2022-04-08 07:40:05
Message-ID: CALDaNm2-fkBfgv16xJdd=bNzJrLTEd2Y=ySEM3CczEnvP6=N0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 5, 2022 at 8:17 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are my comments for the latest patch v6-0002.
>
> PATCH v6-0002 comments
> ======================
>
> 2.1 General - should this be an independent patch?
>
> In many ways, I think most of this patch is unrelated to the other
> "local_only" patch (v6-0001).
>
> For example, IIUC even in the current HEAD, we could consider it to be
> a user error if multiple SUBSCRIPTIONS or multiple PUBLICATIONS of the
> same SUBSCRIPTION are replicating to the same TABLE on the same node
> and using "copy_data = on".
>
> So I think it would be ok to throw an ERROR if such a copy_data clash
> is detected, and then the user will have to change to use "copy_data =
> off" for some/all of them to avoid data duplication.
>
> The "local_only" option only adds some small logic to this new ERROR,
> but it's not really a prerequisite at all.
>
> e.g. this whole ERROR part of the patch can be a separate thread.

As Amit also pointed out, the patches have some dependency, I will
keep it as it is.

> ~~~
>
> 2.2 General - can we remove the "force" enum?
>
> Now, because I consider the clashing "copy_data = on" ERROR to be a
> user error, I think that is something that the user can already take
> care of themselves just using the copy_data = off.
>
> I did not really like the modifying of the "copy_data" option from
> just boolean to some kind of hybrid boolean + "force".
>
> a) It smells a bit off to me. IMO replication is supposed to end up
> with the same (replicated) data on the standby machine but this
> "force" mode seems to be just helping the user to break that concept
> and say - "I know what I'm doing, and I don't care if I get lots of
> duplicated data in the replica table - just let me do it"...
>
> b) It also somehow feels like the new "force" was introduced mostly to
> make the code ERROR handling implementation simpler, rather than to
> make the life of the end-user better. Yes, if force is removed maybe
> the copy-clash-detection-code will need to be internally quite more
> complex than it is now, but that is how it should be, instead of
> putting extra burden on the user (e.g. by complicating the PG docs and
> giving them yet more alternatives to configure). I think any clashing
> copy_data options really is a user error, but also I think the current
> boolean copy_data true/false already gives the user a way to fix it.
>
> c) Actually, your new error hint messages are similar to my
> perspective: They already say "Use CREATE/ALTER SUBSCRIPTION with
> copy_data = off or force". All I am saying is remove the "force", and
> the user can still fix the problem just by using "copy_data = off"
> appropriately.

When a user is trying to add a node to bidirectional replication
setup, the user will use the force option to copy the data from one of
the nodes and use off to skip copying the data from other nodes. This
option will be used while adding nodes to bidirectional replication,
the same has been documented with examples in the patch. I felt we
should retain this option.

> ======
>
> So (from above) I am not much in favour of the copy_data becoming a
> hybrid enum and using "force", yet that is what most of this patch is
> implementing. Anyway, the remainder of my review comments here are for
> the code in its current form. Maybe if "force" can be removed most of
> the following comments end up being redundant.
>
> ======
>
> 2.3 Commit message - wording
>
> This message is difficult to understand.
>
> I think that the long sentence "Now when user is trying..." can be
> broken into more manageable parts.

Slightly modified

> This part "and throw an error so that user can handle the initial copy
> data." also seemed a bit vague.

I have given the reference to the documentation section of the patch
for initial data copy handling

> ~~~
>
> 2.4 Commit message - more functions
>
> "This patch does couple of things:"
>
> IIUC, there seems a third thing implemented by this patch but not
> described by the comment. I think it also adds support for ALTER
> SUBSCRIPTION SET PUBLICATION WITH (subscribe_local_only)

This is part of 0001 patch, I felt this should not be part of 002 patch commit.

> ~~~
>
> 2.5 doc/src/sgml/ref/create_subscription.sgml - wording
>
> @@ -161,6 +161,13 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
> the replicated changes that was generated from other nodes. The
> default is <literal>false</literal>.
> </para>
> + <para>
> + If the tables in the publication were also subscribing to the data in
> + the publisher from other publishers, it will affect the
> + <command>CREATE SUBSCRIPTION</command> based on the value specified
> + for <literal>copy_data</literal> option. Refer to the
> + <xref linkend="sql-createsubscription-notes" /> for details.
> + </para>
>
> Is there is a simpler way to express all that?
>
> SUGGESTION
> There is some interation between the option "subscribe_local_only" and
> option "copy_data". Refer to the <xref
> linkend="sql-createsubscription-notes" /> for details.

Modified

> ~~~
>
> 2.6 doc/src/sgml/ref/create_subscription.sgml - whitespace
>
> @@ -213,18 +220,28 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
> </varlistentry>
>
> <varlistentry>
> - <term><literal>copy_data</literal> (<type>boolean</type>)</term>
> + <term><literal>copy_data</literal> (<type>enum</type>)</term>
> <listitem>
> <para>
> Specifies whether to copy pre-existing data in the publications
> - that are being subscribed to when the replication starts.
> - The default is <literal>true</literal>.
> + that are being subscribed to when the replication starts. This
> + parameter may be either <literal>true</literal>,
> + <literal>false</literal> or <literal>force</literal>. The default is
> + <literal>true</literal>.
>
> That last line has trailing whitespace.

Modified

> ~~~
>
> 2.7 doc/src/sgml/ref/create_subscription.sgml - wording
>
> + <para>
> + If the tables in the publication were also subscribing to the data in
> + the publisher from other publishers, it will affect the
> + <command>CREATE SUBSCRIPTION</command> based on the value specified
> + for <literal>subscribe_local_only</literal> option. Refer to the
> + <xref linkend="sql-createsubscription-notes" /> for details.
> + </para>
>
> This is similar to review comment #2.5 which I thought could be
> written in a simpler way.

Modified

> ~~~
>
> 2.8 doc/src/sgml/ref/create_subscription.sgml
>
> @@ -375,6 +392,42 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
> can have non-existent publications.
> </para>
>
> + <para>
> + Let's consider an existing Multi master logical replication setup between
> + Node1 and Node2 that is created using the following steps:
> + a) Node1 - Publication publishing employee table.
> + b) Node2 - Subscription subscribing from publication pub1 with
> + <literal>subscribe_local_only</literal>.
> + c) Node2 - Publication publishing employee table.
> + d) Node1 - Subscription subscribing from publication pub2 with
> + <literal>subscribe_local_only</literal>.
> + Now when user is trying to add another node Node3 to the above Multi master
> + logical replication setup, user will have to create one subscription
> + subscribing from Node1 and another subscription subscribing from Node2 in
> + Node3 using <literal>subscribe_local_only</literal> option and
> + <literal>copy_data</literal> as <literal>true</literal>, while
> + the subscription is created, server will identify that Node2 is subscribing
> + from Node1 and Node1 is subscribing from Node2 and throw an error like:
> + <programlisting>
> +postgres=# CREATE SUBSCRIPTION mysub CONNECTION 'host=192.168.1.50
> port=5432 user=foo dbname=foodb'
> + PUBLICATION mypublication, insert_only with
> (subscribe_local_only=on);
> +ERROR: CREATE/ALTER SUBSCRIPTION with subscribe_local_only and
> copy_data as true is not allowed when the publisher might have
> replicated data, table:public.t1 might have replicated data in the
> publisher
> +HINT: Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force
> + </programlisting>
> + In this scenario user can solve this based on one of the 2 possibilities,
> + a) If there are no data present in Node1 and Node2, then the user can create
> + the subscriptions to Node1 and Node2 with
> + <literal>subscribe_local_only</literal> as <literal>true</literal> and
> + <literal>copy_data</literal> as <literal>false</literal>. b) If the data is
> + present, then the user can create subscription with
> + <literal>copy_data</literal> as <literal>force</literal> on Node1 and
> + <literal>copy_data</literal> as <literal>false</literal> on Node2, before
> + allowing any operations on the respective tables of Node1 and Node2, in this
> + case <literal>copy_data</literal> is <literal>false</literal> on Node2
> + because the data will be replicated to each other and available on both the
> + nodes.
> + </para>
> +
>
> That is a large slab of text in the Notes, so not very easy to digest it.
>
> I'm not sure what to suggest for this -
> - Perhaps the a,b,c,d should all be "lists" so it renders differently?
> - It almost seems like too much information to be included in this
> "Notes" section, Maybe it needs its own full page in PG Docs "Logical
> Replication" to discuss this topic.

This is moved to the new page "BiDirectional logical replication Quick
Setup" page now.

> ~~~
>
> 2.9 src/backend/commands/subscriptioncmds.c - IS_COPY_DATA_VALID
>
> @@ -69,6 +69,18 @@
> /* check if the 'val' has 'bits' set */
> #define IsSet(val, bits) (((val) & (bits)) == (bits))
>
> +#define IS_COPY_DATA_VALID(copy_data) (copy_data != COPY_DATA_OFF)
>
> The macro seems misnamed because "off" is also "valid". It seems like
> it should be called something different like IS_COPY_DATA, or
> IS_COPY_DATA_ON_OR_FORCE, etc.

Modified

> ~~~
>
> 2.10 src/backend/commands/subscriptioncmds.c - DefGetCopyData
>
> + /*
> + * The set of strings accepted here should match up with
> + * the grammar's opt_boolean_or_string production.
> + */
> + if (pg_strcasecmp(sval, "true") == 0)
> + return COPY_DATA_ON;
> + if (pg_strcasecmp(sval, "false") == 0)
> + return COPY_DATA_OFF;
> + if (pg_strcasecmp(sval, "on") == 0)
> + return COPY_DATA_ON;
> + if (pg_strcasecmp(sval, "off") == 0)
> + return COPY_DATA_OFF;
> + if (pg_strcasecmp(sval, "force") == 0)
> + return COPY_DATA_FORCE;
>
> Maybe combine the return for true/on and false/off?

Modified

> ~~~
>
> 2.11 src/backend/commands/subscriptioncmds.c - fetch_table_list
>
> + appendStringInfoString(&cmd,
> + "SELECT DISTINCT N.nspname AS schemaname, C.relname AS tablename,
> PS.srsubstate as replicated\n"
> + "FROM pg_publication P,\n"
> + "LATERAL pg_get_publication_tables(P.pubname) GPT\n"
> + "LEFT JOIN pg_subscription_rel PS ON (GPT.relid = PS.srrelid),\n"
> + "pg_class C JOIN pg_namespace N ON (N.oid = C.relnamespace)\n"
> + "WHERE C.oid = GPT.relid AND P.pubname in (");
> +
>
> That blank line is not needed (it was not there previously) because
> the next get_publications_str is a continuation of this SQL.

Modified

> ~~~
>
> 2.12 src/backend/commands/subscriptioncmds.c - fetch_table_list comment
>
> + * It is quite possible that subscriber has not yet pulled data to
> + * the tables, but in ideal cases the table data will be subscribed.
> + * Too keep the code simple it is not checked if the subscriber table
> + * has pulled the data or not.
> + */
>
> Typo "Too" -> "To"

Modified

> ~~~
>
> 2.13 src/backend/commands/subscriptioncmds.c - force
>
> + if (copydata == COPY_DATA_ON && only_local && !slot_attisnull(slot, 3))
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("CREATE/ALTER SUBSCRIPTION with subscribe_local_only and
> copy_data as true is not allowed when the publisher might have
> replicated data, table:%s.%s might have replicated data in the
> publisher",
> + nspname, relname),
> + errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off or force"));
>
> AFAIK this is the only code of this patch that is distinguishing
> between "force" and the "on". As I wrote at the beginning of this post
> I feel you can keep this ERROR (maybe the way to detect it needs to be
> more complex) and the user can fix it just by using "copy_data = off"
> appropriately. So then copy_data does not need to be changed.

When a user is trying to add a node to bidirectional replication
setup, the user will use the force option to copy the data from one of
the nodes and use off to skip copying the data from other nodes. This
option will be used while adding nodes to bidirectional replication,
the same has been documented with examples in the patch. I felt we
should retain this option.

> ~~~
>
> 2.14 src/test/regress/sql/subscription.sql - missing tests
>
> The new copy_data is not really an enum true/false/force like the PG
> docs claims.
>
> It seems more like some kind of hybrid boolean+force. So if that is
> how it is going to be then there are missing test cases to make sure
> that values like "on"/"off"/"0"/"1" still are working.

Added tests

> ~~~
>
> 2.15 src/test/subscription/t/032_circular.pl
>
> @@ -65,6 +69,25 @@ $node_A->safe_psql('postgres', "
> PUBLICATION tap_pub_B
> WITH (subscribe_local_only = on, copy_data = off)");
>
> +($result, $stdout, $stderr) = $node_A->psql('postgres', "
> + CREATE SUBSCRIPTION tap_sub_A1
> + CONNECTION '$node_B_connstr application_name=$appname_A'
> + PUBLICATION tap_pub_B
> + WITH (subscribe_local_only = on, copy_data = on)");
> +like(
> + $stderr,
> + qr/ERROR: CREATE\/ALTER SUBSCRIPTION with
> subscribe_local_only and copy_data as true is not allowed when the
> publisher might have replicated data/,
> + "Create subscription with subscribe_local_only and copy_data
> having replicated table in publisher");
> +
> +$node_A->safe_psql('postgres', "
> + CREATE SUBSCRIPTION tap_sub_A2
> + CONNECTION '$node_B_connstr application_name=$appname_A'
> + PUBLICATION tap_pub_B
> + WITH (subscribe_local_only = on, copy_data = force)");
> +
> +$node_A->safe_psql('postgres', "
> + DROP SUBSCRIPTION tap_sub_A2");
> +
>
> Maybe underneath it is the same, but from the outside, this looks like
> a slightly different scenario from what is mentioned everywhere else
> in the patch.
>
> I think it would be better to create a new Node_C (aka Node3) so then
> the TAP test can use the same example that you give in the commit
> message and the PG docs notes.

Modified the tests to include Node_C.

Thanks for the comments, the v7 patch attached at [1] has the fixes for the same
[1] - https://www.postgresql.org/message-id/CALDaNm1ei%3DrRwCBKWtUu8b5OsS6FFcvaxg9h0oXcjgFn8GoZnQ%40mail.gmail.com

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-04-08 07:46:45 Re: BufferAlloc: don't take two simultaneous locks
Previous Message vignesh C 2022-04-08 07:30:04 Re: Handle infinite recursion in logical replication setup