Re: Handle infinite recursion in logical replication setup

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Subject: Re: Handle infinite recursion in logical replication setup
Date: 2022-09-01 18:39:11
Message-ID: CALDaNm0NRJ1O1cYcZD=f7NgynozFprb7zpJSayFN5rcaS44G6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 31 Aug 2022 at 11:35, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are my review comments for v43-0001.
>
> ======
>
> 1. Commit message
>
> The initial copy phase has no way to know the origin of the row data,
> so if 'copy_data = true' in the step 4 below, log a warning to notify user
> that potentially non-local data might have been copied.
>
> 1a.
> "in the step 4" -> "in step 4"

Modified

> ~
>
> 1b.
> "notify user" -> "notify the user"

Modified

> ======
>
> 2. doc/src/sgml/ref/create_subscription.sgml
>
> + <para>
> + If the subscription is created with <literal>origin = none</literal> and
> + <literal>copy_data = true</literal>, it will check if the publisher has
> + subscribed to the same table from other publishers and, if so, log a
> + warning to notify the user to check the publisher tables. Before continuing
> + with other operations the user should check that publisher tables did not
> + have data with different origins to prevent data inconsistency issues on the
> + subscriber.
> + </para>
>
> I am not sure about that wording saying "to prevent data inconsistency
> issues" because I thought maybe is already too late because any
> unwanted origin data is already copied during the initial sync. I
> think the user can do it try to clean up any problems caused before
> things become even worse (??)
>
> ~~~
>
> You asked for my thoughts (see [1] 6b):
>
> > There is nothing much a user can do in this case. Only option would be
> > to take a backup before the operation and restore it and then recreate
> > the replication setup. I was not sure if we should document these
> > steps as similar inconsistent data could get created without the
> > origin option . Thoughts?
>
> I think those cases are a bit different:
>
> - For a normal scenario (i.e. origin=ANY) then inconsistent data just
> refers to problems like maybe PK violations so I think you just get
> what you get and have to deal with the errors...
>
> - But when the user said origin=NONE they are specifically saying they
> do NOT want any non-local data, yet they still might end up getting
> data contrary to their wishes. So I think maybe there ought to be
> something documented to warn about this potentially happening. My
> first impression is that all this seems like a kind of terrible
> problem because if it happens then cleanup could be difficult - if
> that subscriber in turn was also a publisher then this bad data might
> have propagated all over the place already! Anyway, I guess all this
> could be mentioned in some (new ?) section of the
> logical-replication.sgml file (i.e. patch 0002).
>
> IDEA: I was wondering if the documentation should recommend (or maybe
> we can even enforce this) that when using origin=NONE the user should
> always create the subscription with enabled=FALSE. That way if there
> are some warnings about potential bad origin data then there it might
> be possible to take some action or change their mind *before* any
> unwanted data pollutes everything.

Updated document based on your suggestion.

> ======
>
> 3. src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed
>
> + if (count == 0)
> + tablenames = makeStringInfo();
> + else
> + appendStringInfoString(tablenames, ", ");
> +
> + appendStringInfo(tablenames, "%s.%s", nspname, relname);
> + count++;
>
> I think the quotes are not correct - each separate table name should be quoted.

We will not be displaying tables anymore, the comment does not apply anymore

> ~~~
>
> 4.
>
> + /*
> + * There might be many tables present in this case, we will display a
> + * maximum of 100 tables followed by "..." to indicate there might be
> + * more tables.
> + */
> + if (count == 100)
> + {
> + appendStringInfoString(tablenames, ", ...");
> + break;
> + }
>
> 4a.
> IMO this code should be part of the previous if/else
>
> e.g.
> if (count == 0) ... makeStringInfo()
> else if (count > 100) ... append ellipsis "..." and break out of the loop
> else ... append table name to the message
>
> Doing it in this way means "..." definitely means there are more than
> 100 bad tables.

This code is removed, the comment is no more applicable.

> ~
>
> 4b.
> Changing like above (#4a) simplifies the comment because it removes
> doubts like "might be…" since you already know you found the 101st
> table.
>
> SUGGESTION (comment)
> The message displays a maximum of 100 tables having potentially
> non-local data. Any more than that will be indicated by "...".

This code is removed, the comment is no more applicable.

> ~
>
> 4c.
> It still seems a bit disconcerting to me that there can be cases where
> bad data was possibly copied but there is no record of what tables
> have been polluted. Perhaps when the maximum bad tables are exceeded
> then there can be some additional message to tell users what SQL they
> can use to discover what all the other bad tables were.

We will not be displaying the tablename anymore, we will be displaying
the publication names which has been subscribed from other
publications.

> ~~~
>
> 5.
>
> + * XXX: For simplicity, we don't check whether the table has any data
> + * or not. If the table doesn't have any data then we don't need to
> + * distinguish between data having origin and data not having origin so
> + * we can avoid throwing a warning in that case.
>
> "throwing a warning" -> "logging a warning"

Modified

> ~~~
>
> 6.
>
> + errdetail("Subscribed table(s) \"%s\" has been subscribed from some
> other publisher.",
> + tablenames->data),
>
> I don't think the table name list should be quoted in the errdetail,
> because each of the individual table names should have been quoted.
> See previous review comment #3.

Modified

> ======
>
> 7. src/test/subscription/t/030_origin.pl
>
> +###############################################################################
> +# Specify origin as 'none' which indicates that the publisher should only
> +# replicate the changes that are generated locally from node_B, but in
> +# this case since the node_B is also subscribing data from node_A, node_B can
> +# have remotely originated data from node_A. We log a warning, in this case,
> +# to draw attention to there being possible remote data.
> +###############################################################################
>
> This comment wording makes it seem like this is using different
> subscription parameters from the other tests that preceded it, but in
> fact the prior tests also been using origin=none. (??)
>
> Maybe it just needs a trivial rewording.
>
> SUGGESTION
> Specifying origin=NONE indicates that...

Modified

> ~~~
>
> 8.
>
> +like(
> + $stderr,
> + qr/WARNING: ( [A-Z0-9]+:)? subscription tap_sub_a_b_2 requested
> origin=NONE but might copy data that had a different origin/,
> + "Create subscription with origin = none and copy_data when the
> publisher has subscribed same table"
> +);
>
> Is it possible for these checks of the log file to get the errdetail
> message too? Then the test will be more readable for the expected
> result and you will also have the bonus of testing the table-name list
> in the warning more thoroughly.

I did not see any tap test doing this, I did not want to make some
change which would fail in buildfarm machines. I have not done any
change for this.

Thanks for the comments, the attached v44 patch has the changes for the same.

Regards,
Vignesh

Attachment Content-Type Size
v44-0001-Check-and-log-a-warning-if-the-publisher-has-sub.patch application/octet-stream 21.9 KB
v44-0002-Document-the-steps-for-replication-between-prima.patch application/octet-stream 18.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2022-09-01 18:44:09 Re: Handle infinite recursion in logical replication setup
Previous Message Andres Freund 2022-09-01 18:35:40 Re: Tracking last scan time