RE: Data is copied twice when specifying both child and parent table in publication

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, vignesh C <vignesh21(at)gmail(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Date: 2023-03-22 10:07:49
Message-ID: OS3PR01MB6275A98C992F4DCDCD75578D9E869@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 22, 2023 at 12:50 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> Here are some review comments for patch code of HEAD_v19-0001

Thanks for your comments.

> ======
> doc/src/sgml/ref/create_publication.sgml
>
> 1.
> + <para>
> + There can be a case where a subscription combines multiple
> + publications. If a root partitioned table is published by any
> + subscribed publications which set
> + <literal>publish_via_partition_root</literal> = true, changes on this
> + root partitioned table (or on its partitions) will be published using
> + the identity and schema of this root partitioned table rather than
> + that of the individual partitions.
> + </para>
>
> 1a.
> The paragraph prior to this one just refers to "partitioned tables"
> instead of "root partitioned table", so IMO we should continue with
> the same terminology.

Changed.

> I also modified the remaining text slightly. AFAIK my suggestion
> conveys exactly the same information but is shorter.
>
> SUGGESTION
> There can be a case where one subscription combines multiple
> publications. If any of those publications has set
> <literal>publish_via_partition_root</literal> = true, then changes in
> a partitioned table (or on its partitions) will be published using the
> identity and schema of the partitioned table.

Sorry, I'm not sure about this.
I'm not a native speaker of English, but it seems like the following use case is
not explained:
```
create table t1 (a int primary key) partition by range (a);
create table t2 (a int primary key) partition by range (a);
create table t3 (a int primary key);
alter table t1 attach partition t2 default;
alter table t2 attach partition t3 default;

create publication p1 for table t1;
create publication p2_via for table t2 with(publish_via_partition_root);
create publication p3 for table t3;
```
If we subscribe to p1, p2_via and p3 at the same time, then t2's identity and
schema will be used instead of t1's (and of course not t3's).

> ~
>
> 1b.
> Shouldn't that paragraph (or possibly somewhere in the CREATE
> SUBSCRIPTION notes?) also explain that in this scenario the logical
> replication will only publish one set of changes? After all, this is
> the whole point of the patch, but I am not sure if the user will know
> of this behaviour from the current documentation.

It seems to me that what you're explaining is what users expect. So, it seems we
don't need to explain it.
BTW IIUC, when user wants to use the "via_root" option, they should first read
the pg-doc to confirm the meaning and related notes of this option. So, I'm not
sure if adding this section in other documentation would be redundant.

> ======
> src/backend/catalog/pg_publication.c
>
> 2. filter_partitions
>
> BEFORE:
> static void
> filter_partitions(List *table_infos)
> {
> ListCell *lc;
>
> foreach(lc, table_infos)
> {
> bool skip = false;
> List *ancestors = NIL;
> ListCell *lc2;
> published_rel *table_info = (published_rel *) lfirst(lc);
>
> if (get_rel_relispartition(table_info->relid))
> ancestors = get_partition_ancestors(table_info->relid);
>
> foreach(lc2, ancestors)
> {
> Oid ancestor = lfirst_oid(lc2);
>
> /* Is ancestor exists in the published table list? */
> if (is_ancestor_member_tableinfos(ancestor, table_infos))
> {
> skip = true;
> break;
> }
> }
>
> if (skip)
> table_infos = foreach_delete_current(table_infos, lc);
> }
> }
>
> ~
>
> 2a.
> My previous review [1] (see #1) suggested putting most code within the
> condition. AFAICT my comment still is applicable but was not yet
> addressed.

Personally, I prefer the current style because the approach you mentioned adds
some indentations.

> 2b.
> IMO the comment "/* Is ancestor exists in the published table list?
> */" is unnecessary because it is already clear what is the purpose of
> the function named "is_ancestor_member_tableinfos".

Removed.

> ======
> src/backend/commands/subscriptioncmds.c
>
> 3.
> fetch_table_list(WalReceiverConn *wrconn, List *publications)
> {
> WalRcvExecResult *res;
> - StringInfoData cmd;
> + StringInfoData cmd,
> + pub_names;
> TupleTableSlot *slot;
> Oid tableRow[3] = {TEXTOID, TEXTOID, NAMEARRAYOID};
> List *tablelist = NIL;
> - bool check_columnlist = (walrcv_server_version(wrconn) >= 150000);
> + int server_version = walrcv_server_version(wrconn);
> + bool check_columnlist = (server_version >= 150000);
> +
> + initStringInfo(&pub_names);
> + get_publications_str(publications, &pub_names, true);
>
> initStringInfo(&cmd);
> - appendStringInfoString(&cmd, "SELECT DISTINCT t.schemaname, t.tablename
> \n");
>
> - /* Get column lists for each relation if the publisher supports it */
> - if (check_columnlist)
> - appendStringInfoString(&cmd, ", t.attnames\n");
> + /* Get the list of tables from the publisher. */
> + if (server_version >= 160000)
> + {
>
> ~
>
> I think the 'pub_names' is only needed within that ">= 160000" condition.
>
> So all the below code can be moved into that scope can't it?
>
> + StringInfoData pub_names;
> + initStringInfo(&pub_names);
> + get_publications_str(publications, &pub_names, true);
>
> + pfree(pub_names.data);

Changed.

Also, I've run pgindent for the new patch set.
Attach the new patch set.

Regards,
Wang Wei

Attachment Content-Type Size
HEAD_v20-0001-Fix-data-replicated-twice-when-specifying-publis.patch application/octet-stream 24.3 KB
HEAD_v20-0002-Fix-this-problem-for-back-branches.patch application/octet-stream 3.6 KB
REL15_v20-0001-Fix-data-replicated-twice-when-specifying-publis_patch application/octet-stream 10.0 KB
REL14_v20-0001-Fix-data-replicated-twice-when-specifying-publis_patch application/octet-stream 6.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message wangw.fnst@fujitsu.com 2023-03-22 10:09:05 RE: Data is copied twice when specifying both child and parent table in publication
Previous Message Daniel Gustafsson 2023-03-22 10:01:39 Re: Avoid use deprecated Windows Memory API