RE: Data is copied twice when specifying both child and parent table in publication

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Date: 2022-08-30 07:43:06
Message-ID: OS3PR01MB62752A88C41234061730EC409E799@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tues, Aug 9, 2022 at 15:15 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> Here are some review comment for the HEAD_v8 patch:

Thanks for your comments.

> 1. Commit message
>
> If there are two publications, one of them publish a parent table with
> (publish_via_partition_root = true) and another publish child table,
> subscribing to both publications from one subscription results in two initial
> replications. It should only be copied once.
>
> SUGGESTION (Note**)
> If there are two publications - one of them publishing a parent table
> (using publish_via_partition_root = true) and the other is publishing
> one of the parent's child tables - then subscribing to both
> publications from one subscription results in the same initial child
> data being copied twice. It should only be copied once.
>
>
> Note** - I've only reworded the original commit message slightly but
> otherwise left it saying the same thing. But I still have doubts if
> this message actually covers all the cases the patch is trying to
> address. e.g. The comment (see below) in the 'fetch_table_list'
> function seemed to cover more cases than what this commit message is
> describing.
> /*
> * Get the list of tables from publisher, the partition table whose
> * ancestor is also in this list will be ignored, otherwise the initial
> * data in the partition table would be replicated twice.
> */

=> commit message
Changed.

=> Note**
I think the commit message and the comment you mentioned refer to the same kind
of scenario.

> 2. src/backend/catalog/pg_publication.c - pg_get_publication_tables
>
> 2a.
> /*
> - * Returns information of tables in a publication.
> + * Returns information of the tables in the given publication array.
> */
>
> What does "information of the tables" actually mean? Is it just the
> names of the tables; is it more than that? IMO a longer, more
> explanatory comment will be better here instead of a brief ambiguous
> one.

Changed as below:
```
Get information of the tables in the given publication array.

Returns the oid, column list, row filter for each table.
```

> 2b.
> + *results = NIL;
>
> This variable name 'results' is too generic, so it is not helpful when
> trying to understand the subsequent code logic. Please give this a
> meaningful name/description.

Changed the variable name as below:
results -> table_infos

> 2c.
> /* Deconstruct the parameter into elements. */
>
> SUGGESTION
> Deconstruct the parameter into elements where each element is a
> publication name.

Changed.

> 2d.
> + List *current_tables = NIL;
>
> I think this is the tables only on the current pub_elem, so I thought
> 'pub_elem_tables' might make it easier to understand this list's
> meaning.

Changed.

> 2e.
> + /* Now sort and de-duplicate the result list */
> + list_sort(tables, list_oid_cmp);
> + list_deduplicate_oid(tables);
>
> IMO this comment is confusing because there is another list that is
> called the 'results' list, but that is not the same list you are
> processing here. Also, it does not really add anything helpful to just
> have comments saying almost the same as the function names
> (sort/de-duplicate), so please give longer comments to say the reason
> *why* the logic does this rather than just describing the steps.

Fixed the comment. ("result" -> "tables")
I think the purpose of these two functions is clear. The reason I added the
comment here is for consistency with other calling locations.

> 2f.
> + /* Filter by final published table */
> + foreach(lc, results)
>
> Please make this comment more descriptive to explain why/what the
> logic is doing.

Changed as below:
```
For tables that have been filtered out, delete the corresponding
table information in the table_infos list.
```

> 3. src/backend/commands/subscriptioncmds.c - fetch_table_list
>
> 3a.
> + bool check_columnlist = (server_version >= 150000);
>
> Given the assignment, maybe 'columnlist_supported' is a better name?

I am not sure if this name could be changed in this thread.

> 3b.
> + /* Get information of the tables belonging to the specified publications */
>
> For "Get information of..." can you elaborate *what* table
> information this is getting and why?

I am not sure if we need to add a reason.
So, I only added what information we are going to get:
```
Get namespace, relname and column list (if supported) of the tables
belonging to the specified publications.
```

> 3c.
> + if (server_version >= 160000)
> + appendStringInfo(&cmd, "SELECT DISTINCT N.nspname, C.relname,\n"
> + " ( CASE WHEN (array_length(GPT.attrs, 1) = C.relnatts)\n"
> + " THEN NULL ELSE GPT.attrs END\n"
> + " ) AS attnames\n"
> + " FROM pg_class C\n"
> + " JOIN pg_namespace N ON N.oid = C.relnamespace\n"
> + " JOIN ( SELECT (pg_get_publication_tables(VARIADIC
> array_agg(pubname::text))).*\n"
> + " FROM pg_publication\n"
> + " WHERE pubname IN ( %s )) as GPT\n"
> + " ON GPT.relid = C.oid\n",
> + pub_names.data);
>
> AFAICT the main reason for this check was to decide if you can use the
> new version of 'pg_get_publication_tables' that supports the VARIADIC
> array of pub names or not. If that is correct, then maybe the comment
> should describe that reason, or maybe add another bool var similar to
> the 'check_columnlist' one for this.

I added the comment as following:
```
From version 16, the parameter of the function pg_get_publication_tables
can be an array of publications. The partition table whose ancestor is
also published in this publication array will be filtered out in this
function.
```

I also rebased the REL_15 patch based on the commit (15014b8), and made some
changes to the back-branch patches based on Peter's suggestions.

Attach the new patches.

Regards,
Wang wei

Attachment Content-Type Size
HEAD_v9-0001-Fix-data-replicated-twice-when-specifying-publish.patch application/octet-stream 18.6 KB
REL15_v9-0001-Fix-data-replicated-twice-when-specifying-publish_patch application/octet-stream 8.4 KB
REL14_v9-0001-Fix-data-replicated-twice-when-specifying-publish_patch application/octet-stream 5.3 KB
REL13_v9-0001-Fix-data-replicated-twice-when-specifying-publish_patch application/octet-stream 5.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Benoit Lobréau 2022-08-30 07:49:20 Re: archive modules
Previous Message Kyotaro Horiguchi 2022-08-30 07:39:45 Re: pg_rewind WAL segments deletion pitfall