RE: Data is copied twice when specifying both child and parent table in publication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Date: 2021-11-11 06:52:40
Message-ID: OS0PR01MB5716B73A9B3C87466CD08A1D94949@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Friday, November 5, 2021 11:20 AM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
>On Thu, Nov 4, 2021 at 7:10 PM Amit Kapila <mailto:amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Thu, Nov 4, 2021 at 12:23 PM Greg Nancarrow <mailto:gregn4422(at)gmail(dot)com> wrote:
>> >
>> > On Thu, Nov 4, 2021 at 3:13 PM Amit Kapila <mailto:amit(dot)kapila16(at)gmail(dot)com> wrote:
>> > >
>> > > On further thinking about this, I think we should define the behavior
>> > > of replication among partitioned (on the publisher) and
>> > > non-partitioned (on the subscriber) tables a bit more clearly.
>> > >
>> > > - If the "publish_via_partition_root" is set for a publication then we
>> > > can always replicate to the table with the same name as the root table
>> > > in publisher.
>> > > - If the "publish_via_partition_root" is *not* set for a publication
>> > > then we can always replicate to the tables with the same name as the
>> > > non-root tables in publisher.
>> > >
>> > > Thoughts?
>> > >
>> >
>> > I'd adjust that wording slightly, because "we can always replicate to
>> > ..." sounds a bit vague, and saying that an option is set or not set
>> > could be misinterpreted, as the option could be "set" to false.
>> >
>> > How about:
>> >
>> > - If "publish_via_partition_root" is true for a publication, then data
>> > is replicated to the table with the same name as the root (i.e.
>> > partitioned) table in the publisher.
>> > - If "publish_via_partition_root" is false (the default) for a
>> > publication, then data is replicated to tables with the same name as
>> > the non-root (i.e. partition) tables in the publisher.
>> >
>>
>> Sounds good to me. If we follow this then I think the patch by Hou-San
>> is good to solve the first problem as described in his last email [1]?
>>
>> [1] - https://www.postgresql.org/message-id/OS0PR01MB5716C756312959F293A822C794869%40OS0PR01MB5716.jpnprd01.prod.outlook.com
>>
>
>Almost.
>The patch does seem to solve that first problem (double publish on tablesync).
>I used the following test (taken from [2]), and variations of it:
>
>However, there did still seem to be a problem, if publish_via_partition_root is then set to false; it seems that can result in
>duplicate partition entries in the pg_publication_tables view, see below (this follows on from the test scenario given above):
>
>postgres=# select * from pg_publication_tables;
> pubname | schemaname | tablename
>---------+------------+-----------
> pub1 | sch1 | tbl1
> pub1 | sch3 | t1
>(2 rows)
>
>postgres=# alter publication pub1 set (publish_via_partition_root=false);
>ALTER PUBLICATION
>postgres=# select * from pg_publication_tables;
> pubname | schemaname | tablename
>---------+------------+------------
> pub1 | sch2 | tbl1_part1
> pub1 | sch2 | tbl1_part2
> pub1 | sch2 | tbl1_part1
> pub1 | sch3 | t1
>(4 rows)
>
>So I think the patch would need to be updated to prevent that.

Thanks for testing the patch.

The reason of the duplicate output is that:
The existing function GetPublicationRelations doesn't de-duplicate the output
oid list. So, when adding both child and parent table to the
publication(pubviaroot = false), the pg_publication_tables view will output
duplicate partition.

Attach the fix patch.
0001 fix data double publish(first issue in this thread)
0002 fix duplicate partition in view pg_publication_tables(reported by greg when testing the 0001 patch)

About the fix for second issue in this thread.
> "I think one possible idea to investigate is that on the
> subscriber-side, after fetching tables, we check the already
> subscribed tables and if the child tables already exist then we ignore
> the parent table and vice versa."

When looking into how to fix the second issue, I have a question:

After changing publish_via_partition_root from false to true, the
subcriber will fetch the partitioned table from publisher when refreshing.

In subsriber side, If all the child tables of the partitioned table already
subscribed, then we can just skip the table sync for the partitioned table. But
if only some of the child tables(not all child tables) were already subscribed,
should we skip the partitioned table's table sync ? I am not sure about the
appropriate behavior here.

What do you think ?

Best regards,
Hou zj

Attachment Content-Type Size
v5-0002-fix-duplicate-table-in-pg_publication_tables.patch application/octet-stream 2.8 KB
v5-0001-fix-data-double-published.patch application/octet-stream 4.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-11-11 08:06:31 Re: Parallel vacuum workers prevent the oldest xmin from advancing
Previous Message vignesh C 2021-11-11 06:43:49 Re: Printing backtrace of postgres processes