Data is copied twice when specifying both child and parent table in publication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Subject: Data is copied twice when specifying both child and parent table in publication
Date: 2021-10-15 11:22:50
Message-ID: OS0PR01MB57167F45D481F78CDC5986F794B99@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In another logical replication related thread[1], my colleague Greg found that
if publish_via_partition_root is true, then the child table's data will be
copied twice when adding both child and parent table to the publication.

Example:

-----
Pub:
create table tbl1 (a int) partition by range (a);
create table tbl1_part1 partition of tbl1 for values from (1) to (10);
create table tbl1_part2 partition of tbl1 for values from (10) to (20);
create publication pub for table tbl1, tbl1_part1 with (publish_via_partition_root=on);

insert into tbl1_part1 values(1);

Sub:
create table tbl1 (a int) partition by range (a);
create table tbl1_part1 partition of tbl1 for values from (1) to (10);
create table tbl1_part2 partition of tbl1 for values from (10) to (20);
create subscription sub CONNECTION 'dbname=postgres port=10000' publication pub;

-- data is copied twice
select * from tbl1_part1;
a
---
1
1
-----

The reason is that the subscriber will fetch the table list from publisher
using the following sql[2] and the subscriber will execute table
synchronization for each table in the query results in this case. But
tbl1_part1 is a partition of tbl1, so the data of tbl1_part1 was copied twice.

[2]
select * from pg_publication_tables;
pubname | schemaname | tablename
---------+------------+------------
pub | public | tbl1
pub | public | tbl1_part1

IMO, it looks like a bug and it's more natural to only execute the table
synchronization for the parent table in the above case. Because as the document
said: if publish_via_partition_root is true, "changes in a partitioned table
(or on its partitions) contained in the publication will be published using the
identity and schema of the partitioned table rather than that of the individual
partitions that are actually changed;"

To fix it, I think we should fix function GetPublicationRelations which
generate data for the view pg_publication_tables and make it only show the
parent table if publish_via_partition_root is true. And for other future
feature like schema level publication, we can also follow this to exclude
partitions if their parent is specified by FOR TABLE in the same publication.

Attach a patch to fix it.
Thoughts ?

[1] https://www.postgresql.org/message-id/CAJcOf-eBhDUT2J5zs8Z0qEMiZUdhinX%2BbuGX3GN4V83fPnZV3Q%40mail.gmail.com

Best regards,
Hou zhijie

Attachment Content-Type Size
0001-fix-double-publish.patch application/octet-stream 7.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2021-10-15 11:44:12 Re: Unbounded %s in sscanf
Previous Message torikoshia 2021-10-15 10:12:26 Re: RFC: Logging plan of the running query